I encountered a problem when using the Megatron pipeline. The function I am using is forward_backward_pipelining_without_interleaving. In this pipeline function, each pipeline stage calls forward_step for the forward pass:
output_tensor = forward_step(forward_step_func, data_iterator, model, input_tensor, losses_reduced)
I expected the input to each stage's forward pass to be the output tensor from the previous stage. However, in megatron/schedules.py, the forward_step function is defined as follows:
unwrapped_model.set_input_tensor(input_tensor)
output_tensor, loss_func = forward_step_func(data_iterator, model)
This suggests that every stage still reads a batch from the data iterator during the forward pass, which seems to contradict the idea of pipelining, where only the first stage should consume the input data. Could you please explain the rationale behind this design?
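For reference, my current understanding is shown in the toy sketch below (self-contained, not Megatron's actual classes; the class name, shapes, and is_first_stage flag are made up for illustration): set_input_tensor lets a non-first stage swap in the activation received from the previous stage, while the data iterator may still be consumed on every stage for auxiliary fields such as masks and labels.

import torch
import torch.nn as nn

class ToyPipelineStage(nn.Module):
    """Toy stand-in for a pipeline stage (hypothetical, not the real Megatron class)."""

    def __init__(self, hidden, is_first_stage):
        super().__init__()
        self.is_first_stage = is_first_stage
        self.embed = nn.Embedding(100, hidden)   # only meaningful on the first stage
        self.block = nn.Linear(hidden, hidden)   # stand-in for the transformer layers
        self.input_tensor = None                 # filled by set_input_tensor()

    def set_input_tensor(self, input_tensor):
        # Called by the schedule with the activation received from the previous stage.
        self.input_tensor = input_tensor

    def forward(self, tokens):
        if self.is_first_stage:
            hidden_states = self.embed(tokens)   # first stage: use the batch
        else:
            hidden_states = self.input_tensor    # later stages: ignore the embedding path
        return self.block(hidden_states)

# Usage: stage 1 still receives the tokens, but its hidden states come from stage 0's output.
stage0 = ToyPipelineStage(8, is_first_stage=True)
stage1 = ToyPipelineStage(8, is_first_stage=False)
tokens = torch.randint(0, 100, (2, 4))
out0 = stage0(tokens)            # would be sent to the next stage via p2p communication
stage1.set_input_tensor(out0)    # the schedule injects the received activation
out1 = stage1(tokens)            # tokens are read, but the embedding output is bypassed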
Code in pretrain_gpt.py:
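Paraphrased from memory (not the verbatim source; get_batch and loss_func are the helpers defined in the same file), the forward_step there looks roughly like this:

from functools import partial

def forward_step(data_iterator, model):
    """Forward step for GPT pretraining (paraphrased sketch)."""
    # Every stage pulls a batch here, even though only the first stage
    # actually needs the token embeddings.
    tokens, labels, loss_mask, attention_mask, position_ids = get_batch(data_iterator)
    output_tensor = model(tokens, position_ids, attention_mask, labels=labels)
    return output_tensor, partial(loss_func, loss_mask)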
Here are my results:
My configuration:
GPUS_PER_NODE=4
# Change for multinode config
MASTER_ADDR=172.20.20.220
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

DATA_PATH=data/my-gpt2_text_document
CHECKPOINT_PATH=model/model_optim_rng.pt
MODEL_PATH=model/output/pp

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

python -m torch.distributed.launch $DISTRIBUTED_ARGS \
    pretrain_gpt.py \
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 4 \
    --num-layers 12 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --micro-batch-size 16 \
    --global-batch-size 64 \
    --seq-length 1024 \
    --max-position-embeddings 1024 \
    --train-iters 1
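If I understand the microbatch split correctly (assuming a data-parallel size of 1, since all 4 GPUs are assigned to the 4 pipeline stages), this configuration runs 64 / 16 = 4 microbatches through the pipeline per iteration; a quick sanity check:

# Quick sanity check of the schedule implied by the flags above
# (assumes data-parallel size = GPUs / (TP * PP) = 4 / (1 * 4) = 1).
gpus = 4
tensor_parallel = 1
pipeline_parallel = 4
micro_batch = 16
global_batch = 64

data_parallel = gpus // (tensor_parallel * pipeline_parallel)       # 1
num_microbatches = global_batch // (micro_batch * data_parallel)    # 4
print(f"data-parallel size = {data_parallel}, microbatches per iteration = {num_microbatches}")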