Open randy-ac opened 2 months ago
Hey Randy, I think some variation is expected, as there can be slight differences in some numerical operations which build up along the way. It might be interesting to investigate at various steps at the beginning of training to see which operations are most impactful. It could also be interesting to check whether significant differences in output values occur at inference as well.
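For the inference check, something as simple as the following could already help (a minimal sketch; the dump file names are hypothetical, and it assumes each setup saves the logits of the same input batch with torch.save):

```python
import torch

# Hypothetical dumps produced by each setup for the same checkpoint and input,
# e.g. torch.save(logits.float().cpu(), "logits_1gpu.pt") in the 1-GPU run
# and torch.save(logits.float().cpu(), "logits_tp.pt") in the tensor_parallel run.
a = torch.load("logits_1gpu.pt")
b = torch.load("logits_tp.pt")

diff = (a - b).abs()
print("max abs diff :", diff.max().item())
print("mean abs diff:", diff.mean().item())
print("allclose(rtol=1e-3, atol=1e-5):", torch.allclose(a, b, rtol=1e-3, atol=1e-5))
```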
Probably unrelated to your issue, but bear in mind that Llama 3 uses RoPE scaling, which is not implemented in Eole yet.
Hello both,
thanks for your replies. I will check your suggestions as soon as possible and will keep you posted.
Hello, thanks for your answers. @randy-ac will run and compare the single-GPU and tensor-parallel modes over a longer run, without quantization or dropout, to avoid “spurious” differences.
Hello everyone, I'm finally back to this topic with the results of some experiments we carried out. We trained two models for 1000 steps: one on 2 GPUs in tensor_parallel mode and another on a single GPU. The task is domain classification. The two models had exactly the same configs. We removed quantization and dropout to avoid introducing other variables into the experiment. Please see the configs attached. We still found that the two models diverge in validation accuracy, output values for the same checkpoint, the LM decoder forward, and checkpoint sizes. In general, the model trained with tensor_parallel seems to achieve worse performance.
TensorBoard logs (the red curve is the run with tensor_parallel)
Output values: We tested checkpoint 400 of both models. There is a ~18% difference between the two accuracy values (i.e. whether the predicted label matches the gold label). Please find attached a TSV with the outputs of each model.
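For reference, the accuracy figure is a plain label match between prediction and gold. A minimal sketch of how it can be computed (the file and column names here are assumptions; the attached file may be laid out differently):

```python
import csv

def accuracy(path, pred_col="prediction", gold_col="gold"):
    """Fraction of rows where the predicted label equals the gold label."""
    hits = total = 0
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            hits += row[pred_col].strip() == row[gold_col].strip()
            total += 1
    return hits / total

print("1 GPU          :", accuracy("outputs_1gpu.tsv"))
print("tensor_parallel:", accuracy("outputs_tensor_parallel.tsv"))
```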
Different values in decoder forward: For each layer, we printed out the norm of the layer input and of the attention output. Some differences start to build up at layer 3 in the first step (i.e. before the first backward pass); a sketch of one way to obtain such printouts follows the dumps below:
1 GPU
Layer nr 3
Layer_in norm: 389.75
norm_layer_in Euclidean Distance to zero: 691.5
attn_output Euclidean Distance to zero: 18.78125
Layer nr 4
Layer_in norm: 391.25
norm_layer_in Euclidean Distance to zero: 620.0

Tensor parallel
Layer nr 3
Layer_in norm: 389.75
norm_layer_in Euclidean Distance to zero: 691.5
attn_output Euclidean Distance to zero: 18.796875
Layer nr 4
Layer_in norm: 391.5
norm_layer_in Euclidean Distance to zero: 620.0
attn_output Euclidean Distance to zero: 24.359375
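A minimal sketch of how such printouts can be produced with forward hooks; the attribute names (model.decoder.transformer_layers, layer.self_attn) are assumptions and need to be adapted to the actual Eole module structure:

```python
import torch

def register_norm_probes(model):
    """Print the L2 norm of each decoder layer's input and of its
    self-attention output on every forward pass."""
    def layer_probe(idx):
        def hook(module, args, output):
            # args[0] is assumed to be the layer input (hidden states)
            print(f"Layer nr {idx} | layer_in norm: {args[0].norm().item():.4f}")
        return hook

    def attn_probe(idx):
        def hook(module, args, output):
            out = output[0] if isinstance(output, tuple) else output
            print(f"Layer nr {idx} | attn_output norm: {out.norm().item():.4f}")
        return hook

    # Assumed attribute names; adapt to the real model.
    for idx, layer in enumerate(model.decoder.transformer_layers):
        layer.register_forward_hook(layer_probe(idx))
        layer.self_attn.register_forward_hook(attn_probe(idx))
```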
Checkpoint size: The sizes (KB) of the 400th checkpoint for the parallel_mode model are:
5         llama3-8b-instruct-parallel-eole-test-long/step_400/config.json
15700311  llama3-8b-instruct-parallel-eole-test-long/step_400/merged
15397     llama3-8b-instruct-parallel-eole-test-long/step_400/model.00.safetensors
64753     llama3-8b-instruct-parallel-eole-test-long/step_400/optimizer.pt
2069      llama3-8b-instruct-parallel-eole-test-long/step_400/vocab.json

The sizes (KB) of the 400th checkpoint for the 1 GPU model are:
5         llama3-8b-instruct-1gpu-eole-test-long/step_400/config.json
15700324  llama3-8b-instruct-1gpu-eole-test-long/step_400/merged
13349     llama3-8b-instruct-1gpu-eole-test-long/step_400/model.00.safetensors
80137     llama3-8b-instruct-1gpu-eole-test-long/step_400/optimizer.pt
2069      llama3-8b-instruct-1gpu-eole-test-long/step_400/vocab.json
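If it helps narrow this down, the two model.00.safetensors files listed above can be diffed at the tensor level (a minimal sketch; it assumes the checkpoints are plain safetensors files readable with the safetensors library):

```python
from safetensors import safe_open

def summary(path):
    """Map tensor name -> (shape, dtype) for one safetensors file."""
    out = {}
    with safe_open(path, framework="pt", device="cpu") as f:
        for k in f.keys():
            t = f.get_tensor(k)
            out[k] = (tuple(t.shape), str(t.dtype))
    return out

a = summary("llama3-8b-instruct-1gpu-eole-test-long/step_400/model.00.safetensors")
b = summary("llama3-8b-instruct-parallel-eole-test-long/step_400/model.00.safetensors")

print("only in 1 GPU          :", sorted(a.keys() - b.keys()))
print("only in tensor_parallel:", sorted(b.keys() - a.keys()))
for k in sorted(a.keys() & b.keys()):
    if a[k] != b[k]:
        print(f"{k}: 1gpu={a[k]} vs tp={b[k]}")
```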
Could you please advise? Thanks!
Attachments: output.csv, tensor_parallel_model_configs.json, 1gpu_model_configs.json
@randy-ac, are you seeing this in your log while training on 2 GPUs with tensor_parallel?
/home/vincent/miniconda3/envs/pt2.3/lib/python3.11/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905969073/work/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
/home/vincent/miniconda3/envs/pt2.3/lib/python3.11/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905969073/work/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
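For context, that warning says an all_reduce issued in the forward pass has no autograd kernel registered, so gradients may silently not flow through it. A common way to make the collective differentiable (a generic Megatron-style sketch, not necessarily what the actual fix does) is to wrap it in a custom autograd.Function:

```python
import torch
import torch.distributed as dist

class ReduceFromModelParallel(torch.autograd.Function):
    """All-reduce the activations in the forward pass; pass the gradient
    through unchanged in the backward pass (it is already complete on
    every rank for this kind of tensor-parallel layer output)."""

    @staticmethod
    def forward(ctx, x):
        dist.all_reduce(x)  # sum the partial outputs from all ranks
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # identity: no collective needed in backward

def reduce_from_model_parallel(x):
    return ReduceFromModelParallel.apply(x)
```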
Can you git pull and try again?
Hello Vincent. Thanks for the feedback! I confirm I had that UserWarning in my logs when fine-tuning in tensor_parallel mode. I will re-run my tests on #116 and keep you posted.
Hello,
I re-tested using commit #116 and here are the results.
I confirm that the UserWarning about c10d::allreduce_ no longer appears in the logs.
In TensorBoard (the red curve is the run fine-tuned in tensor_parallel mode), the two lines are closer to each other than in the previous tests.
However, we still observe the following differences between the model trained on 1 GPU and the one trained on 2 GPUs:
Accuracy: There is still a difference between the accuracy reached by the two models (i.e. whether the predicted label matches the gold label). The model trained on 1 GPU reaches an accuracy ~14% higher than the tensor_parallel model.
Checkpoint size: The sizes (in KB) of the 400th checkpoints of the two models are still quite different:
1 GPU
15700315  /nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-1gpu-eole-test-fix/step_400/merged
13349     /nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-1gpu-eole-test-fix/step_400/model.00.safetensors
80137     /nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-1gpu-eole-test-fix/step_400/optimizer.pt
2069      /nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-1gpu-eole-test-fix/step_400/vocab.json

Tensor parallel
15700315  /nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-parallel-eole-test-fix/step_400/merged
15397     /nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-parallel-eole-test-fix/step_400/model.00.safetensors
64761     /nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-parallel-eole-test-fix/step_400/optimizer.pt
2069      /nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-parallel-eole-test-fix/step_400/vocab.json
Different values in decoder forward: For each layer, we printed out the norm of the layer input and the attention output. There is still a small difference in layer 3.

Parallel mode
Layer nr 3
Layer_in norm: 390.0
norm_layer_in Euclidean Distance to zero: 719.5
attn_output Euclidean Distance to zero: 19.40625
Layer nr 3
Layer_in norm: 390.0
norm_layer_in Euclidean Distance to zero: 719.5
attn_output Euclidean Distance to zero: 19.40625

1 GPU
Layer nr 3
Layer_in norm: 389.75
norm_layer_in Euclidean Distance to zero: 719.5
attn_output Euclidean Distance to zero: 19.40625
The gap closes at layer 20; from then on, the printed values are the same for the two modes until the end of training.
I am attaching the two fine-tuning configs.
Thanks!
Attachments: tensor_parallel_config_latest.json, 1gpu_config_latest.json
There might still be a bug somewhere, but it is not easy to track down without a step-by-step comparison of each operation (maybe using a single example and some printouts).
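One way to make that step-by-step comparison less painful (a minimal sketch; the trace file names and the single-example setup are assumptions) is to record the output norm of every module with forward hooks and diff the two traces offline:

```python
import torch

def trace_module_norms(model, record):
    """Register hooks that append (module_name, output_norm) to `record`
    on every forward call."""
    def make_hook(name):
        def hook(module, args, output):
            out = output[0] if isinstance(output, (tuple, list)) else output
            if torch.is_tensor(out):
                record.append((name, out.float().norm().item()))
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# In each setup: run the same single example once, save the trace, e.g.
#   torch.save(record, "trace_1gpu.pt")  or  torch.save(record, "trace_tp.pt"),
# then diff the two traces entry by entry:
ref = torch.load("trace_1gpu.pt")
tp = torch.load("trace_tp.pt")
for (name, a), (_, b) in zip(ref, tp):
    if abs(a - b) > 1e-3:
        print(f"{name}: {a:.6f} vs {b:.6f}")
```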
Hello,
We have noticed some unexpected behavior when fine-tuning a Llama 3 model on 1 GPU compared to fine-tuning the same model on the same dataset with 2 GPUs in parallel mode. See the attached TensorBoard graphs (red = run with parallel mode). The minimum validation perplexity differs between the two runs.
As you can see from the configs I am pasting below, the only parameters that differ between the runs are: world_size, gpu_rank and parallel_mode.
Could you please advise?
Configs for run with 1 GPU
Configs for run with parallel_mode