eole-nlp / eole

Open language modeling toolkit based on PyTorch
https://eole-nlp.github.io/eole
MIT License

The finetuning in tensor parallel mode does not work as expected #18

Closed by l-k-11235 1 month ago

l-k-11235 commented 1 month ago

My finetuning logs show very different statistics depending on whether finetuning is run in single-GPU or multi-GPU mode. For instance:

normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
[2024-06-10 15:23:11,451 INFO] Weighted corpora loaded so far:
            * train_dataset: 1
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
[2024-06-10 15:23:11,648 INFO] Weighted corpora loaded so far:
            * train_dataset: 1
/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[2024-06-10 15:24:15,087 INFO] Step 10/20000; acc: 21.6; ppl: 1004.81; xent: 6.91; aux: 0.000; lr: 2.00e-05; sents:     160; bsz: 1076/ 357/ 1; 2622/868 tok/s;     66 sec;
[2024-06-10 15:25:14,439 INFO] Step 20/20000; acc: 44.4; ppl: 77.83; xent: 4.35; aux: 0.000; lr: 2.00e-05; sents:     160; bsz: 1037/ 341/ 1; 2795/920 tok/s;    125 sec;
[2024-06-10 15:26:14,764 INFO] Step 30/20000; acc: 45.3; ppl: 47.23; xent: 3.86; aux: 0.000; lr: 2.00e-05; sents:     160; bsz: 1045/ 344/ 1; 2773/912 tok/s;    185 sec;
funboarder13920 commented 1 month ago

After an investigation with Lina, it seems that the problem is related to the renaming of https://github.com/vince62s/eole/blob/bbd620c8be47c2ab51c1d0b64e35d737352d1087/eole/modules/transformer_mlp.py#L47 without updating the corresponding string in https://github.com/eole-nlp/eole/blob/3a9b137b7e063e4ce9cebbfa4e842f56aa2af555/eole/models/model.py#L552-L558.
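For context, the failure mode has this shape: the tensor-parallel loading path decides which weights to shard by matching module names against hard-coded strings, so renaming a module attribute makes the match fail silently. Below is a minimal sketch of that pattern, assuming such a string table; the names `MLP`, `up_proj`, `w_1`, `COLUMN_PARALLEL`, and `load_tensor_parallel` are illustrative, not eole's actual identifiers.

```python
import torch
import torch.nn as nn

# Hypothetical sharding table: base names of modules whose weights should be
# split across tensor-parallel ranks (names are illustrative, not eole's).
COLUMN_PARALLEL = {"w_1", "w_3"}

def shard_rows(weight: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    """Keep only this rank's slice of a weight that should be sharded."""
    return torch.chunk(weight, world_size, dim=0)[rank].contiguous()

class MLP(nn.Module):
    def __init__(self, d_model: int = 16, d_ff: int = 64):
        super().__init__()
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)  # renamed from `w_1`
        self.w_2 = nn.Linear(d_ff, d_model, bias=False)

def load_tensor_parallel(module: nn.Module, rank: int, world_size: int) -> None:
    for name, param in module.named_parameters():
        base = name.split(".")[0]
        if base in COLUMN_PARALLEL:
            # Never reached for `up_proj`: the table still says `w_1`, so the
            # full, unsharded weight is silently kept on every rank and the
            # multi-GPU statistics diverge from the single-GPU run.
            param.data = shard_rows(param.data, rank, world_size)

mlp = MLP()
load_tensor_parallel(mlp, rank=0, world_size=2)
print(mlp.up_proj.weight.shape)  # torch.Size([64, 16]) -- no sharding happened
```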

We will think about a fix to make the code robust to this kind of change (which also broke all the convert scripts), and about a few tests that could cover it.
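One possible shape for such a test (purely illustrative, reusing the hypothetical `MLP` and `COLUMN_PARALLEL` from the sketch above, not eole's actual code): assert that every entry in the sharding table still resolves to a module of the built model, so a rename fails in CI instead of silently skipping the split.

```python
def test_sharding_table_matches_model():
    """Fail loudly if the sharding table references a module that no longer exists."""
    model = MLP()
    module_names = {name.split(".")[0] for name, _ in model.named_parameters()}
    missing = COLUMN_PARALLEL - module_names
    assert not missing, f"sharding table references unknown modules: {missing}"
```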

Do you have a space where we could discuss the strategy and goals of this repo?

vince62s commented 1 month ago

> We will think about a fix to make the code robust to this kind of change (which also broke all the convert scripts), and about a few tests that could cover it.

This is not happening very often; I will fix the few places where I missed those.

> Do you have a space where we could discuss the strategy and goals of this repo?

I suggest the Discussions tab of this repo.

vince62s commented 1 month ago

See #30 and let me know if it completely fixes the issue.

l-k-11235 commented 1 month ago

It seems to work well. With 2 GPUs I now have this in the logs:

[2024-06-14 06:52:41,726 INFO] Starting training on GPU: [0, 1]
[2024-06-14 06:52:41,726 INFO] Start training loop and validate every 200 steps...
[2024-06-14 06:52:41,727 INFO] Scoring with: {'insert_mask_before_placeholder': InsertMaskBeforePlaceholdersTransform(), 'onmt_tokenize': ONMTTokenizerTransform(share_vocab=True, src_subword_kwargs={'bpe_model_path': '/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole//llama3-8b/bpe.model', 'bpe_dropout': 0.0}, src_onmttok_kwargs={'mode': 'none'}, tgt_subword_kwargs={'bpe_model_path': '/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole//llama3-8b/bpe.model', 'bpe_dropout': 0.0}, tgt_onmttok_kwargs={'mode': 'none'}), 'filtertoolong': FilterTooLongTransform(src_seq_length=512, tgt_seq_length=512)}

This fp16_optimizer is designed to only work with apex.contrib.optimizers.*
To update, use updated optimizers with AMP.
[2024-06-14 06:52:43,949 INFO] Weighted corpora loaded so far:
            * cred_dataset: 1
[2024-06-14 06:52:44,387 INFO] Weighted corpora loaded so far:
            * cred_dataset: 1
[2024-06-14 06:53:27,376 INFO] Step 10/20000; acc: 49.2; ppl: 249.85; xent: 5.52; aux: 0.000; lr: 2.00e-05; sents:     320; bsz:  365/ 114/ 1; 2561/800 tok/s;     46 sec;
[2024-06-14 06:54:08,233 INFO] Step 20/20000; acc: 50.2; ppl: 189.43; xent: 5.24; aux: 0.000; lr: 2.00e-05; sents:     320; bsz:  375/ 117/ 1; 2937/916 tok/s;     87 sec;
[2024-06-14 06:54:48,808 INFO] Step 30/20000; acc: 51.1; ppl: 152.03; xent: 5.02; aux: 0.000; lr: 2.00e-05; sents:     320; bsz:  375/ 117/ 1; 2958/923 tok/s;    127 sec;