Closed — l-k-11235 closed this issue 1 month ago
After investigating with Lina, it seems the problem is related to the attribute renaming in https://github.com/vince62s/eole/blob/bbd620c8be47c2ab51c1d0b64e35d737352d1087/eole/modules/transformer_mlp.py#L47 without updating the corresponding string in https://github.com/eole-nlp/eole/blob/3a9b137b7e063e4ce9cebbfa4e842f56aa2af555/eole/models/model.py#L552-L558
We will think about a fix to make the code robust to this kind of change, which also broke all the convert scripts, and about a few tests that could cover it.
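One way such a test could look: keep the hard-coded submodule-name strings in one place and assert that they still match the module's actual attributes, so a rename is caught by CI instead of failing at checkpoint-load time. A minimal sketch — the class and attribute names below are hypothetical stand-ins, not eole's actual API:

```python
class TransformerMLP:
    """Stand-in for the real MLP module (attribute names are illustrative)."""

    def __init__(self):
        # Attributes whose names the conversion code refers to by string.
        self.gate_up_proj = object()
        self.down_proj = object()


# Strings like the ones hard-coded in model.py's key-mapping logic.
EXPECTED_SUBMODULES = ["gate_up_proj", "down_proj"]


def missing_names(module, expected):
    """Return the expected attribute names that no longer exist on the module."""
    return [name for name in expected if not hasattr(module, name)]


# A test would simply assert that nothing is out of sync:
assert missing_names(TransformerMLP(), EXPECTED_SUBMODULES) == []
```

Run against every module that the convert scripts reference by string, a check like this would have flagged the rename immediately.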
Do you have a space for discussing the strategy and goals of this repo?
> We will think about a fix to make the code robust to this kind of change, which also broke all the convert scripts, and about a few tests that could cover it.
This does not happen very often; I will fix the few places where I missed these renames.
> Do you have a space for discussing the strategy and goals of this repo?
I suggest the Discussions tab of this repo
See #30 — let me know if it completely fixes the issue.
It seems to work well; with 2 GPUs I now have this in the logs:
```
[2024-06-14 06:52:41,726 INFO] Starting training on GPU: [0, 1]
[2024-06-14 06:52:41,726 INFO] Start training loop and validate every 200 steps...
[2024-06-14 06:52:41,727 INFO] Scoring with: {'insert_mask_before_placeholder': InsertMaskBeforePlaceholdersTransform(), 'onmt_tokenize': ONMTTokenizerTransform(share_vocab=True, src_subword_kwargs={'bpe_model_path': '/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole//llama3-8b/bpe.model', 'bpe_dropout': 0.0}, src_onmttok_kwargs={'mode': 'none'}, tgt_subword_kwargs={'bpe_model_path': '/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole//llama3-8b/bpe.model', 'bpe_dropout': 0.0}, tgt_onmttok_kwargs={'mode': 'none'}), 'filtertoolong': FilterTooLongTransform(src_seq_length=512, tgt_seq_length=512)}
This fp16_optimizer is designed to only work with apex.contrib.optimizers.*
To update, use updated optimizers with AMP.
[2024-06-14 06:52:43,949 INFO] Weighted corpora loaded so far:
            * cred_dataset: 1
[2024-06-14 06:52:44,387 INFO] Weighted corpora loaded so far:
            * cred_dataset: 1
[2024-06-14 06:53:27,376 INFO] Step 10/20000; acc: 49.2; ppl: 249.85; xent: 5.52; aux: 0.000; lr: 2.00e-05; sents: 320; bsz: 365/ 114/ 1; 2561/800 tok/s; 46 sec;
[2024-06-14 06:54:08,233 INFO] Step 20/20000; acc: 50.2; ppl: 189.43; xent: 5.24; aux: 0.000; lr: 2.00e-05; sents: 320; bsz: 375/ 117/ 1; 2937/916 tok/s; 87 sec;
[2024-06-14 06:54:48,808 INFO] Step 30/20000; acc: 51.1; ppl: 152.03; xent: 5.02; aux: 0.000; lr: 2.00e-05; sents: 320; bsz: 375/ 117/ 1; 2958/923 tok/s; 127 sec;
```
My finetuning logs show very different statistics depending on whether finetuning is run in single- or multi-GPU mode. For instance: