Unbabel / COMET

A Neural Framework for MT Evaluation
https://unbabel.github.io/COMET/html/index.html
Apache License 2.0

[QUESTION] Memory footprint #208

Closed vince62s closed 3 months ago

vince62s commented 3 months ago

Trying to train with examples (trainer.yaml, unified_metric.yaml) I am facing some memory issues.

I changed "precision: 16-mixed" in trainer.yaml but it does not help. It starts OK with a low memory footprint until the first 0.3 epoch, then it unfreezes the encoder and runs into OOM very quickly.

Are we supposed to be able to train with XLMRoberta-large on a 24GB card? (I had to reduce the batch size to 1, but then it is way too slow and unoptimized.)

ricardorei commented 3 months ago

You can train an XLMRoberta-large on a 24GB card if you keep the embeddings frozen. XLM-R embeddings take a lot of space, but keeping them frozen has no impact on performance and reduces memory a lot.

vince62s commented 3 months ago

How do I do that? Is it documented somewhere? EDIT: found it, but they were frozen already...

Also, is the exact same 1720-da.csv dataset downloadable somewhere? Because I am running tests independently with 17, 18, 19, but with 20 I am getting this error (encoder is miniLM):

Loading data/2020-da.csv.
Epoch 0:  30%|███████████▊                           | 3167/10555 [01:59<04:39, 26.45it/s, v_num=0]
Encoder model fine-tuning
Epoch 0: 100%|███████████████████████████████████████| 10555/10555 [13:16<00:00, 13.26it/s, v_num=0]
/home/vincent/miniconda3/envs/pt2.1.0/lib/python3.11/site-packages/scipy/stats/_stats_py.py:5445: ConstantInputWarning: An input array is constant; the correlation coefficient is not defined.
████| 13/13 [00:00<00:00, 37.22it/s]
  warnings.warn(stats.ConstantInputWarning(warn_msg))
/home/vincent/miniconda3/envs/pt2.1.0/lib/python3.11/site-packages/scipy/stats/_stats_py.py:4781: ConstantInputWarning: An input array is constant; the correlation coefficient is not defined.
  warnings.warn(stats.ConstantInputWarning(msg))
Epoch 0: 100%|███████████████████████████████████████| 10555/10555 [13:18<00:00, 13.22it/s, v_num=0, val_kendall=nan.0, val_spearman=nan.0, val_pearson=nan.0]
Epoch 0, global step 1320: 'val_kendall' reached -inf (best -inf), saving model to '/home/vincent/nlp/COMET/lightning_logs/version_0/checkpoints/epoch=0-step=1320-val_kendall=nan.ckpt' as top 5

ricardorei commented 3 months ago

Also, you should use precision 16.

To keep the embeddings frozen, just keep this flag at true.
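
For reference, one way to double-check those flags before launching training; this is a hedged sketch, assuming the example config paths (trainer.yaml, unified_metric.yaml) and simply searching the parsed YAML for the key names mentioned in this thread:

```python
import yaml

def find_key(node, key):
    """Recursively yield every value stored under `key` in a parsed YAML tree."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)

# Paths are assumptions; point these at your local copies of the example configs.
for path, keys in [
    ("configs/models/unified_metric.yaml", ["keep_embeddings_frozen", "nr_frozen_epochs"]),
    ("configs/trainer.yaml", ["precision", "accumulate_grad_batches"]),
]:
    with open(path) as f:
        cfg = yaml.safe_load(f)
    for key in keys:
        print(path, key, "->", list(find_key(cfg, key)))
```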

ricardorei commented 3 months ago

Hmm, that looks like either the model is not converging or your ground-truth scores are all the same.

You can find the data here

vince62s commented 3 months ago

The embeddings-frozen flag is already set to true in unified_metric.yaml, so that doesn't help. When I set precision: 16, I get a warning saying it's better to use 16-mixed for AMP. I'll try 16, but I think I got an error with 16 only.

dmar1n commented 3 months ago

Hi @vince62s,

The precision value I currently use to avoid the warning is 16-mixed (following this). Also, you might want to try with nr_frozen_epochs: 1.0 and a bigger value for accumulate_grad_batches.
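
As a side note on accumulate_grad_batches: the optimizer then sees an effective batch equal to the per-step batch times the accumulation factor, at the memory cost of the small per-step batch. The numbers below are illustrative only, not values from this thread:

```python
# Gradient accumulation trades steps for memory: the optimizer update uses
# gradients summed over several small forward/backward passes.
batch_size = 4                # per-step batch that fits in memory
accumulate_grad_batches = 8   # steps accumulated before each optimizer update
effective_batch_size = batch_size * accumulate_grad_batches
print(effective_batch_size)   # 32
```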

Hope this helps.

vince62s commented 3 months ago

> Hmm, that looks like either the model is not converging or your ground-truth scores are all the same.

It's the plain WMT 2020 DA csv file.

vince62s commented 3 months ago

> Hi @vince62s,
>
> The precision value I currently use to avoid the warning is 16-mixed (following this). Also, you might want to try with nr_frozen_epochs: 1.0 and a bigger value for accumulate_grad_batches.
>
> Hope this helps.

The memory issue appears as soon as the encoder is no longer frozen. So to test (and avoid waiting) I set nr_frozen_epochs=0.0 so that I see right away whether things fit in the VRAM. With precision: 16 / batch_size 4 we are at the very limit of 24GB; it would be a pity if it crashes. There could be two nice options: 1) a filter-too-long check to exclude the very long examples that trigger this, and 2) a try/except when it goes OOM so that the batch can be discarded and training continues.
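
Neither option exists in COMET at this point in the thread; just to make the two suggestions concrete, here is a rough plain-PyTorch sketch (every name here, the sample fields, the tokenizer, the 512 cap, the loss callable, is a placeholder, not COMET code):

```python
import torch

def filter_too_long(samples, tokenizer, max_tokens=512):
    """Option 1: drop examples whose src+mt token count exceeds a cap."""
    kept = []
    for s in samples:
        n_tokens = len(tokenizer.encode(s["src"])) + len(tokenizer.encode(s["mt"]))
        if n_tokens <= max_tokens:
            kept.append(s)
    return kept

def safe_step(compute_loss, batch, optimizer):
    """Option 2: on CUDA OOM, discard the batch and keep training."""
    try:
        loss = compute_loss(batch)
        loss.backward()
        optimizer.step()
    except torch.cuda.OutOfMemoryError:
        print("CUDA OOM on this batch; skipping it")
        torch.cuda.empty_cache()
    finally:
        optimizer.zero_grad(set_to_none=True)
```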

vince62s commented 3 months ago

> Hmm, that looks like either the model is not converging or your ground-truth scores are all the same.

> It's the plain WMT 2020 DA csv file.

OK, with miniLM the learning_rate needs to be much higher, and then it works fine with the 2020 data.

vince62s commented 3 months ago

> You can find the data here

@ricardorei can you share the script that computes those csv files? I would like to redo the same but exclude some specific systems. Or do you have the same data with the system name as a column?
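
For what it's worth, if such a per-system csv existed, the filtering itself would be short; a hedged pandas sketch, where the input file and its `system` column are hypothetical and the src/mt/ref/score columns mirror the released DA csvs:

```python
import pandas as pd

# Hypothetical input: a DA csv that still carries the MT system name per row.
df = pd.read_csv("2020-da-with-system.csv")
excluded_systems = {"some-system-to-drop"}          # placeholder system names
df = df[~df["system"].isin(excluded_systems)]
df[["src", "mt", "ref", "score"]].to_csv("2020-da.csv", index=False)
```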

ricardorei commented 3 months ago

I actually found the notebooks I used... but I did not save the data, just the raw notebooks. They should help you redo the data.

Archive.zip

ricardorei commented 3 months ago

They also point to the previous WMT websites where you can download the data.

vince62s commented 3 months ago

Thanks, in the meantime I managed to do it for WMT 2021. I was able to exclude one system, but it gives me the same results.

I still have an issue with the WMT 22 data: whatever the learning rate, when training only on those data it does not converge.

ricardorei commented 3 months ago

You mean the DAs from WMT 22? Some years of WMT are known to have very noisy DAs, and for WMT 22 I would not use them... For WMT 22 you have the SQM data or the MQM from the metrics task. The DAs from WMT 2022 were collected only into English and are known to be noisy.

vince62s commented 3 months ago

But here: https://huggingface.co/datasets/RicardoRei/wmt-da-human-evaluation there is some 2022 data. Is it DA or something else? I trained on the 2022 extract from there, so it must be DA.

ricardorei commented 3 months ago

Yes, exactly. It's those DAs from WMT 22.

ricardorei commented 3 months ago

Usually I only use DAs from 2017 to 2020. Even those from 2021 I don't trust too much.
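
Following that advice with the Hub dataset linked above, a minimal sketch for keeping only the 2017-2020 annotations (the `year` column name is taken from the dataset card; verify it against your copy):

```python
from datasets import load_dataset

# Pull the public DA collection and keep only the 2017-2020 rows.
ds = load_dataset("RicardoRei/wmt-da-human-evaluation", split="train")
ds_17_20 = ds.filter(lambda row: 2017 <= int(row["year"]) <= 2020)
ds_17_20.to_csv("wmt17-20-da.csv")
```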

vince62s commented 3 months ago

But do you have the exact dataset used for wmt23-cometkiwi-da-xl and for wmt22-cometkiwi-da?

ricardorei commented 3 months ago

Yes, I do. Let me download it and I'll share it here.

It's basically WMT 17 to 20 + MLQE-PE data.

ricardorei commented 3 months ago

It's too big. I'll share it by email.

vince62s commented 3 months ago

Closing this, but training with XLM-Roberta large or XL is still an issue with 24GB of VRAM. Maybe using LoRA would help and be the solution.
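
LoRA is not wired into COMET here, but as a sketch of the idea, the XLM-R encoder can be wrapped with PEFT adapters so only a small fraction of the weights is trained; the module names and ranks below are illustrative, and plugging this into COMET's training loop would still be up to you:

```python
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

# Load the underlying encoder and attach low-rank adapters to its attention projections.
encoder = AutoModel.from_pretrained("xlm-roberta-large")
lora_cfg = LoraConfig(
    r=8,                                # adapter rank (illustrative)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query", "value"],  # attention projection layers in XLM-R
)
encoder = get_peft_model(encoder, lora_cfg)
encoder.print_trainable_parameters()    # only a small fraction of the weights is trainable
```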