Hi @JiunHaoJhan, greetings. This might work: under test.trainer_args, try adding eval_accumulation_steps as a key with a small value.
See the training arg here: https://github.com/huggingface/transformers/blob/3981ee8650042e89d9c430ec34def2d58a2a12f7/src/transformers/training_args.py#L155
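For example, here is a minimal sketch of what that part of the YAML config might look like (the surrounding key names and the step value are assumptions based on this thread; only eval_accumulation_steps is the actual fix):

```yaml
test:
  trainer_args:
    per_device_eval_batch_size: 1
    # Move the accumulated prediction tensors from GPU to CPU every 8 eval steps
    # instead of holding them all on the GPU until evaluation finishes.
    eval_accumulation_steps: 8
```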
Please keep us posted if that works.
+1 to what @jgmf-amazon said. Here is an example of that in use from our mt5-base-enc run.
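For reference, the key maps to the eval_accumulation_steps argument of Hugging Face's TrainingArguments. A minimal standalone sketch (the output_dir and step value are illustrative, not taken from the mt5-base-enc config):

```python
from transformers import TrainingArguments

# eval_accumulation_steps flushes the accumulated prediction tensors to the CPU
# every N evaluation steps, instead of keeping every batch's outputs on the GPU
# until evaluation ends, which is what makes memory creep up during inference.
args = TrainingArguments(
    output_dir="outputs",            # illustrative path
    per_device_eval_batch_size=1,
    eval_accumulation_steps=8,       # small value; tune to your memory budget
)
```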
It works! Thanks a lot.
Great!
Hi,
I successfully trained an xlmr-base-20220411 Encoder Model following the README. When I try to run inference on the test set, GPU memory usage keeps growing and eventually causes a CUDA out-of-memory error.
Even with the batch size set to 1, GPU memory usage still creeps up until it runs out. Could you help me figure out what is going on?
Here is the command I use to run inference on the test set:
torchrun --nproc_per_node=4 scripts/test.py -c examples/xlmr_base_test_20220411.yml
Here is the content of the config file:
xlmr_base_test_20220411.yml
Note: I have 15 GB of GPU memory on my machine.