alexa / massive

Tools and Modeling Code for the MASSIVE dataset

GPU memory usage keeps growing when Performing Inference on the Test Set. #16

Closed JiunHaoJhan closed 2 years ago

JiunHaoJhan commented 2 years ago

Hi,

I successfully trained an xlmr-base-20220411 encoder model following the README. When I try to run inference on the test set, GPU memory usage keeps growing and eventually causes a CUDA out-of-memory error.

Even with the batch size set to 1, GPU memory usage still creeps up until it hits the same CUDA out-of-memory error. Could you help me figure this out?

Here is the command I use to run inference on the test set: torchrun --nproc_per_node=4 scripts/test.py -c examples/xlmr_base_test_20220411.yml

Here is the content of the config file xlmr_base_test_20220411.yml

run_name: &run_name xlmr_base_20220411_test
max_length: &max_length 512

model:
  type: xlmr intent classification slot filling
  checkpoint: checkpoints/xlmr_base_20220411/checkpoint-229400/

tokenizer:
  type: xlmr base
  tok_args:
    vocab_file: checkpoints/xlmr_base_20220411/checkpoint-229400/sentencepiece.bpe.model
    max_len: *max_length

collator:
  type: massive intent class slot fill
  args:
    max_length: *max_length
    padding: longest

test:
  test_dataset: massive_datasets/.test
  intent_labels: massive_datasets/.intents
  slot_labels: massive_datasets/.slots
  massive_path: ~/massive_0614/massive
  slot_labels_ignore:
    - Other
  eval_metrics: all
  predictions_file: logs/xlmr_base_20220411/preds.jsonl
  trainer_args:
    output_dir: checkpoints/xlmr_base_20220411/
    per_device_eval_batch_size: 4
    remove_unused_columns: false
    label_names:
      - intent_num
      - slots_num
    log_level: info
    logging_strategy: no
    locale_eval_strategy: all only
    #locale_eval_strategy: all and each
    disable_tqdm: false

Note: my machine has 15 GB of GPU memory.

jgmf-amazon commented 2 years ago

Hi @JiunHaoJhan, greetings. This might work: under test.trainer_args, add eval_accumulation_steps as a key with a small value.
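For example, the addition would look like this in your config (the value 2 is just an illustration; tune it to your memory budget):

```yaml
test:
  trainer_args:
    output_dir: checkpoints/xlmr_base_20220411/
    per_device_eval_batch_size: 4
    # Move accumulated prediction tensors from GPU to CPU every 2 eval steps:
    eval_accumulation_steps: 2
```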

See the training arg here: https://github.com/huggingface/transformers/blob/3981ee8650042e89d9c430ec34def2d58a2a12f7/src/transformers/training_args.py#L155

Please keep us posted if that works.
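For context on why this helps: per the Transformers docs, the Trainer's evaluation loop accumulates every batch's output logits on the GPU and only moves them to the CPU at the end, unless eval_accumulation_steps is set, in which case they are offloaded periodically. A torch-free sketch of that offload pattern (function and variable names here are illustrative, not the Trainer's actual internals):

```python
# Sketch of the accumulate-then-offload pattern behind eval_accumulation_steps.
# Lists stand in for logit tensors; "device" and "host" stand in for GPU/CPU.

def eval_loop(batches, eval_accumulation_steps=None):
    """Collect per-batch predictions, periodically draining them to host memory.

    Without eval_accumulation_steps, everything piles up in device_buffer
    (i.e., GPU memory) until the loop ends -- the growth this issue describes.
    """
    device_buffer = []   # stands in for tensors held in GPU memory
    host_preds = []      # stands in for tensors moved to CPU memory

    for step, logits in enumerate(batches, start=1):
        device_buffer.append(logits)
        if eval_accumulation_steps and step % eval_accumulation_steps == 0:
            # "Move to CPU": drain the device-side buffer
            host_preds.extend(device_buffer)
            device_buffer.clear()

    host_preds.extend(device_buffer)  # flush any remainder
    return host_preds

preds = eval_loop([[i] for i in range(10)], eval_accumulation_steps=2)
```

With accumulation steps set, the peak size of device_buffer is bounded by eval_accumulation_steps batches instead of the whole test set.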

cperiz commented 2 years ago

+1 to what @jgmf-amazon said. Here is an example of that in use from our mt5-base-enc run

JiunHaoJhan commented 2 years ago

It works! Thanks a lot.

cperiz commented 2 years ago

Great!