allenai / RL4LMs

A modular RL library to fine-tune language models to human preferences
https://rl4lms.apps.allenai.org/
Apache License 2.0

Memory issue in metric evals? #61

Open AnujMahajanOxf opened 1 year ago

AnujMahajanOxf commented 1 year ago

Hi all,

I am running into a GPU memory issue during metric evaluation.

I am using the following metrics:

  metrics:
    - id: meteor
      args: {}
    - id: rouge
    - id: bleu
      args: {}
    - id: bert_score # TODO (AM): running out of CUDA memory here
      args:
        language: en
    - id: cider
    - id: diversity
      args: {}

On monitoring the GPU usage for the card hosting the metric models, I see a steady increase in memory occupied:

initial:
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   51C    P0    71W / 300W |  3514MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
at 200 epochs:
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   53C    P0    73W / 300W |  22171MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
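
To put a number on the two readings above, here is a minimal sketch that estimates the growth rate and how many more epochs would fit on the card. It assumes the growth is linear between the two snapshots, which I have not verified. All helper names are hypothetical and are not part of RL4LMs; `query_gpu_memory` additionally assumes `nvidia-smi` is on the PATH and is only needed to take fresh readings.

```python
import subprocess
from typing import Tuple


def parse_memory_csv(line: str) -> Tuple[int, int]:
    """Parse one 'used, total' line (values in MiB) from nvidia-smi's CSV query output."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def query_gpu_memory(gpu_index: int) -> Tuple[int, int]:
    """Return (used_mib, total_mib) for one GPU by calling nvidia-smi (must be installed)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_csv(result.stdout.strip().splitlines()[0])


def growth_per_epoch(start_mib: int, end_mib: int, epochs: int) -> float:
    """Average memory growth in MiB per epoch, assuming linear growth (an assumption)."""
    return (end_mib - start_mib) / epochs


def epochs_until_full(used_mib: int, total_mib: int, rate_mib: float) -> float:
    """Epochs left before the card is full if growth continues at rate_mib per epoch."""
    return (total_mib - used_mib) / rate_mib


if __name__ == "__main__":
    # Readings copied from the nvidia-smi output above (GPU 7).
    rate = growth_per_epoch(start_mib=3514, end_mib=22171, epochs=200)
    remaining = epochs_until_full(used_mib=22171, total_mib=32768, rate_mib=rate)
    print(f"growth: {rate:.1f} MiB/epoch, epochs until full: {remaining:.1f}")
```

With these two readings the estimate comes out to about 93 MiB per epoch, so the card would fill up well before another 200 epochs.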

Any idea what might be causing this? Thanks