LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.
https://open-assistant.io
Apache License 2.0

Consuming much more gpu memory than expected using model_training and model_eval #3611

Closed SingL3 closed 12 months ago

SingL3 commented 1 year ago

Previously, I trained a pythia-6.9b model using the code here: dolly. I could train with the settings below on 4xA100 (80 GB) without GPU OOM:

per-device-train-batch-size: 8
per-device-eval-batch-size: 8
gradient-accumulation-steps: 2
max len: 2048
gradient checkpointing: false
use_cache: true
bf16: true

with the deepspeed config here. I can also evaluate the output model with lm-evaluation-harness on a single GPU with a batch size larger than one. However, now that I am using model_training to train a reward model, I can only run with the settings below on 8xA100 (80 GB):

per_device_train_batch_size: 4 # can be bigger using gradient checkpointing
per_device_eval_batch_size: 4
gradient_accumulation_steps: 4
max len: 2048
gradient checkpointing: true # otherwise GPU OOM even with per_device_train_batch_size 1
use_cache: false # has to be turned off since it conflicts with gradient checkpointing
bf16: true

with the deepspeed config zero3_config_sft.config (as you can see, it is very similar to the one above). In addition, I cannot evaluate the output model using eval_rm.py on a single GPU (even with batch size 1) because of GPU OOM. I didn't find any code that reduces GPU memory in dolly or lm-evaluation-harness, and GPTNeoXForCausalLM should consume more memory than GPTNeoXRewardModel, judging from the code of the output layer; I sketch the comparison below.
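
To make the output-layer point concrete, here is a rough back-of-envelope sketch. I am assuming the reward head is a single-logit linear layer on top of the hidden states; the sizes come straight from the published pythia-6.9b config:

```python
# Rough comparison of output-layer sizes for pythia-6.9b, assuming the reward
# head is a single-logit linear layer while the causal LM head (embed_out in
# GPTNeoXForCausalLM) projects to the full vocabulary.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("EleutherAI/pythia-6.9b")

lm_head_params = cfg.hidden_size * cfg.vocab_size  # embed_out of GPTNeoXForCausalLM
reward_head_params = cfg.hidden_size * 1           # assumed scalar reward head

bytes_per_param = 2  # bf16
print(f"lm_head:     {lm_head_params * bytes_per_param / 2**20:.1f} MiB")
print(f"reward head: {reward_head_params * bytes_per_param / 2**20:.3f} MiB")
```

Even in bf16 the LM head is only on the order of a few hundred MiB, so whatever is eating the extra memory during RM training does not look like it can be the model head itself.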

andreaskoepf commented 1 year ago

Yes, I also noticed that our current trainer code / configurations don't work even for smaller models on a single 80 GB GPU. It would be great to get this analyzed and fixed.
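
One quick way to get numbers for the analysis (just a sketch using plain torch.cuda statistics; the exact placement inside the trainer is up to whoever picks this up):

```python
# Minimal peak-memory probe to compare an SFT training step against an RM
# training step on the same rank. Reset the counters before a step and report
# right after it.
import torch

def reset_memory_stats() -> None:
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

def report_memory(tag: str) -> None:
    alloc = torch.cuda.max_memory_allocated() / 2**30
    reserved = torch.cuda.max_memory_reserved() / 2**30
    print(f"[{tag}] peak allocated: {alloc:.2f} GiB, peak reserved: {reserved:.2f} GiB")

# Hypothetical placement inside the training loop:
# reset_memory_stats()
# loss = trainer.training_step(model, batch)
# report_memory("rm train step")
```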

SingL3 commented 1 year ago

@andreaskoepf I will take a look into this issue and try to fix some of the causes (I think there may be more than one). If you have any clues or suggestions, please let me know; I would appreciate it.
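
For the eval_rm.py OOM specifically, the first thing I plan to check is whether scoring runs in bf16 and under torch.no_grad(). Here is a sketch of the idea only, not a patch against eval_rm.py itself; the import path for the reward model class, the .logits output field, and the checkpoint path are all assumptions on my side:

```python
# Sketch: score texts with the trained reward model in bf16 under no_grad.
# Assumptions: the class lives at model_training/models/reward_model.py in this
# repo, its forward returns a .logits field, and the checkpoint path below is a
# placeholder. Pad-token handling should follow whatever eval_rm.py already does.
import torch
from transformers import AutoTokenizer
from model_training.models.reward_model import GPTNeoXRewardModel

ckpt = "path/to/rm-checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = (
    GPTNeoXRewardModel.from_pretrained(ckpt, torch_dtype=torch.bfloat16)
    .cuda()
    .eval()
)

@torch.no_grad()
def score(texts):
    batch = tokenizer(
        texts, padding=True, truncation=True, max_length=2048, return_tensors="pt"
    ).to("cuda")
    return model(**batch).logits.squeeze(-1)

print(score(["example reply A", "example reply B"]))
```

If the current script keeps gradients or loads the model in fp32, that alone could explain why a 6.9b reward model does not fit on a single 80 GB GPU even with batch size 1.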