bigcode-project / starcoder

Home of StarCoder: fine-tuning & inference!

torch.cuda.OutOfMemoryError: CUDA out of memory When Trying to Save the Model #49

Open seyyedaliayati opened 1 year ago

seyyedaliayati commented 1 year ago

Howdy!

I am using the finetune/finetune.py script. Training runs fine on an NVIDIA A40, but at the end, when it tries to save the model/checkpoints, it raises torch.cuda.OutOfMemoryError: CUDA out of memory.

Here is a full traceback:

Traceback (most recent call last):
  File "/scratch/user/seyyedaliayati/auto-test-gpt/finetune.py", line 336, in <module>
    main(args)
  File "/scratch/user/seyyedaliayati/auto-test-gpt/finetune.py", line 325, in main
    run_training(args, train_dataset, eval_dataset)
  File "/scratch/user/seyyedaliayati/auto-test-gpt/finetune.py", line 313, in run_training
    trainer.train()
  File "/scratch/user/seyyedaliayati/.conda/envs/env/lib/python3.10/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
  File "/scratch/user/seyyedaliayati/.conda/envs/env/lib/python3.10/site-packages/transformers/trainer.py", line 2019, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/scratch/user/seyyedaliayati/.conda/envs/env/lib/python3.10/site-packages/transformers/trainer.py", line 2308, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/scratch/user/seyyedaliayati/.conda/envs/env/lib/python3.10/site-packages/transformers/trainer.py", line 2365, in _save_checkpoint
    self.save_model(output_dir, _internal_call=True)
  File "/scratch/user/seyyedaliayati/.conda/envs/env/lib/python3.10/site-packages/transformers/trainer.py", line 2866, in save_model
    self._save(output_dir)
  File "/scratch/user/seyyedaliayati/.conda/envs/env/lib/python3.10/site-packages/transformers/trainer.py", line 2909, in _save
    state_dict = self.model.state_dict()
  File "/scratch/user/seyyedaliayati/.conda/envs/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1448, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/scratch/user/seyyedaliayati/.conda/envs/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1448, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/scratch/user/seyyedaliayati/.conda/envs/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1448, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  [Previous line repeated 4 more times]
  File "/scratch/user/seyyedaliayati/.conda/envs/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1445, in state_dict
    self._save_to_state_dict(destination, prefix, keep_vars)
  File "/scratch/user/seyyedaliayati/.conda/envs/env/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 268, in _save_to_state_dict
    self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
  File "/scratch/user/seyyedaliayati/.conda/envs/env/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 96, in undo_layout
    outputs = torch.empty_like(tensor)  # note: not using .index_copy because it was slower on cuda
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 144.00 MiB (GPU 0; 47.38 GiB total capacity; 44.56 GiB already allocated; 109.19 MiB free; 46.17 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Any ideas what's happening and how to solve this issue?
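For reference, the allocator hint at the end of the traceback can be tried by setting PYTORCH_CUDA_ALLOC_CONF before PyTorch initializes CUDA. A minimal sketch; the 128 MiB split size is illustrative, not a value reported to work in this thread:

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # must be set before CUDA is initialized

import torch  # imported after the env var so the caching allocator picks it up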

MubarakHAlketbi commented 1 year ago

https://github.com/MHketbi/starcoder

try my fork

ywen666 commented 1 year ago

same error here with A40.

seyyedaliayati commented 1 year ago

https://github.com/MHketbi/starcoder

try my fork

It worked on an NVIDIA A100 80 GB, but not on an NVIDIA A40 48 GB.

FrankWhh commented 1 year ago

model.gradient_checkpointing_enable()
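For context, a minimal sketch of where that call goes in a Transformers fine-tuning script; the checkpoint name is a placeholder and the exact arguments in finetune.py may differ:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigcode/starcoderbase")  # placeholder checkpoint
model.gradient_checkpointing_enable()  # recompute activations in the backward pass to save memory
model.config.use_cache = False         # the generation cache is incompatible with checkpointing

Note that this reduces memory during training; the OOM above happens while bitsandbytes rebuilds the 8-bit weight layout for state_dict(), so it may not be enough on its own.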

esko22 commented 1 year ago

Got this to run on NVIDIA A100-SXM4-40GB thanks to @MHketbi

After changing device_map={"": Accelerator().process_index} to device_map='auto', the checkpoints saved without any issues. Accelerator().process_index was returning 0, which I guess was keeping everything pinned to GPU 0 instead of letting Accelerate do its magic.
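For context, a rough sketch of the two loading variants being compared; the checkpoint name and 8-bit arguments are illustrative and may not match finetune.py exactly:

from accelerate import Accelerator
from transformers import AutoModelForCausalLM

# Variant in the script: pin the whole model to this process's GPU (index 0 on a single GPU).
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoderbase",          # placeholder checkpoint
    load_in_8bit=True,
    device_map={"": Accelerator().process_index},
)

# Change described above: let Accelerate place layers automatically (and offload if needed).
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoderbase",
    load_in_8bit=True,
    device_map="auto",
)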

foundten commented 1 year ago

model.gradient_checkpointing_enable

Did this help in your own case? If so, what GPU were you able to use, and what were you trying to do: fine-tuning or training from scratch?

ruchaa0112 commented 1 year ago

Hi @esko22 - I tried making the change to device_map='auto', but I am still getting the same error. I am using an NVIDIA A100-SXM4-40GB. Are you running the fine-tuning on multiple GPUs?

Traceback (most recent call last):
  File "finetune/finetune.py", line 408, in <module>
    main(args)
  File "finetune/finetune.py", line 401, in main
    run_training(args, train_dataset, eval_dataset)
  File "finetune/finetune.py", line 391, in run_training
    trainer.train()
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/transformers/trainer.py", line 1883, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/transformers/trainer.py", line 2195, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/transformers/trainer.py", line 2252, in _save_checkpoint
    self.save_model(output_dir, _internal_call=True)
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/transformers/trainer.py", line 2765, in save_model
    self._save(output_dir)
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/transformers/trainer.py", line 2823, in _save
    self.model.save_pretrained(
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/peft/peft_model.py", line 135, in save_pretrained
    output_state_dict = get_peft_model_state_dict(
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/peft/utils/save_and_load.py", line 32, in get_peft_model_state_dict
    state_dict = model.state_dict()
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  [Previous line repeated 4 more times]
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1815, in state_dict
    self._save_to_state_dict(destination, prefix, keep_vars)
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/bitsandbytes/nn/modules.py", line 336, in _save_to_state_dict
    self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 100, in undo_layout
    return outputs.reshape(rows, cols).contiguous()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 144.00 MiB (GPU 0; 39.59 GiB total capacity; 36.59 GiB already allocated; 88.19 MiB free; 38.31 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

esko22 commented 1 year ago

No - single GPU A100 on Colab

ruchaa0112 commented 1 year ago

No - single GPU A100 on Colab

@esko22 - Thank you for your reply. Were you using a token size of 1024?