seyyedaliayati opened this issue 1 year ago
https://github.com/MHketbi/starcoder
try my fork
same error here with A40.
Worked on NVIDIA A100 80 GB, but not on NVIDIA A40 48 GB
model.gradient_checkpointing_enable()
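In case it helps others, here is a minimal sketch of where that call might sit when loading the model for fine-tuning (the model id and surrounding arguments are illustrative, not the exact finetune.py code):

```python
from transformers import AutoModelForCausalLM

# Illustrative checkpoint id; substitute whatever model you are fine-tuning.
model = AutoModelForCausalLM.from_pretrained("bigcode/starcoderbase")

# Recompute activations during the backward pass instead of keeping them in
# GPU memory; this trades extra compute for a lower peak memory footprint.
model.gradient_checkpointing_enable()

# use_cache is incompatible with gradient checkpointing during training,
# so it is typically turned off alongside it.
model.config.use_cache = False
```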
Got this to run on NVIDIA A100-SXM4-40GB thanks to @MHketbi
After changing device_map={"": Accelerator().process_index} to device_map='auto', the checkpoints saved without any issues. Accelerator().process_index was returning 0, which I guess was causing it to stay on the GPU and not let Accelerate do its magic.
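For anyone hitting the same thing, a rough sketch of what that change looks like in the from_pretrained call (the model id and load_in_8bit here are illustrative; adjust to match your copy of finetune.py):

```python
from transformers import AutoModelForCausalLM

# Original behaviour: pin the entire model to this process's GPU, i.e.
# device_map = {"": Accelerator().process_index}

# Workaround from this thread: let Accelerate place (and, if necessary,
# offload) the weights instead of forcing everything onto GPU 0.
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoderbase",   # illustrative model id
    load_in_8bit=True,         # as in the 8-bit fine-tuning setup
    device_map="auto",
)
```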
model.gradient_checkpointing_enable()
Does this help you? If yes, what GPU were you able to use? And what were you trying to do: fine-tune or train from scratch?
Hi @esko22 - I tried making the following change: device_map='auto'. However, I am still getting the same error. I am using an NVIDIA A100-SXM4-40GB. Are you running the fine-tuning on multi-GPU?
Traceback (most recent call last):
  File "finetune/finetune.py", line 408, in <module>
    main(args)
  File "finetune/finetune.py", line 401, in main
    run_training(args, train_dataset, eval_dataset)
  File "finetune/finetune.py", line 391, in run_training
    trainer.train()
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/transformers/trainer.py", line 1883, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/transformers/trainer.py", line 2195, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/transformers/trainer.py", line 2252, in _save_checkpoint
    self.save_model(output_dir, _internal_call=True)
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/transformers/trainer.py", line 2765, in save_model
    self._save(output_dir)
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/transformers/trainer.py", line 2823, in _save
    self.model.save_pretrained(
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/peft/peft_model.py", line 135, in save_pretrained
    output_state_dict = get_peft_model_state_dict(
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/peft/utils/save_and_load.py", line 32, in get_peft_model_state_dict
    state_dict = model.state_dict()
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  [Previous line repeated 4 more times]
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1815, in state_dict
    self._save_to_state_dict(destination, prefix, keep_vars)
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/bitsandbytes/nn/modules.py", line 336, in _save_to_state_dict
    self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
  File "/raid/ansysai/ruchaa/projects/pymapdlAI/pymapdlft/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 100, in undo_layout
    return outputs.reshape(rows, cols).contiguous()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 144.00 MiB (GPU 0; 39.59 GiB total capacity; 36.59 GiB already allocated; 88.19 MiB free; 38.31 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
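Regarding the hint at the bottom of that traceback: one thing to try is setting the allocator config before CUDA is initialized. A sketch (the 128 MiB split size is only an example value, not something recommended in this thread):

```python
import os

# Must be set before the first CUDA allocation (easiest: at the very top of
# finetune.py, before importing torch, or via the shell environment).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after setting the env var on purpose
```

That said, this trace shows roughly 36.6 of the 39.6 GiB already allocated, so reducing fragmentation alone may not be enough; the device_map change discussed above (or a smaller sequence length / batch size) may still be needed.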
No - single GPU A100 on Colab
@esko22 - Thank you for your reply. Were you using a token size of 1024?
Howdy!

I am using the finetune/finetune.py script. It trains on an NVIDIA A40, but at the end, when it tries to save the model/checkpoints, it raises the torch.cuda.OutOfMemoryError: CUDA out of memory error. Here is a full traceback:
Any ideas what's happening and how to solve this issue?