theophilegervet opened this issue 1 year ago
It's interesting that it occurs during eval. I asked @jordiclive and he said that he has trained several llama LoRA models in fp16, including 7B. If you want to debug this issue and investigate the cause, you could set `eval_steps` to 1 in the configuration.
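(In plain `transformers` terms, and assuming the YAML config is ultimately mapped onto `TrainingArguments`, the override would look roughly like the sketch below; the argument names are the HF API's, not necessarily the Open-Assistant config keys.)

```python
from transformers import TrainingArguments

# Hypothetical debug override: evaluate after every optimizer step so the
# eval-time failure reproduces immediately instead of late in the run.
args = TrainingArguments(
    output_dir="debug-eval",
    evaluation_strategy="steps",  # evaluate on a step interval...
    eval_steps=1,                 # ...and make that interval a single step
    fp16=True,                    # keep the fp16 setting that triggers the error
)
```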
@theophilegervet Yes, that is strange. I didn't encounter this error when training 7B (`decapoda-research/llama-7b-hf`) or 13B instead of `openlm-research/open_llama_13b` with fp16.
If you set `eval_steps` to 1 and change the dataset, does it still occur? Perhaps also try with `peft==0.3.0`.
Thank you @jordiclive! `peft==0.3.0` fixes the issue with `lora-llama-13b` and `openlm-research/open_llama_13b`.
I still have the issue with `llama-7b`, though. `decapoda-research/llama-7b-hf` gives `ValueError: Tokenizer class LLaMATokenizer does not exist or is not currently imported.`, so I'm using `huggyllama/llama-7b` instead.
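(For reference, the workaround I've seen for the `LLaMATokenizer` casing problem, assuming a `transformers` version that ships `LlamaTokenizer`, is to bypass `AutoTokenizer` and load the concrete class directly:)

```python
from transformers import LlamaTokenizer

# decapoda-research/llama-7b-hf's tokenizer_config.json still references the old
# "LLaMATokenizer" class name, which AutoTokenizer can no longer resolve.
# Loading the concrete tokenizer class sidesteps the Auto lookup entirely.
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
```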
I get the following error:
```
Traceback (most recent call last):
  File "/home/tgervet/Open-Assistant/model/model_training/trainer_sft.py", line 477, in <module>
    main()
  File "/home/tgervet/Open-Assistant/model/model_training/trainer_sft.py", line 471, in main
    trainer.train(resume_from_checkpoint=training_conf.resume_from_checkpoint)
  File "/home/tgervet/miniconda3/envs/open-assistant/lib/python3.10/site-packages/transformers/trainer.py", line 1532, in train
    return inner_training_loop(
  File "/home/tgervet/miniconda3/envs/open-assistant/lib/python3.10/site-packages/transformers/trainer.py", line 1863, in _inner_training_loop
    self.accelerator.clip_grad_norm_(
  File "/home/tgervet/miniconda3/envs/open-assistant/lib/python3.10/site-packages/accelerate/accelerator.py", line 1925, in clip_grad_norm_
    self.unscale_gradients()
  File "/home/tgervet/miniconda3/envs/open-assistant/lib/python3.10/site-packages/accelerate/accelerator.py", line 1888, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/home/tgervet/miniconda3/envs/open-assistant/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 284, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
  File "/home/tgervet/miniconda3/envs/open-assistant/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 212, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
```
This happens with both `use_flash_attention: true` and `use_flash_attention: false`.
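My working hypothesis for the error (not verified against trainer_sft.py): the Trainer's `GradScaler`, enabled by `fp16: true`, refuses to unscale gradients that are themselves fp16, so any parameter that is both trainable and stored in fp16 triggers it. Below is a minimal sketch of the usual workaround, upcasting only the trainable parameters to fp32; note that for a non-LoRA model this amounts to full fp32 weights, so it is only cheap when adapters are the trainable part.

```python
import torch

def upcast_trainable_params(model: torch.nn.Module) -> torch.nn.Module:
    """Upcast trainable parameters to fp32 so GradScaler.unscale_ never sees
    fp16 gradients. Cheap for LoRA adapters; for a fully-trainable model this
    is equivalent to loading the weights in fp32."""
    for param in model.parameters():
        if param.requires_grad and param.dtype == torch.float16:
            param.data = param.data.float()
    return model
```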
I think I need to address this issue because I'm trying to train a reward model with `python trainer_rm.py --configs defaults_rm oasst-rm-1-pythia-6.9b --wandb-entity tgervet` and get the same error there:
```
Traceback (most recent call last):
  File "/home/tgervet/Open-Assistant/model/model_training/trainer_rm.py", line 334, in <module>
    main()
  File "/home/tgervet/Open-Assistant/model/model_training/trainer_rm.py", line 328, in main
    trainer.train(resume_from_checkpoint=training_conf.resume_from_checkpoint)
  File "/home/tgervet/miniconda3/envs/open-assistant/lib/python3.10/site-packages/transformers/trainer.py", line 1639, in train
    return inner_training_loop(
  File "/home/tgervet/miniconda3/envs/open-assistant/lib/python3.10/site-packages/transformers/trainer.py", line 1939, in _inner_training_loop
    self.scaler.unscale_(self.optimizer)
  File "/home/tgervet/miniconda3/envs/open-assistant/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 284, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
  File "/home/tgervet/miniconda3/envs/open-assistant/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 212, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
```
Replacing `dtype: fp16` with `dtype: fp32` gives an OOM error.
Could you please share your environment so I can debug the delta?
I've tried following the updated environment you provided (`bitsandbytes==0.41.0`, `deepspeed==0.10.0`, `peft==0.4.0`, `transformers==4.31.0`, `flash-attn==2.0.0.post1`), but I still hit the same issue.
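(Quick sanity check I'm running to confirm the pins actually landed in the active environment, in case a stale install is the delta; the distribution names below are the PyPI ones and may differ locally:)

```python
import importlib.metadata as md

for pkg in ["bitsandbytes", "deepspeed", "peft", "transformers", "flash-attn", "torch"]:
    try:
        print(f"{pkg}=={md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not found under this distribution name")
```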
I saw a similar error to the one you described when running without deepspeed. To run with deepspeed you need to replace `python` on the command line with `deepspeed`, e.g. `deepspeed trainer_sft.py --configs rope_scaling_test --deepspeed`. Could you please try this?
With the following deepspeed command, `deepspeed trainer_sft.py --configs llama-7b webgpt_dataset_only --deepspeed`, I get an OOM error on a 40GB A100 (even with batch size 1 and sequence length 128):
```
Traceback (most recent call last):
  File "/home/tgervet/Open-Assistant/model/model_training/trainer_sft.py", line 477, in <module>
    main()
  File "/home/tgervet/Open-Assistant/model/model_training/trainer_sft.py", line 471, in main
    trainer.train(resume_from_checkpoint=training_conf.resume_from_checkpoint)
  File "/home/tgervet/miniconda3/envs/open-assistant/lib/python3.10/site-packages/transformers/trainer.py", line 1532, in train
    return inner_training_loop(
  File "/home/tgervet/miniconda3/envs/open-assistant/lib/python3.10/site-packages/transformers/trainer.py", line 1655, in _inner_training_loop
    model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/home/tgervet/miniconda3/envs/open-assistant/lib/python3.10/site-packages/accelerate/accelerator.py", line 1198, in prepare
    result = self._prepare_deepspeed(*args)
  File "/home/tgervet/miniconda3/envs/open-assistant/lib/python3.10/site-packages/accelerate/accelerator.py", line 1537, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/home/tgervet/miniconda3/envs/open-assistant/lib/python3.10/site-packages/deepspeed/__init__.py", line 171, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/tgervet/miniconda3/envs/open-assistant/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 310, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/tgervet/miniconda3/envs/open-assistant/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1209, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/home/tgervet/miniconda3/envs/open-assistant/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1444, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/home/tgervet/miniconda3/envs/open-assistant/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 346, in __init__
    self.device).clone().float().detach())
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 25.10 GiB (GPU 0; 39.42 GiB total capacity; 25.13 GiB already allocated; 13.67 GiB free; 25.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
It seems like deepspeed is trying to build a float32 copy of the parameters? This might explain why I was getting the float16 error without deepspeed.
@jordiclive Were you training with or without deepspeed?
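Back-of-the-envelope check on that allocation (assuming LLaMA-7B's roughly 6.74B parameters): the 25.10 GiB DeepSpeed tries to allocate is almost exactly one fp32 copy of the weights, which fits the theory that ZeRO keeps fp32 master copies for its optimizer.

```python
# LLaMA-7B has roughly 6.74e9 parameters; an fp32 master copy costs 4 bytes each.
params = 6.74e9
fp32_copy_gib = params * 4 / 2**30
print(f"{fp32_copy_gib:.2f} GiB")  # ~25.1 GiB, matching the failed allocation
```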
@andreaskoepf @jordiclive I'm not sure how to proceed. Supervised fine-tuning of the `lora-llama-13b` model works fine for me on a 40GB A100; the float16 error only appears for non-LoRA models. Maybe we could set up reward model training with LoRA too?
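If LoRA for the reward model sounds reasonable, here is a rough sketch of what I have in mind with peft. The base model and `target_modules` below are stand-ins (the RM config actually uses pythia-6.9b, whose attention projection is `query_key_value`), and none of this is wired into trainer_rm.py yet.

```python
import torch
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

# Load a reward-model backbone in fp16 so it fits on a 40GB A100.
base = AutoModelForSequenceClassification.from_pretrained(
    "huggyllama/llama-7b", num_labels=1, torch_dtype=torch.float16
)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # LLaMA attention projections (guess, untuned)
)
model = get_peft_model(base, lora_config)

# Upcast only the small set of trainable LoRA parameters to fp32 so the
# Trainer's GradScaler can unscale their gradients (see the fp16 error above).
for p in model.parameters():
    if p.requires_grad:
        p.data = p.data.float()

model.print_trainable_parameters()
```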
While running supervised fine-tuning with
and the following config
training runs fine but evaluation raises the following error (at the first eval step):
with environment
Any idea what could be causing this and how to fix it?