running `runs/parallel_ft_lora.sh`gives overflow

hndrstwn commented 5 months ago

Hi, I am trying to parallel_ft_lora.sh without modification, but I get the overflow error from deepspeed. Here's the snippets of message:

[INFO|trainer.py:1721] 2024-02-01 16:50:44,694 >> ***** Running training *****                       
[INFO|trainer.py:1722] 2024-02-01 16:50:44,694 >>   Num examples = 117,404                                       
[INFO|trainer.py:1723] 2024-02-01 16:50:44,694 >>   Num Epochs = 1                              
[INFO|trainer.py:1724] 2024-02-01 16:50:44,694 >>   Instantaneous batch size per device = 4                 
[INFO|trainer.py:1727] 2024-02-01 16:50:44,694 >>   Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:1728] 2024-02-01 16:50:44,694 >>   Gradient Accumulation steps = 4                            
[INFO|trainer.py:1729] 2024-02-01 16:50:44,694 >>   Total optimization steps = 917                          
[INFO|trainer.py:1730] 2024-02-01 16:50:44,696 >>   Number of trainable parameters = 7,733,248
  0%|                                                                                                                                                                                                                                                                                                                                             | 0/917 [00:00<?, ?it/s]
[2024-02-01 16:50:47,313] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
  0%|▎

all the way to


[2024-02-01 16:51:58,737] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2, reducing to 1
  5%|██████████████▊                                                                                                                                                                                                                                                                                                                     | 42/917 [01:14<25:24,  1.74s/it]
Traceback (most recent call last):                                                                                                                                                                                                                                                                                                                                        
  File "/mnt/task_runtime/ALMA/run_llmmt.py", line 223, in <module>                                  
    main()                                                                                                      
  File "/mnt/task_runtime/ALMA/run_llmmt.py", line 172, in main                                                                                         
    train_result = trainer.train(resume_from_checkpoint=checkpoint)                                                                                                                                                          
  File "/miniforge/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train                     
    return inner_training_loop(                                                                                  
  File "/miniforge/lib/python3.10/site-packages/transformers/trainer.py", line 1869, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)                                          
  File "/miniforge/lib/python3.10/site-packages/transformers/trainer.py", line 2777, in training_step               
    self.accelerator.backward(loss)                                                  
  File "/miniforge/lib/python3.10/site-packages/accelerate/accelerator.py", line 1958, in backward          
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)               
  File "/miniforge/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 176, in backward                
    self.engine.step()                                                                       
  File "/miniforge/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2157, in step                 
    self._take_model_step(lr_kwargs)                                                                        
  File "/miniforge/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2063, in _take_model_step
    self.optimizer.step()                                                                                                                                                                                                                                                                                                                                                 
  File "/miniforge/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1799, in step                            
    self._update_scale(self.overflow)                                                             
  File "/miniforge/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2050, in _update_scale
    self.loss_scaler.update_scale(has_overflow)                                                      
  File "/miniforge/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale
    raise Exception(                                                                            
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run. ```

Can anyone help? Also, I notice that the number of parameters is 7.7B instead of `--lora_rank 16` and `--use_peft`.

fe1ixxu commented 5 months ago

Hi, thanks for the interest!

Some points in my mind that could help you to fix the overflow issue:

Try bf16 rather than fp16
uninstall the transformers and reinstall it by pip install git+https://github.com/fe1ixxu/ALMA.git@alma-r-hf

As for number of parameters, I believe you may misread it: 7,733,248 is 7 million 😊

Let me know if you have further questions. Have a lovely day!

hndrstwn commented 4 months ago

Thanks! Clean install works!

fe1ixxu / ALMA

running `runs/parallel_ft_lora.sh`gives overflow #25