a-r-r-o-w / cogvideox-factory

Memory optimized finetuning scripts for CogVideoX using TorchAO and DeepSpeed
Apache License 2.0

cannot access local variable 'gradient_norm_before_clip' where it is not associated with a value #41

Open Yuancheng-Xu opened 2 days ago

Yuancheng-Xu commented 2 days ago

During both I2V and T2V training, I sometimes encounter the following error:

[rank1]:   File "/root/projects/cogvideox-factory/training/cogvideox_text_to_video_lora.py", line 762, in main
[rank1]:     "gradient_norm_before_clip": gradient_norm_before_clip,
[rank1]:                                  ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: UnboundLocalError: cannot access local variable 'gradient_norm_before_clip' where it is not associated with a value

This probably comes from the following code:

if accelerator.sync_gradients:
    gradient_norm_before_clip = get_gradient_norm(transformer.parameters())
    accelerator.clip_grad_norm_(transformer.parameters(), args.max_grad_norm)
    gradient_norm_after_clip = get_gradient_norm(transformer.parameters())

Somehow, accelerator.sync_gradients is sometimes False.

Is there a quick fix? Is this gradient norm only used for logging?
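
For now, something like the following sketch would avoid the crash, if the norms really are only needed for logging (untested; logs, loss and global_step here are just placeholders for whatever the script actually builds, not the real logging code):

gradient_norm_before_clip = None
gradient_norm_after_clip = None

if accelerator.sync_gradients:
    gradient_norm_before_clip = get_gradient_norm(transformer.parameters())
    accelerator.clip_grad_norm_(transformer.parameters(), args.max_grad_norm)
    gradient_norm_after_clip = get_gradient_norm(transformer.parameters())

# only report the norms on steps where they were actually computed
logs = {"loss": loss.detach().item()}
if gradient_norm_before_clip is not None:
    logs["gradient_norm_before_clip"] = gradient_norm_before_clip
    logs["gradient_norm_after_clip"] = gradient_norm_after_clip
accelerator.log(logs, step=global_step)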

a-r-r-o-w commented 2 days ago

Hmm, that is very weird... It means that no gradient update happened for an entire epoch. I think that could be the case when training with a very small number of videos? How many videos were you using when this occurred, and what were your batch size and gradient accumulation steps?

Yuancheng-Xu commented 1 day ago

For example, with 2 GPUs the setup looks like the following:

***** Running training *****
  Num trainable parameters = 132120576
  Num examples = 128
  Num batches each epoch = 64
  Num epochs = 5
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient accumulation steps = 16
  Total optimization steps = 20

The same error occurs when using 8 GPUs:

***** Running training *****
  Num trainable parameters = 132120576
  Num examples = 128
  Num batches each epoch = 16
  Num epochs = 2
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient accumulation steps = 4
  Total optimization steps = 5

So I don't think it is an issue of having too few training samples?

a-r-r-o-w commented 18 hours ago

Do you see the error when using gradient_accumulation_steps=1? I haven't really experimented with higher gradient_accumulation_steps yet, so I might have missed something. Most of my experiments use train_batch_size=4 and 2/4 GPUs with 100 videos.

The script can be improved a little by only logging the gradient norm if it was computed (see the sketch below). However, I fail to see how no gradient step occurs with the configurations you shared. I will take a deeper look over the weekend.
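
Roughly something like this, just a sketch of the guard rather than the exact change (the logs dict and global_step are simplified stand-ins for what the script really does):

logs = {"loss": loss.detach().item()}
if accelerator.sync_gradients:
    # gradient_norm_before_clip / gradient_norm_after_clip only exist on sync steps
    logs["gradient_norm_before_clip"] = gradient_norm_before_clip
    logs["gradient_norm_after_clip"] = gradient_norm_after_clip
accelerator.log(logs, step=global_step)

That would only hide the symptom, though; it wouldn't explain why sync_gradients stays False with your configuration.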

Yuancheng-Xu commented 14 hours ago

Yep, I just confirmed that: everything works fine with gradient_accumulation_steps=1. As soon as it is greater than 1, the error occurs.

a-r-r-o-w commented 12 hours ago

Okay, thanks, that helps. I will try debugging over the weekend.