Open ItsThanhTung opened 7 months ago
Update: I trained the small version, and everything is fine.
Hi,
I also encountered NaN loss during training (especially when testing fp16 training), but the final configuration (2x 80GB A100 GPUs with the configured learning rate and batch size) worked for me without any NaNs. You may have to change a few things in the training pipeline to make it work on your system:
Try adjusting max_grad_norm to a lower value (see the sketch after the snippet below).
A fix that always worked for me (but is a bit unsatisfying) is to just set NaN gradients to zero. Add this right before optimizer.step() is called:
# Replace NaN entries in the gradients with zeros before the optimizer step.
for p in unet.parameters():
    if p.grad is not None:
        p.grad.nan_to_num_()
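For reference, here is a minimal sketch of where both fixes could sit in an Accelerate-style training step. The names unet, optimizer, loss, and the default max_grad_norm value are placeholders for whatever the actual training script uses, not the exact ViewDiff code:

import torch
from accelerate import Accelerator

def training_step(accelerator: Accelerator, unet: torch.nn.Module,
                  optimizer: torch.optim.Optimizer, loss: torch.Tensor,
                  max_grad_norm: float = 1e-3) -> None:
    # Backward pass through Accelerate so fp16/DDP gradient handling applies.
    accelerator.backward(loss)

    # Fix 1: replace NaN gradients with 0 (and +/-inf with large finite
    # values) so one bad batch does not poison the optimizer state.
    for p in unet.parameters():
        if p.grad is not None:
            p.grad.nan_to_num_()

    # Fix 2: clip the global gradient norm; lowering max_grad_norm is the
    # "lower value" suggestion from the first point above.
    if accelerator.sync_gradients:
        accelerator.clip_grad_norm_(unet.parameters(), max_grad_norm)

    optimizer.step()
    optimizer.zero_grad()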
I am still facing the same issue with train.sh (41GB of GPU memory). Also, is it normal for training to run this slowly? I increased num_worker=32 and set max_grad_norm=5e-4, but I am still getting NaN loss:
03/22/2024 13:58:27 - INFO - __main__ - Running training...
Steps: 0%| | 0/1572000 [00:45<?, ?it/s, lr=5e-5, step_loss=0.185]
/home/nthanh/miniconda3/envs/viewdiff/lib/python3.10/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1280, 1280]
bucket_view.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
/home/nthanh/miniconda3/envs/viewdiff/lib/python3.10/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1280, 1280]
bucket_view.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Steps: 0%| | 0/1572000 [00:49<?, ?it/s, lr=5e-5, step_loss=0.22]
03/22/2024 13:59:02 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
03/22/2024 13:59:02 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
Steps: 0%| | 25/1572000 [05:52<5052:48:15, 11.57s/it, lr=5e-5, step_loss=nan]
I would suggest trying num_workers=0 and also including the other fix from my message above.
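In case it is useful, a minimal sketch of the dataloader change; the TensorDataset here is only a stand-in so the snippet runs on its own, not the actual ViewDiff/CO3D dataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset so the sketch is self-contained.
dataset = TensorDataset(torch.randn(8, 3, 64, 64))

# num_workers=0 loads batches in the main process, which removes
# worker-related stalls and makes the NaN issue easier to debug.
train_dataloader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=0)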
We've followed your training instructions to avoid NaN loss, but now we're encountering exploding gradients after 20k training steps. Will you be releasing the pretrained model? It would greatly help us reproduce the results reported in the paper.
We do not release the weights because of licensing issues. I'd be happy to help with any reproduction issues. How exactly did you get the exploding gradients? I never encountered them, and I think that is exactly what the max_grad_norm parameter is meant to prevent.
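To pin down where the gradients start to explode, one option (a sketch, not part of the released code) is to log the pre-clipping gradient norm every step; torch.nn.utils.clip_grad_norm_ returns the total norm it measured before clipping, so the check is essentially free. The names unet, step, and the warn_above threshold are placeholders:

import math
import torch

def clip_and_check(unet: torch.nn.Module, max_grad_norm: float,
                   step: int, warn_above: float = 10.0) -> float:
    # clip_grad_norm_ returns the global norm computed *before* clipping,
    # so it doubles as a free diagnostic for exploding gradients.
    total_norm = float(torch.nn.utils.clip_grad_norm_(unet.parameters(), max_grad_norm))
    if not math.isfinite(total_norm) or total_norm > warn_above:
        print(f"step {step}: pre-clip grad norm {total_norm:.3e} "
              f"(clipped to {max_grad_norm})")
    return total_norm

Call this right before optimizer.step(); if the reported norm grows steadily before the loss turns NaN, lowering max_grad_norm or the learning rate is the usual next step.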
Hi team, thanks for sharing this great work. I have a problem when training with train.sh on a 40GB A100: I set batch_size=2 and gradient_accumulation_steps=16 with LR=5e-5 and 2.5e-5, and the training loss becomes NaN for both learning rates.
Do you have any suggestions? Thanks!