Open ItsThanhTung opened 7 months ago
Update: I trained the small version, and everything is fine.
Hi,
I also encountered NaN loss during training (especially when testing fp16 training), but the final configuration (2x 80GB A100 GPUs with the configured learning rate and batch size) worked for me without any NaNs. You may have to change a few things in the training pipeline to make it work on your system:
Try adjusting max_grad_norm to a lower value (see the sketch after the snippet below).
A fix that always worked for me (but is a bit unsatisfying) is to just set NaN gradients to zero. Add this right before optimizer.step() is called:
# Replace NaN entries in the gradients with zeros before the optimizer step.
for p in unet.parameters():
    if p.grad is not None:
        p.grad.nan_to_num_()
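For reference, here is a minimal sketch of where both fixes could sit in an Accelerate-style training step. The names unet, optimizer, loss, and the default max_grad_norm value are placeholders for whatever the actual training script uses, not the exact ViewDiff code:

import torch
from accelerate import Accelerator

def training_step(accelerator: Accelerator, unet: torch.nn.Module,
                  optimizer: torch.optim.Optimizer, loss: torch.Tensor,
                  max_grad_norm: float = 1e-3) -> None:
    # Backward pass through Accelerate so fp16/DDP gradient handling applies.
    accelerator.backward(loss)

    # Fix 1: replace NaN gradients with 0 (and +/-inf with large finite
    # values) so one bad batch does not poison the optimizer state.
    for p in unet.parameters():
        if p.grad is not None:
            p.grad.nan_to_num_()

    # Fix 2: clip the global gradient norm; lowering max_grad_norm is the
    # "lower value" suggestion from the first point above.
    if accelerator.sync_gradients:
        accelerator.clip_grad_norm_(unet.parameters(), max_grad_norm)

    optimizer.step()
    optimizer.zero_grad()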
I am still facing the same issue with train.sh (41GB of GPU memory). Also, is it normal for training to run this slowly? I increased num_worker=32 and set max_grad_norm=5e-4, but I am still getting NaN loss:
03/22/2024 13:58:27 - INFO - __main__ - Running training...
Steps: 0%| | 0/1572000 [00:45<?, ?it/s, lr=5e-5, step_loss=0.185]
/home/nthanh/miniconda3/envs/viewdiff/lib/python3.10/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1280, 1280]
bucket_view.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
/home/nthanh/miniconda3/envs/viewdiff/lib/python3.10/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1280, 1280]
bucket_view.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Steps: 0%| | 0/1572000 [00:49<?, ?it/s, lr=5e-5, step_loss=0.22]
03/22/2024 13:59:02 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
03/22/2024 13:59:02 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
Steps: 0%| | 25/1572000 [05:52<5052:48:15, 11.57s/it, lr=5e-5, step_loss=nan]
I would suggest trying num_workers=0 and also including the other fix from my message above.
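In case it is useful, a minimal sketch of the dataloader change; the TensorDataset here is only a stand-in so the snippet runs on its own, not the actual ViewDiff/CO3D dataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset so the sketch is self-contained.
dataset = TensorDataset(torch.randn(8, 3, 64, 64))

# num_workers=0 loads batches in the main process, which removes
# worker-related stalls and makes the NaN issue easier to debug.
train_dataloader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=0)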
We've followed your training instructions to avoid NaN loss, but now we're encountering exploding gradients after 20k training steps. Will you be releasing the pretrained model? It would greatly help us reproduce the results reported in the paper.
We do not release the weights because of licensing issues. I'd be happy to help with any reproduction issues. How exactly did you get the exploding gradients? I never encountered them, and I think that is exactly what the max_grad_norm parameter is meant to prevent.
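To pin down where the gradients start to explode, one option (a sketch, not part of the released code) is to log the pre-clipping gradient norm every step; torch.nn.utils.clip_grad_norm_ returns the total norm it measured before clipping, so the check is essentially free. The names unet, step, and the warn_above threshold are placeholders:

import math
import torch

def clip_and_check(unet: torch.nn.Module, max_grad_norm: float,
                   step: int, warn_above: float = 10.0) -> float:
    # clip_grad_norm_ returns the global norm computed *before* clipping,
    # so it doubles as a free diagnostic for exploding gradients.
    total_norm = float(torch.nn.utils.clip_grad_norm_(unet.parameters(), max_grad_norm))
    if not math.isfinite(total_norm) or total_norm > warn_above:
        print(f"step {step}: pre-clip grad norm {total_norm:.3e} "
              f"(clipped to {max_grad_norm})")
    return total_norm

Call this right before optimizer.step(); if the reported norm grows steadily before the loss turns NaN, lowering max_grad_norm or the learning rate is the usual next step.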
Hi team, thanks for sharing this great work. I have a problem when training with train.sh on a 40GB A100: I set batch_size=2 and gradient_accumulation_steps=16 with LR=5e-5 and 2.5e-5, and the training loss becomes NaN for both learning rates.
Do you have any suggestions? Thanks!