Closed: Actmaiji closed this issue 1 month ago
Hi, this may happen under different configurations, and we do not have a perfect solution. I suggest you try the following:
- check the data integrity
- lower the learning rate
- set `grad_clip` to a smaller value
- apply EMA (we have never tried this)
- change the code to skip NaN batches
- (if none of the above helps) load the checkpoint from before the NaN and continue

You can apply any trick you know to stabilize training. Your training should be fine as long as the loss does not keep increasing or become NaN.
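For reference, the `grad_clip` suggestion refers to global gradient-norm clipping (in PyTorch, `torch.nn.utils.clip_grad_norm_`). As a rough illustration of what a smaller max norm does to the update, here is a minimal pure-Python sketch of the clipping rule (illustrative only, not the repo's actual code):

```python
import math

def clip_grad_norm(grads, max_norm):
    """Scale grads so their global L2 norm is at most max_norm.

    Mirrors the idea of torch.nn.utils.clip_grad_norm_: compute the
    total norm over all gradients, and rescale every gradient by the
    same factor when the norm exceeds the threshold.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# With max_norm=0.5, a gradient vector [3.0, 4.0] (norm 5.0)
# is scaled down to [0.3, 0.4]; a spike cannot blow up the step.
```

A smaller `grad_clip` bounds the step size more aggressively, which is why it often delays or prevents the loss diverging to NaN.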
Yep:
1. The loss becomes NaN at epoch 6. I'm not familiar with the datasets, so I'm not sure whether this means all the data is alright.
2. The learning rate is set to 4e-6.
3. `grad_clip=0.5`.
4. Haven't tried EMA.
5. I changed the code to skip NaN batches:

```python
loss = loss.mean()
if not loss.isfinite():
    logging.warning(timesteps)
    loss = torch.tensor(0.001, device=model_pred.device)
    self.optimizer.zero_grad(
        set_to_none=self.cfg.runner.set_grads_to_none)
```

But I encountered a timing-out issue:
```
05 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800439 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800091 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809900 milliseconds before timing out.
```
Also, the process core dumped; I am trying to get some information from the core files.
Hi,
It seems your data is correct. The error occurs because your processes are not well synchronized. To skip a NaN batch safely, you need to perform communication so that all processes know about the NaN; skipping on only one process will not work.
I strongly recommend you restart/resume with a different random seed.
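The synchronization described above can be sketched as follows: every rank turns "my local loss is non-finite" into a flag and all-reduces it with MAX, so all ranks reach the same skip decision and none hangs waiting in a collective. This is a minimal pure-Python mock of the logic; in real training, the hypothetical `all_reduce_max` stand-in would be `torch.distributed.all_reduce(flag, op=ReduceOp.MAX)` on a tensor:

```python
import math

def all_reduce_max(local_flags):
    # Stand-in for dist.all_reduce(flag, op=ReduceOp.MAX):
    # after the collective, every rank holds the maximum flag value.
    m = max(local_flags)
    return [m] * len(local_flags)

def should_skip_batch(per_rank_losses):
    # Each rank sets flag=1 if its local loss is NaN/Inf, else 0.
    flags = [0 if math.isfinite(l) else 1 for l in per_rank_losses]
    # After the MAX all-reduce, every rank sees the same decision:
    # either all ranks skip the batch or none do, so no rank is left
    # stuck in an ALLREDUCE that the others never enter (the NCCL
    # watchdog timeout seen above).
    reduced = all_reduce_max(flags)
    return [f == 1 for f in reduced]
```

For example, if one of four ranks hits a NaN loss, all four ranks agree to skip that batch together.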
```python
down_block_res_samples, mid_block_res_sample, \
    encoder_hidden_states_with_cam = self.controlnet(
        noisy_latents,              # b, N_cam, 4, H/8, W/8
        timesteps,                  # b
        camera_param=camera_param,  # b, N_cam, 189
        encoder_hidden_states=encoder_hidden_states,  # b, len, 768
        encoder_hidden_states_uncond=encoder_hidden_states_uncond,  # 1, len, 768
        controlnet_cond=controlnet_image,  # b, 26, 200, 200
        return_dict=False,
        **kwargs,
    )
```

from `magicdrive/runner/multiview_runner.py`
Hey, I find that `mid_block_res_sample` is inf, and its dtype is `torch.float16`. Would you have any advice about this bug?
NaN problems are known to appear in many models, depend heavily on your initialization and training process, and are hard to reproduce. I admit an implementation error could be one of the causes. However, given that all of our training runs are fine, I think our code is fine, so I do not suggest trying to solve this by debugging it.
Anyway, if you do find that some operator is buggy, we would be very grateful if you reported the exact issue. Thanks.
Could it be that the data has exceeded the representable range of fp16, leading to overflow?
Yes, that's what I mean.
It may happen during the optimization process, so clipping the grad norm can be a straightforward solution (the other solutions I listed above work similarly). Hopefully there is no real bug in the AMP module, since it is provided by the training framework.
Another possible solution is to use bf16, but we do not have time for migration and testing.
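To make the overflow point concrete: IEEE-754 float16 tops out at 65504, so any intermediate activation larger than that (e.g. the `mid_block_res_sample` discussed above) becomes inf, whereas bf16 keeps float32's exponent range by trading away mantissa bits. A small stdlib-only sanity check of the limit (illustrative, not code from the repo):

```python
import struct

# Largest finite float16: (2 - 2**-10) * 2**15 = 65504
FP16_MAX = (2 - 2 ** -10) * 2 ** 15

def fits_in_fp16(x):
    """True if x can be stored in float16 without overflowing to inf."""
    return abs(x) <= FP16_MAX

def to_fp16(x):
    # Round-trip through the IEEE-754 half-precision format ('e')
    # to see what value a float16 tensor would actually hold.
    return struct.unpack('<e', struct.pack('<e', x))[0]
```

An activation of 60000 still fits in fp16, while 70000 does not; in a half-precision forward pass, the latter would already have become inf.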
Hey, could anyone give me a Docker image with the Python environment for V100? My V100 server cannot connect to the internet. Thanks a lot.
You may watch your new issue :)
```
Steps:   0%|          | 4680/1172500 [3:15:04<781:36:43, 2.41s/it, loss=0.0968, lr0=4e-5]
Error executing job with overrides: ['+exp=224x400', 'runner=4gpus']
Traceback (most recent call last):
  File "tools/train.py", line 110, in main
    runner.run()
  File "/data/**./magicdrive/runner/base_runner.py", line 352, in run
    raise RuntimeError('Your loss is NaN.')
RuntimeError: Your loss is NaN.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Steps:   0%|          | 4734/1172500 [3:18:08<814:35:08, 2.51s/it, loss=0.294, lr0=4e-5]
```
Hey, I trained the model on 4 V100 GPUs with batch_size=3 and a learning rate of 4e-5, but I got this error. Has anyone encountered it before? Thanks a lot.