cure-lab / MagicDrive

[ICLR24] Official implementation of the paper “MagicDrive: Street View Generation with Diverse 3D Geometry Control”
https://gaoruiyuan.com/magicdrive/
GNU Affero General Public License v3.0

Your Loss is NaN #54

Closed Actmaiji closed 1 month ago

Actmaiji commented 2 months ago

```
Steps: 0%| | 4680/1172500 [3:15:04<781:36:43, 2.41s/it, loss=0.0968, lr0=4e-5]
Error executing job with overrides: ['+exp=224x400', 'runner=4gpus']
Traceback (most recent call last):
  File "tools/train.py", line 110, in main
    runner.run()
  File "/data/**./magicdrive/runner/base_runner.py", line 352, in run
    raise RuntimeError('Your loss is NaN.')
RuntimeError: Your loss is NaN.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Steps: 0%| | 4734/1172500 [3:18:08<814:35:08, 2.51s/it, loss=0.294, lr0=4e-5]
```

Hey, I trained the model on 4 V100 GPUs with batch_size=3 and learning rate 4e-5, but I got this error. Has anyone encountered it before? Thanks a lot.

flymin commented 2 months ago

Hi, this may happen under different configurations, but we do not have a perfect solution. I suggest you try:

  1. check the data integrity
  2. lower the learning rate
  3. set grad_clip to a smaller value (a minimal sketch follows below)
  4. apply EMA (we never tried this in any case)
  5. change the code to skip NaN batches
  6. (if none of the above helps) load the checkpoint from before the NaN and continue

You can apply any trick you know to stabilize the training. Your training should be fine as long as the loss does not keep increasing or become NaN.
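For item 3, here is a minimal sketch of gradient-norm clipping in a plain PyTorch fp16 loop with a GradScaler; the names (model, optimizer, scaler, max_grad_norm) are illustrative and not MagicDrive's actual runner or config keys:

```python
import torch

# Illustrative only; not MagicDrive's actual runner. A smaller max_grad_norm
# clips more aggressively, which reduces the chance that one bad batch pushes
# the fp16 weights/activations out of range, at the cost of slower updates.
def clipped_step(model, optimizer, scaler, loss, max_grad_norm=0.5):
    scaler.scale(loss).backward()
    # Gradients must be unscaled before clipping so the norm is meaningful.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)
```

Lowering max_grad_norm trades convergence speed for stability; combined with a lower learning rate it often prevents the half-precision overflow that eventually shows up as a NaN loss.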

Actmaiji commented 1 month ago


Yep:

  1. Since the loss becomes NaN at epoch 6, and I am not familiar with the dataset, I am not sure whether this means all the data is alright.
  2. For the learning rate, I have set it to 4e-6.
  3. grad_clip = 0.5.
  4. Haven't done this.
  5. I changed the code to skip NaN batches; my code is:

```python
loss = loss.mean()
if not loss.isfinite():
    logging.warning(timesteps)
    loss = torch.tensor(0.001, device=model_pred.device)
    self.optimizer.zero_grad(
        set_to_none=self.cfg.runner.set_grads_to_none)
```

but I encountered a timeout issue:

```
05 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800439 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800091 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809900 milliseconds before timing out.
```

Also, the process core dumped; I am trying to get some information from the core files.

flymin commented 1 month ago

Hi,

It seems your data is correct. The error occurs because your processes are not well synchronized: to skip a NaN batch safely, you need to perform communication so that all processes know about the NaN. Skipping on only one process cannot work.

I strongly recommend you restart/resume with a different random seed.
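To make the skipping consistent across processes, here is a minimal sketch assuming torch.distributed is already initialized (as in a DDP launch); loss_is_bad_on_any_rank and the commented loop are hypothetical, not code from this repo:

```python
import torch
import torch.distributed as dist

def loss_is_bad_on_any_rank(loss: torch.Tensor) -> bool:
    """Return True on every rank if any rank saw a non-finite loss."""
    flag = torch.tensor(
        0.0 if torch.isfinite(loss).all() else 1.0, device=loss.device
    )
    # MAX across ranks: every process sees 1.0 if any process hit NaN/inf.
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return flag.item() > 0

# Inside the training loop (sketch):
#     loss = loss.mean()
#     if loss_is_bad_on_any_rank(loss):
#         optimizer.zero_grad(set_to_none=True)
#         continue  # all ranks skip together, so DDP's all-reduce stays aligned
#     loss.backward()
#     optimizer.step()
```

Skipping on a single rank only, as in the snippet further up, leaves the other ranks blocked in all-reduce, which matches the 30-minute NCCL watchdog timeout in the log.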

Actmaiji commented 1 month ago

```python
down_block_res_samples, mid_block_res_sample, \
encoder_hidden_states_with_cam = self.controlnet(
    noisy_latents,  # b, N_cam, 4, H/8, W/8
    timesteps,  # b
    camera_param=camera_param,  # b, N_cam, 189
    encoder_hidden_states=encoder_hidden_states,  # b, len, 768
    encoder_hidden_states_uncond=encoder_hidden_states_uncond,  # 1, len, 768
    controlnet_cond=controlnet_image,  # b, 26, 200, 200
    return_dict=False,
    **kwargs,
)
```

from magicdrive/runner/multiview_runner.py

Hey, I find that mid_block_res_sample is inf and its dtype is torch.float16. Would you have any advice for this bug?
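One way to narrow down where the fp16 values first blow up is to register forward hooks that flag non-finite outputs. A hedged sketch; install_nan_hooks and the controlnet example are illustrative, not an API of this repo:

```python
import torch

def install_nan_hooks(module: torch.nn.Module, prefix: str = ""):
    """Register forward hooks that report non-finite module outputs."""
    def make_hook(name):
        def hook(mod, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for i, out in enumerate(outs):
                if torch.is_tensor(out) and not torch.isfinite(out).all():
                    print(f"[non-finite] {name} output[{i}]: "
                          f"dtype={out.dtype}, "
                          f"abs max={out.float().abs().max().item():.3e}")
        return hook
    for name, sub in module.named_modules():
        sub.register_forward_hook(make_hook(prefix + name))

# e.g. (illustrative): install_nan_hooks(self.controlnet, prefix="controlnet.")
# The first module reported is where values exceed fp16's ~65504 maximum,
# e.g. where mid_block_res_sample becomes inf.
```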

flymin commented 1 month ago

The NaN problem is known to appear in many models; it depends heavily on your initialization and training process and is hard to reproduce. I admit that an implementation error could be one of the reasons. However, given that all of our training runs were fine, I think our code is fine. Therefore, I do not suggest you try to solve this issue by debugging it.

Anyway, if you do find that some operator is buggy, we would be very grateful if you reported the exact issue. Thanks.

Actmaiji commented 1 month ago

Could it be that the values have exceeded the representable range of fp16, leading to overflow?

flymin commented 1 month ago

Yes, that's what I mean.

It may happen during the optimization process, so clipping the grad norm may be a straightforward solution (the other solutions I listed above work in a similar spirit). I hope there is no real bug in the AMP module, since it is provided by the training framework.

Another possible solution is to use bf16, but we did not have time for the migration and testing.
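For reference, in a plain torch.autocast loop the fp16-to-bf16 switch is mostly the dtype argument. A hedged sketch, not MagicDrive's actual training framework; note that bf16 is only hardware-accelerated on Ampere or newer GPUs, so it is of limited use on the V100s discussed here:

```python
import torch

# Illustrative sketch only. bf16 shares fp32's exponent range, so activations
# such as mid_block_res_sample are far less likely to overflow to inf, and no
# GradScaler / loss scaling is needed.
def forward_mixed_precision(model, batch, use_bf16: bool = False):
    dtype = torch.bfloat16 if use_bf16 else torch.float16
    with torch.autocast(device_type="cuda", dtype=dtype):
        return model(batch)
```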

Actmaiji commented 1 month ago

Hey, could anyone give me a Docker image of the Python environment for V100? My V100 server cannot connect to the internet. Thanks a lot.

flymin commented 1 month ago

Please follow up in your new issue :)