AssertionError: No inf checks were recorded for this optimizer.

SheldonTsui / GOF_NeurIPS2021

The codebase for our paper "Generative Occupancy Fields for 3D Surface-Aware Image Synthesis" (NeurIPS 2021)

Apache License 2.0


AssertionError: No inf checks were recorded for this optimizer. #2

Closed. ashawkey closed this issue 2 years ago

ashawkey commented 2 years ago

Hello, when trying to train the model myself, I ran into the following error:

Traceback (most recent call last):
  File ".../site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File ".../GOF_NeurIPS2021/train.py", line 340, in train
    scaler.step(optimizer_G)
  File ".../site-packages/torch/cuda/amp/grad_scaler.py", line 337, in step
    assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."

The environment matches requirements.txt (incidentally, should the package name mcubes be PyMCubes?). I tried commenting out that assertion in grad_scaler.py; training then runs, but the results do not seem to converge (the output is still random noise after around 30000 steps). Any help would be appreciated!

SheldonTsui commented 2 years ago

@ashawkey Thank you for pointing out the 'PyMCubes' package name. As for the reported error, there might be some inf or nan values in optimizer_G during training. You can try to locate them and fix the bugs accordingly.

ashawkey commented 2 years ago

@SheldonTsui Thanks for the quick reply! However, after some more experiments, I find that all of the params in optimizer_G have None grad and are skipped in unscale_, which is quite confusing. Could you kindly provide a minimal trainable example with the code in this repo? The current scripts in auto_bash seem incomplete. (The paper also mentions loading an early, correct outward-facing pretrained model; is it also provided?)
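
One way to confirm the "all grads are None" observation is to list which parameters held by the optimizer have no gradient right before `scaler.step(optimizer_G)`. The sketch below is not code from this repo: the helper names `params_without_grad` and `all_grads_missing` are made up for illustration, and it only assumes the optimizer exposes `param_groups` (a list of dicts with a `"params"` entry), as PyTorch optimizers do. As far as I understand, `unscale_` skips parameters whose `.grad` is None, so if every parameter is skipped no inf check is recorded and the assertion fires.

```python
from types import SimpleNamespace


def params_without_grad(optimizer, names_by_id=None):
    """Return labels of parameters in `optimizer` whose .grad is None.

    `names_by_id` is an optional, hypothetical mapping {id(param): name};
    without it, parameters are labelled by their group and position.
    """
    names_by_id = names_by_id or {}
    missing = []
    for group_idx, group in enumerate(optimizer.param_groups):
        for param_idx, param in enumerate(group["params"]):
            if param.grad is None:
                default = f"group{group_idx}/param{param_idx}"
                missing.append(names_by_id.get(id(param), default))
    return missing


def all_grads_missing(optimizer):
    """True if the optimizer holds parameters and none of them has a grad."""
    total = sum(len(group["params"]) for group in optimizer.param_groups)
    return total > 0 and len(params_without_grad(optimizer)) == total


if __name__ == "__main__":
    # Stand-in objects so the sketch runs without PyTorch installed;
    # a real torch.optim optimizer has the same param_groups layout.
    with_grad = SimpleNamespace(grad=0.5)
    no_grad = SimpleNamespace(grad=None)
    fake_optimizer = SimpleNamespace(param_groups=[{"params": [with_grad, no_grad]}])
    print(params_without_grad(fake_optimizer))
    print(all_grads_missing(fake_optimizer))
```

In a real training loop this would be called after the backward pass and before `scaler.step(optimizer_G)`. If every parameter is reported, the loss is probably not connected to the parameters registered in that optimizer (for example, the optimizer was built from a different module instance than the one used in the forward pass), though that is only a guess about this particular case.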

SheldonTsui commented 2 years ago

Hi @ashawkey. I have now provided the early pre-trained models; please refer to the updated README. You can debug with these early models.

ashawkey commented 2 years ago

Thanks a lot!

bluestyle97 commented 2 years ago

@ashawkey Hi, have you solved it? I have run into this problem as well.

SheldonTsui commented 2 years ago

I find that this bug sometimes occurs when I use PyTorch 1.8. For now, it may be safer to use PyTorch 1.7.1 to avoid this problem. I'll update 'requirements.txt' accordingly.
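
For anyone who wants an early warning instead of hitting the assertion mid-run, a small version check at the top of the training script is one option. This is only a sketch based on the report above (1.8 sometimes triggers the error, 1.7.1 is the suggested fallback); the function names are hypothetical and nothing here is part of the repo.

```python
import warnings


def parse_major_minor(version):
    """Extract (major, minor) from a version string such as '1.8.0+cu111'."""
    release = version.split("+")[0]  # drop a local build tag like '+cu111'
    parts = release.split(".")
    return int(parts[0]), int(parts[1])


def is_reported_problem_version(version):
    """True for the 1.8 series, where the error in this thread was reported."""
    return parse_major_minor(version) == (1, 8)


def warn_if_problem_version(version):
    """Emit a warning (without stopping the run) for the reported version."""
    if is_reported_problem_version(version):
        warnings.warn(
            f"PyTorch {version} detected; this thread reports "
            "'No inf checks were recorded for this optimizer' on 1.8. "
            "PyTorch 1.7.1 was suggested as a safer choice."
        )
```

In the training script this would be called as `warn_if_problem_version(torch.__version__)` right after `import torch`. It takes the version as a plain string so the check itself does not depend on PyTorch.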