WU-CVGL / BAD-NeRF

[CVPR 2023] 😈BAD-NeRF: Bundle Adjusted Deblur Neural Radiance Fields
https://wangpeng000.github.io/BAD-NeRF/
MIT License

Nan when training? How to solve? #9

Closed chenkang455 closed 8 months ago

chenkang455 commented 8 months ago

I've tried running BAD-NeRF on my own dataset, but encountered NaN values during training. Which parameters can be adjusted to solve this problem?

LingzheZhao commented 8 months ago

Hi, you can first try lowering the pose_lrate, as we discovered and described here in the README.
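
For reference, a minimal sketch of what lowering that rate typically looks like in a nerf-pytorch-style training script (the variable names and shapes below are placeholders, not this repo's actual code):

```python
import torch

# Hedged sketch (not the repo's actual code): how a lower initial pose learning
# rate is typically wired into a nerf-pytorch-style training loop.
num_images = 30                                  # placeholder: number of blurry input views
pose_lrate = 1e-4                                # try 1e-4 (or lower) if the default produces NaNs
pose_params = torch.nn.Parameter(torch.zeros(num_images, 6))    # placeholder per-view pose corrections
pose_optimizer = torch.optim.Adam([pose_params], lr=pose_lrate)
```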

chenkang455 commented 8 months ago

Thanks for your advice. Could you please clarify when the NaN values occurred during your experiments? I observed NaN values appearing after approximately 10,000 iterations. Even after lowering the pose learning rate to 1e-4, the issue persisted.

chenkang455 commented 8 months ago

I've tried adjusting near and far to 2 and 6, which are the parameters for lego in NeRF, but the problem still exists.

wangpeng000 commented 8 months ago

@chenkang455 Hi, we didn't process 360° scenes (like lego); the codebase is meant to handle "llff" scenes. The NaN problem may happen in the spline function. In our earlier experiments, this NaN problem appeared with small probability, and it basically does not happen when we decrease the initial pose learning rate.
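
To narrow down where the NaN first appears, a small debugging sketch (not part of this repo; `spline_poses` and `loss` are placeholder names) could look like:

```python
import torch

# Hedged debugging sketch: fail fast at the first non-finite tensor so the
# offending step (e.g. the pose spline interpolation) is easy to locate.
torch.autograd.set_detect_anomaly(True)   # slower, but reports the op that produced NaN gradients

def check_finite(name: str, tensor: torch.Tensor) -> None:
    # Raise immediately if a tensor contains NaN/Inf values.
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"non-finite values in {name}")

# Example usage inside the training loop (placeholder names):
#   check_finite("interpolated poses", spline_poses)
#   check_finite("loss", loss)
```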

If you want to handle 360° blur data, my advice is to transfer our spline method to a NeRF model that can directly handle 360° scenes, like NeRFStudio (https://github.com/WU-CVGL/BAD-NeRFstudio) or some other framework. What's more, in the original nerf-pytorch code, the "ndc", "near" and "far" parameters also depend on the scene type; we think the code should work well on forward-facing ("llff") scenes.

LingzheZhao commented 8 months ago

As @wangpeng000 points out, this codebase uses NDC scene contraction by default, so if your custom data does not follow the LLFF style, a workaround may be turning that option off.
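
Roughly speaking, what that switch changes is how near/far are chosen; a hedged sketch in the style of the original nerf-pytorch (the exact values and names in this repo may differ):

```python
import numpy as np

# Hedged sketch of nerf-pytorch-style near/far handling; `bds` would be the depth
# bounds returned by the LLFF loader (assumption, not verified against this repo).
def choose_near_far(no_ndc: bool, bds: np.ndarray):
    if no_ndc:
        # Without NDC, sample along rays in real scene depth taken from the bounds.
        near, far = float(np.min(bds)) * 0.9, float(np.max(bds))
    else:
        # With NDC, rays are warped so that depth effectively lives in [0, 1].
        near, far = 0.0, 1.0
    return near, far
```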

You can also try out our actively maintained BAD-NeRFstudio, since nerfstudio can handle various data types and it runs much faster.

chenkang455 commented 8 months ago

@LingzheZhao @wangpeng000 Thanks for your advice. I set ndc to False and no_ndc to True. It now seemingly works on my lego dataset with no NaN.

chenkang455 commented 7 months ago

Hi @LingzheZhao @wangpeng000,

Thank you for your detailed responses. I've come across another issue. As all the datasets in your paper are in LLFF style, I'm looking to use a 360-degree scene, like lego, which requires setting ndc to False. However, the results appear to be relatively subpar.

I'm wondering whether the problem is attributable to the load_llff_data function (replacing load_llff_data with load_blender_data might resolve it) or whether it's connected to the NDC setting.
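
For illustration, the usual nerf-pytorch-style dispatch between the two loaders looks roughly like the sketch below (module names and signatures are assumptions borrowed from the original nerf-pytorch, not verified against this repo):

```python
# Hedged sketch of a nerf-pytorch-style dataset dispatch; loader module names and
# signatures follow the original nerf-pytorch and may differ in this repo.
from load_blender import load_blender_data   # assumption: blender-style loader as in nerf-pytorch
from load_llff import load_llff_data         # assumption: LLFF-style loader as in nerf-pytorch

def load_dataset(dataset_type: str, datadir: str):
    if dataset_type == 'blender':
        # 360-degree synthetic scenes (e.g. lego): bounded volume, known near/far, no NDC.
        images, poses, render_poses, hwf, i_split = load_blender_data(datadir)
        near, far, use_ndc = 2.0, 6.0, False
    elif dataset_type == 'llff':
        # Forward-facing captures: NDC on, depth sampled in [0, 1] after the ray warp.
        images, poses, bds, render_poses, i_test = load_llff_data(datadir, factor=8,
                                                                  recenter=True, bd_factor=0.75)
        near, far, use_ndc = 0.0, 1.0, True
    else:
        raise ValueError(f"unknown dataset_type: {dataset_type}")
    return images, poses, near, far, use_ndc
```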

[Attached image: rendered result from the 360-degree scene]


[TRAIN] Iter: 198200 Loss: 0.004275559913367033  coarse_loss:, 0.0025209172163158655, PSNR: 27.5581111907959
[TRAIN] Iter: 198300 Loss: 0.0122279217466712  coarse_loss:, 0.0069176210090518, PSNR: 22.748807907104492
[TRAIN] Iter: 198400 Loss: 0.008158434182405472  coarse_loss:, 0.0045228805392980576, PSNR: 24.3942928314209
[TRAIN] Iter: 198500 Loss: 0.009360695257782936  coarse_loss:, 0.005309312138706446, PSNR: 23.923967361450195
[TRAIN] Iter: 198600 Loss: 0.005872755311429501  coarse_loss:, 0.0035669079516083, PSNR: 26.371694564819336
[TRAIN] Iter: 198700 Loss: 0.00870824046432972  coarse_loss:, 0.004816955421119928, PSNR: 24.099069595336914
[TRAIN] Iter: 198800 Loss: 0.0047332243993878365  coarse_loss:, 0.002672248985618353, PSNR: 26.859272003173828
[TRAIN] Iter: 198900 Loss: 0.006757638417184353  coarse_loss:, 0.003632021602243185, PSNR: 25.050642013549805
[TRAIN] Iter: 199000 Loss: 0.005781751591712236  coarse_loss:, 0.0031937966123223305, PSNR: 25.870431900024414
[TRAIN] Iter: 199100 Loss: 0.005918695125728846  coarse_loss:, 0.003178289858624339, PSNR: 25.62185287475586
[TRAIN] Iter: 199200 Loss: 0.006465458311140537  coarse_loss:, 0.0036145057529211044, PSNR: 25.45009994506836
[TRAIN] Iter: 199300 Loss: 0.005777256563305855  coarse_loss:, 0.003232533112168312, PSNR: 25.943593978881836
[TRAIN] Iter: 199400 Loss: 0.005243922583758831  coarse_loss:, 0.0030064096208661795, PSNR: 26.502344131469727
[TRAIN] Iter: 199500 Loss: 0.005706350319087505  coarse_loss:, 0.0031587404664605856, PSNR: 25.938671112060547
[TRAIN] Iter: 199600 Loss: 0.0033667951356619596  coarse_loss:, 0.0018724045949056745, PSNR: 28.255355834960938
[TRAIN] Iter: 199700 Loss: 0.006072608754038811  coarse_loss:, 0.00363011471927166, PSNR: 26.121665954589844
[TRAIN] Iter: 199800 Loss: 0.0029980246908962727  coarse_loss:, 0.0016257810639217496, PSNR: 28.625686645507812
[TRAIN] Iter: 199900 Loss: 0.0053638434037566185  coarse_loss:, 0.002897790865972638, PSNR: 26.079975128173828
[TRAIN] Iter: 200000 Loss: 0.0062975799664855  coarse_loss:, 0.0033614598214626312, PSNR: 25.3222599029541
wangpeng000 commented 7 months ago

@chenkang455, we have no plans to update this repository. Please refer to https://github.com/WU-CVGL/BAD-NeRF/issues/9#issuecomment-1869906340 and https://github.com/limacv/Deblur-NeRF/issues/37

chenkang455 commented 7 months ago

Got it! Thanks for your advice.