Train metrics not finite: {'psnr': nan,} error while training

sumanttyagi commented 2 years ago

I am trying to train on a very small dataset: 24 images, 6 of which are held out for validation. I am hitting the issue below. Can you tell me what is going wrong? Is it because of the dataset size?

Max image index is 23: using dtype: <class 'numpy.uint16'>
Allocating 200 chunks to dataset path /mnt/sdb1/sumant/act_nerf/building-pixsfm/oiccdust4
200 chunks allocated
100%|████████████████████████████████████████████████████████| 24/24 [01:22<00:00,  3.46s/it]
Finished writing chunks to dataset paths
  0%|                                                 | 76/500000 [00:25<41:16:11,  3.36it/s]{
  "message": {
    "message": "Exception: Train metrics not finite: {'psnr': nan, 'depth_variance': tensor(nan, device='cuda:0'), 'photo_loss': tensor(nan, device='cuda:0', grad_fn=<MseLossBackward0>), 'loss': tensor(nan, device='cuda:0', grad_fn=<MseLossBackward0>)}",
    "extraInfo": {
      "py_callstack": "Traceback (most recent call last):\n  File \"/home/ubuntu/anaconda3/envs/mega-nerf/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 345, in wrapper\n    return f(*args, **kwargs)\n  File \"/mnt/sdb1/sumant/act_nerf/nbs/mega-nerf-main/mega_nerf/train.py\", line 24, in main\n    Runner(hparams).train()\n  File \"/mnt/sdb1/sumant/act_nerf/nbs/mega-nerf-main/mega_nerf/runner.py\", line 261, in train\n    raise Exception('Train metrics not finite: {}'.format(metrics))\nException: Train metrics not finite: {'psnr': nan, 'depth_variance': tensor(nan, device='cuda:0'), 'photo_loss': tensor(nan, device='cuda:0', grad_fn=<MseLossBackward0>), 'loss': tensor(nan, device='cuda:0', grad_fn=<MseLossBackward0>)}\n",
      "timestamp": "1660270116"
    }
  }
}
Traceback (most recent call last):
  File "/mnt/sdb1/sumant/act_nerf/nbs/mega-nerf-main/mega_nerf/train.py", line 28, in <module>
    main(_get_train_opts())
  File "/home/ubuntu/anaconda3/envs/mega-nerf/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/mnt/sdb1/sumant/act_nerf/nbs/mega-nerf-main/mega_nerf/train.py", line 24, in main
    Runner(hparams).train()
  File "/mnt/sdb1/sumant/act_nerf/nbs/mega-nerf-main/mega_nerf/runner.py", line 261, in train
    raise Exception('Train metrics not finite: {}'.format(metrics))
Exception: Train metrics not finite: {'psnr': nan, 'depth_variance': tensor(nan, device='cuda:0'), 'photo_loss': tensor(nan, device='cuda:0', grad_fn=<MseLossBackward0>), 'loss': tensor(nan, device='cuda:0', grad_fn=<MseLossBackward0>)}
  0%|                                                 | 76/500000 [00:25<46:26:51,  2.99it/s]

2. Also, what is grid_dim used for when preparing the dataset? I set it to 2 2; will that affect the output?

sumanttyagi commented 2 years ago

@hturki could you please look into this and provide some feedback?

weidezhang commented 2 years ago

I had the same issue. For my dataset, I set ray_altitude_range to [8, 100] and used a grid size of 1, 1 to train a single small-scale scene.

{ "message": { "message": "Exception: Train metrics not finite: {'psnr': nan, 'depth_variance': tensor(nan, device='cuda:0'), 'photo_loss': tensor(nan, device='cuda:0', grad_fn=), 'loss': tensor(nan, device='cuda:0', grad_fn=)}", "extraInfo": { "py_callstack": "Traceback (most recent call last):\n File \"/home/weide/miniconda3/envs/mega-nerf/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 345, in wrapper\n return f(*args, **kwargs)\n File \"/home/weide/dev/meganerf/mega-nerf/mega_nerf/train.py\", line 24, in main\n Runner(hparams).train()\n File \"/home/weide/dev/meganerf/mega-nerf/mega_nerf/runner.py\", line 261, in train\n raise Exception('Train metrics not finite: {}'.format(metrics))\nException: Train metrics not finite: {'psnr': nan, 'depth_variance': tensor(nan, device='cuda:0'), 'photo_loss': tensor(nan, device='cuda:0', grad_fn=), 'loss': tensor(nan, device='cuda:0', grad_fn=)}\n", "timestamp": "1660596324" } } }

sumanttyagi commented 2 years ago

@weidezhang were you able to prepare the dataset using pixsfm?

weidezhang commented 2 years ago

@weidezhang are you able to prepare the dataset using pixsfm ?

I was using COLMAP to refine the poses.

hturki commented 2 years ago

You can try disabling mixed-precision training via the no-amp option if you think numerical stability is an issue. But is this a 360-degree scene? 18 images is a very small number to train a NeRF on, and Mega-NeRF is generally designed for larger datasets. There are many NeRF variants that are likely better suited to low-data regimes.
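To illustrate why mixed precision can produce non-finite values, here is a minimal sketch using only the Python standard library. It assumes the mixed-precision path runs some operations in float16; the helper `to_half` is hypothetical and not part of Mega-NeRF. It rounds a value through IEEE 754 half precision and maps overflow to infinity, which is what float16 arithmetic on hardware does.

```python
import math
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (hypothetical helper)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack out-of-range finite values; float16
        # arithmetic on hardware would yield a signed infinity instead.
        return math.copysign(math.inf, x)


big = to_half(1.0e5)     # above the float16 maximum of 65504
tiny = to_half(1.0e-8)   # below the smallest float16 subnormal

print(big)               # overflowed value
print(tiny)              # underflowed value
print(big - big)         # inf - inf is nan, which then propagates into the loss
```

Values that are harmless in float32 can overflow or vanish in float16, so turning mixed precision off is a cheap first experiment before looking for a bug elsewhere.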

sumanttyagi commented 2 years ago

@hturki let me try it with a bigger dataset. See also https://github.com/cmusatyalab/mega-nerf/issues/16: @menglongyue tried with 507 images and is still getting the same error. Is there something that we (me, @weidezhang, @menglongyue) could have missed while preparing a custom dataset? For information, we are all creating camera poses with COLMAP only, instead of pixsfm.

hturki commented 2 years ago

It's hard to determine from this message alone. Outside of potential bugs in the codebase, the main thing the NaN indicates is that training is not converging. You can try to trace the exact batch that causes the NaN to see whether there is in fact a bug in the rendering code somewhere. If that's not the case, the main thing I'd assert is that 18 images is a small number for training a large-scale 360 scene representation (what Mega-NeRF is designed for).
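One way to trace the first offending batch is to check every metric for finiteness at each step and stop at the first failure. The sketch below is a standalone illustration, not Mega-NeRF code: the function names and the `history` values are hypothetical, and plain floats stand in for the tensors a real training loop would produce.

```python
import math


def non_finite_keys(metrics: dict) -> list:
    """Return the names of metrics whose values are NaN or infinite."""
    return [name for name, value in metrics.items()
            if not math.isfinite(float(value))]


def first_bad_step(metric_stream):
    """Return (step, bad_keys) for the first step with a non-finite metric, else None."""
    for step, metrics in enumerate(metric_stream):
        bad = non_finite_keys(metrics)
        if bad:
            return step, bad
    return None


# Hypothetical per-step metrics; real values would come from the training loop.
history = [
    {"psnr": 11.2, "photo_loss": 0.075},
    {"psnr": 11.9, "photo_loss": 0.064},
    {"psnr": float("nan"), "photo_loss": float("nan")},
]
print(first_bad_step(history))
```

Once the step is known, the rays and images of that batch can be saved and replayed in isolation, which is much faster than rerunning training from scratch.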

sjtuytc commented 2 years ago

I ran into the same problem when running on custom datasets.

guijuzhejiang commented 1 year ago

I also ran into the same problem. Is ray_altitude_range set incorrectly? If so, the question is how to set the correct ray_altitude_range for custom datasets.

alan355 commented 1 year ago

I struggled with the 'psnr: nan' problem for a long time. I solved it by using COLMAP's model_aligner with the GPS information that my dataset contains.

hvkwak commented 1 year ago

@alan355 @hturki Could you please share more information about COLMAP's model_aligner? It seems the COLMAP coordinate system is initially RDF. I guess that running model_aligner with my GPS information changes the coordinate system (based on ENU) to something else (RFU, I guess). I have had terrible days because I don't know how to handle this correctly. Mega-NeRF's preprocessing converts from RDF to DRB. How would you train on a geo-referenced dataset?
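For reference, the RDF-to-DRB axis change mentioned above can be sketched as a pure axis relabeling. This reads the acronyms literally (right, down, forward versus down, right, back) and is an assumption, not Mega-NeRF's actual preprocessing code; it also does not cover the ENU georeferencing step. The function names are hypothetical.

```python
def rdf_to_drb(point):
    """Re-express a (right, down, forward) point in (down, right, back) axes."""
    right, down, forward = point
    return (down, right, -forward)


def drb_to_rdf(point):
    """Inverse of rdf_to_drb."""
    down, right, back = point
    return (right, down, -back)


# A point 1 unit right, 2 units down, and 3 units in front of the camera.
p_rdf = (1.0, 2.0, 3.0)
p_drb = rdf_to_drb(p_rdf)
print(p_drb)
print(drb_to_rdf(p_drb) == p_rdf)
```

The change swaps two axes and flips one, so it is a proper rotation and keeps the handedness of the frame. A georeferenced model needs an additional, separate transform before this relabeling applies.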

IaroslavS commented 11 months ago

I faced the same issue. I solved it by changing parameters in the YAML config file: I removed the ray_altitude_range parameter and added no_ellipse_bounds: True. However, when creating an octree, the script throws an error because ray_altitude_range is not defined. After I defined this parameter again, I see only a black rectangle in Mega-NeRF.

lhc991025 commented 11 months ago

I struggled with the 'psnr: nan' problem for a long time. I solved it by using COLMAP's model_aligner with the GPS information that my dataset contains.

Hi, I'm having the same problem, can you elaborate on how you solved it?

IaroslavS commented 11 months ago

I struggled with the 'psnr: nan' problem for a long time. I solved it by using COLMAP's model_aligner with the GPS information that my dataset contains.

Hi, I'm having the same problem, can you elaborate on how you solved it?

The problem occurs while rendering rays. First, could you inspect the contents of the variable results[f'rgb_{typ}'] after the line https://github.com/cmusatyalab/mega-nerf/blob/main/mega_nerf/runner.py#L362? Check its shape and search for NaNs and infs. Second, set the hyperparameter fine_samples to 0 to find out whether the error occurs at https://github.com/cmusatyalab/mega-nerf/blob/main/mega_nerf/rendering.py#L195 or at https://github.com/cmusatyalab/mega-nerf/blob/main/mega_nerf/rendering.py#L227. Finally, analyze the function _inference. I suppose the error occurs at https://github.com/cmusatyalab/mega-nerf/blob/main/mega_nerf/rendering.py#L373, when the variable weights has invalid values for some reason, so you may want to inspect weights there. This variable is formed from two variables, 'alphas' and 'T'; you should print them to the console after the line https://github.com/cmusatyalab/mega-nerf/blob/main/mega_nerf/rendering.py#L367.

You had better check all of this within a single training run: I see you have trained for 4 hours, so you should print as much as possible to learn more at once.
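To make the role of 'alphas' and 'T' concrete, here is a minimal sketch of standard NeRF compositing in plain Python. It is not Mega-NeRF's implementation (which works on batched tensors), and the function names and sample values are hypothetical. It shows that one invalid density corrupts the weight of its own sample and of every sample behind it on the ray.

```python
import math


def render_weights(sigmas, deltas):
    """Standard NeRF compositing along one ray:
    alpha_i = 1 - exp(-sigma_i * delta_i)
    T_i = prod_{j<i} (1 - alpha_j)
    weight_i = alpha_i * T_i
    """
    weights = []
    transmittance = 1.0
    for sigma, delta in zip(sigmas, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weights.append(alpha * transmittance)
        transmittance *= 1.0 - alpha
    return weights


def report_non_finite(name, values):
    """Print and return the indices of NaN or infinite entries."""
    bad = [i for i, v in enumerate(values) if not math.isfinite(v)]
    if bad:
        print(f"{name}: non-finite at sample indices {bad}")
    return bad


deltas = [0.1, 0.1, 0.1, 0.1]
good = render_weights([0.5, 2.0, 4.0, 1.0], deltas)
bad = render_weights([0.5, float("nan"), 4.0, 1.0], deltas)

report_non_finite("good weights", good)
report_non_finite("bad weights", bad)
```

Because the transmittance is a running product, the first non-finite index in the weights points at the sample where the bad value entered, which narrows the search to the density or the distance for that sample.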

cmusatyalab / mega-nerf

Train metrics not finite: {'psnr': nan,} error while training #35