sumanttyagi opened this issue 2 years ago
@hturki Could you kindly provide some feedback? Please look into this.
I had the same issue. For my dataset, I set ray_altitude_range to [8, 100] and used a grid size of 1 x 1 to train a single small-scale scene.
{
"message": {
"message": "Exception: Train metrics not finite: {'psnr': nan, 'depth_variance': tensor(nan, device='cuda:0'), 'photo_loss': tensor(nan, device='cuda:0', grad_fn=
@weidezhang are you able to prepare the dataset using pixsfm ?
I was using COLMAP to refine the poses.
You can try disabling mixed-precision training via the no-amp option if you suspect numerical stability is the issue. But is this a 360-degree scene? 18 images is a very small number to train a NeRF on, and Mega-NeRF is generally designed for larger datasets. Many NeRF variants are likely better suited to low-data regimes.
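To see why mixed precision can produce infs and nans in the first place, here is a minimal illustration (plain NumPy, not Mega-NeRF code) of how float16, which AMP uses for many operations, overflows long before float32 does:

```python
import numpy as np

# float16 has a maximum representable value of ~65504, so moderately large
# activations or losses overflow to inf under mixed precision, and the inf
# then propagates into nan metrics such as psnr.
x32 = np.float32(70000.0)  # representable in float32
x16 = np.float16(70000.0)  # overflows in float16

print(np.isfinite(x32))  # True
print(np.isinf(x16))     # True
```

Disabling AMP keeps everything in float32 at the cost of speed and memory, which is a reasonable first experiment when metrics go non-finite.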
@hturki Let me try it with a bigger dataset. Also see https://github.com/cmusatyalab/mega-nerf/issues/16, where @menglongyue tried with 507 images and still got the same error. Is there something we (@weidezhang, @menglongyue, and I) could have missed while preparing the custom dataset? For your information, we are all creating camera poses with COLMAP rather than pixsfm.
It's hard to determine from this message alone. Outside of potential bugs in the codebase, the main thing the nan is signaling is that training is not converging. You can try to trace the exact batch that causes the nan to see whether there is in fact a bug somewhere in the rendering code. If that's not the case, the main point I'd make is that 18 images is a small number to train a large-scale 360-degree scene representation (what Mega-NeRF is designed for) against.
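One way to trace the offending batch, as suggested above, is to guard each training step with a finiteness check so the failure names a specific batch index. This is a generic NumPy sketch, not Mega-NeRF's actual training loop:

```python
import numpy as np

def assert_finite(metrics, batch_idx):
    """Raise immediately, naming the batch, when any metric goes non-finite."""
    bad = sorted(k for k, v in metrics.items() if not np.all(np.isfinite(v)))
    if bad:
        raise RuntimeError(f"Non-finite metrics at batch {batch_idx}: {bad}")

# Synthetic example: the third batch produces a nan psnr.
batches = [{'psnr': 21.3}, {'psnr': 20.9}, {'psnr': float('nan')}]
for i, metrics in enumerate(batches):
    try:
        assert_finite(metrics, i)
    except RuntimeError as e:
        print(e)  # Non-finite metrics at batch 2: ['psnr']
```

Once you know the batch, you can dump its ray origins, directions, and bounds to see what makes it special.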
I met the same problem running on custom datasets.
I also met the same problem. Could it be caused by an incorrect ray_altitude_range? If so, the question becomes how to set the correct ray_altitude_range on custom datasets.
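For what it's worth, one heuristic (my own sketch, not an official Mega-NeRF recipe) is to bound the range using the camera altitudes recovered by COLMAP, padded to cover the ground and terrain. The axis that encodes altitude and the margin are assumptions you must verify for your coordinate system:

```python
import numpy as np

def estimate_altitude_range(cam_positions, altitude_axis=2, margin=0.2):
    """Heuristic ray_altitude_range from camera-to-world translations.

    Assumes `altitude_axis` is the altitude component of each position and
    pads the camera span by `margin` on both sides; both are guesses that
    must be checked against your scene's actual coordinate convention.
    """
    alts = np.asarray(cam_positions, dtype=float)[:, altitude_axis]
    span = alts.max() - alts.min()
    return [float(alts.min() - margin * span), float(alts.max() + margin * span)]

# Three example camera positions with altitudes 10, 30, 25:
print(estimate_altitude_range([[0, 0, 10], [1, 2, 30], [3, 1, 25]]))  # [6.0, 34.0]
```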
I had struggled with the 'psnr: nan' problem for a long time. I solved it by using COLMAP's model_aligner with the GPS information that my dataset contains.
@alan355 @hturki Could you please share more information about COLMAP's model_aligner? COLMAP's coordinate system is initially RDF. I guess that running model_aligner with my GPS information changes the coordinate system (based on ENU) to something else (RFU, I suspect; I have had terrible days because I don't know how to handle this correctly). Mega-NeRF's preprocessing converts from RDF to DRB. How would you train on a geo-referenced dataset?
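On the coordinate question: an axis remap such as RDF to DRB is just a permutation of the axes with sign flips. The sketch below assumes RDF = (Right, Down, Front) and DRB = (Down, Right, Back); confirm the exact conventions against Mega-NeRF's preprocessing code before relying on it:

```python
import numpy as np

# Assumed conventions: RDF = (Right, Down, Front), DRB = (Down, Right, Back).
RDF_TO_DRB = np.array([
    [0.0, 1.0, 0.0],   # DRB x (Down)  <- RDF y (Down)
    [1.0, 0.0, 0.0],   # DRB y (Right) <- RDF x (Right)
    [0.0, 0.0, -1.0],  # DRB z (Back)  <- negated RDF z (Front)
])

# Determinant +1 means this is a proper rotation, not a reflection.
assert np.isclose(np.linalg.det(RDF_TO_DRB), 1.0)

p_rdf = np.array([1.0, 2.0, 3.0])
p_drb = RDF_TO_DRB @ p_rdf
print(p_drb)  # [ 2.  1. -3.]
```

The same matrix would be applied to camera rotations as a change of basis; getting the sign conventions wrong typically sends rays in the wrong direction and can plausibly cause the nan losses discussed here.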
I faced the same issue. I solved it by changing parameters in the YAML config file: I removed the ray_altitude_range parameter and added no_ellipse_bounds: True. But when creating an octree, the script throws an error because ray_altitude_range is not defined. When I then defined this parameter, Mega-NeRF rendered only a black rectangle.
I had struggled with the 'psnr: nan' problem for a long time. I solved it by using the model_aligner part of COLMAP with the GPS information contained in my dataset.
Hi, I'm having the same problem, can you elaborate on how you solved it?
The problem occurs while rendering rays. First, analyze the variable results[f'rgb_{typ}'] after the line https://github.com/cmusatyalab/mega-nerf/blob/main/mega_nerf/runner.py#L362: check its shape and search for nans and infs. Second, set the hyperparameter fine_samples to 0 to determine whether the error occurs during https://github.com/cmusatyalab/mega-nerf/blob/main/mega_nerf/rendering.py#L195 or https://github.com/cmusatyalab/mega-nerf/blob/main/mega_nerf/rendering.py#L227. Finally, analyze the function _inference. I suspect the error occurs at https://github.com/cmusatyalab/mega-nerf/blob/main/mega_nerf/rendering.py#L373, where the variable weights for some reason has invalid values, so inspect weights there. It is formed from two variables, alphas and T, which you should print to the console after the line https://github.com/cmusatyalab/mega-nerf/blob/main/mega_nerf/rendering.py#L367.
You'd better analyze all of this after a single launch of training: since I see you've trained for four hours, print as much as possible so you learn as much as you can from each run.
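The inspection steps above can be sketched as a small helper (a NumPy stand-in for the actual torch tensors) that reports where alphas, T, or weights first go non-finite:

```python
import numpy as np

def report_nonfinite(name, t):
    """Report how many values are non-finite and where the first one is."""
    t = np.asarray(t)
    bad = ~np.isfinite(t)
    if bad.any():
        print(f"{name}: {bad.sum()} non-finite, first at index {int(np.argmax(bad))}")
    else:
        print(f"{name}: all finite (min={t.min():.3g}, max={t.max():.3g})")

# Synthetic ray: T is the transmittance, the cumulative product of (1 - alpha)
# over earlier samples, and weights = alphas * T, as in standard NeRF rendering.
alphas = np.array([0.1, 0.5, np.nan, 0.2])
T = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
weights = alphas * T
report_nonfinite('alphas', alphas)    # 1 non-finite, first at index 2
report_nonfinite('weights', weights)  # 2 non-finite, first at index 2
```

Note how a single nan alpha contaminates every later weight through the cumulative product, which is why finding the *first* non-finite index matters more than counting them.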
1. I am trying to train on a very small dataset of 24 images, 6 of which are used for validation. I am facing the issue below; can you tell me what is going wrong, or is it because of the dataset size?
2. Also, while preparing the data, what is the use of grid_dim? I set it to 2 2; will it affect the output?