facebookresearch / meshrcnn

code for Mesh R-CNN, ICCV 2019

Nan values in the loss dict #51

Closed purugupta99 closed 4 years ago

purugupta99 commented 4 years ago

FloatingPointError: Loss became infinite or NaN at iteration=8! loss_dict = {'loss_cls': tensor(0.9525, device='cuda:1', grad_fn=<NllLossBackward>), 'loss_box_reg': tensor(0., device='cuda:1', grad_fn=<DivBackward0>), 'loss_z_reg': tensor(nan, device='cuda:1', grad_fn=<DivBackward0>), 'loss_mask': tensor(5.8197, device='cuda:1', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'loss_voxel': tensor(2.0721, device='cuda:1', grad_fn=<MulBackward0>), 'loss_chamfer': tensor(0., device='cuda:1', grad_fn=<MulBackward0>), 'loss_normals': tensor(0., device='cuda:1', grad_fn=<MulBackward0>), 'loss_edge': tensor(0., device='cuda:1', grad_fn=<MulBackward0>), 'loss_rpn_cls': tensor(0.3055, device='cuda:1', grad_fn=<MulBackward0>), 'loss_rpn_loc': tensor(0.0218, device='cuda:1', grad_fn=<MulBackward0>)}

I tried reducing the BASE_LR in config.yaml (as suggested in #36), but I still face a similar error. Can you suggest something I can try?

gkioxari commented 4 years ago

I presume you are deviating from the training recipe in the codebase. Could you give me more details on your recipe? Otherwise, I can't help you in any way.

ShashwatNigam99 commented 4 years ago

Here are some details; we can provide more. We are trying to train on our custom dataset. We are using the Pix3D training pipeline and have modeled our data according to the Pix3D dataset format. We are training with 2 GPUs and have reduced IMS_PER_BATCH to 4 (from 16). We tested this config on the actual Pix3D dataset and were able to train, albeit with worse results.

But when we move on to train with the same config on our dataset, we face this error.

What issues should we try to investigate?

gkioxari commented 4 years ago

So you are training on your own dataset. I am not sure what is going on without a reproducible test case, so I can only speculate and give some suggestions on how to debug this based on my experience. (Side note on training on Pix3D with fewer GPUs: yes, you should expect results to be worse when training with a smaller batch size. Reducing the learning rate could recover some points in performance, but it is not expected to get you to full performance.)
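As a rough, hedged illustration of that linear scaling idea: the reference batch size of 16 comes from this thread, while the reference BASE_LR below is only a placeholder and should be read from meshrcnn_R50_FPN.yaml.

```python
# Sketch of the linear LR scaling heuristic when shrinking the batch size.
# reference_base_lr is a placeholder value, not one confirmed by this thread.
reference_ims_per_batch = 16   # recipe batch size mentioned in this thread
reference_base_lr = 0.02       # placeholder; check configs/pix3d/meshrcnn_R50_FPN.yaml

my_ims_per_batch = 4           # the reduced batch size (2 GPUs) used above
scaled_base_lr = reference_base_lr * my_ims_per_batch / reference_ims_per_batch

print(f"SOLVER.IMS_PER_BATCH: {my_ims_per_batch}")
print(f"SOLVER.BASE_LR: {scaled_base_lr}")   # 0.005 with the placeholder value above
```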

Back to your issue: my speculation is that something might be off in your data, or in how the MeshMapper transforms your data for the RoIs. If there is an outlier ground truth, the loss can go very high and lead to NaN. This could be a bug in the data processing. I noticed that in the error above your chamfer, normal and edge losses are all 0 and only the voxel loss is positive. So presumably you are training only for voxels?
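A minimal way to hunt for such outliers or broken annotations, assuming a Pix3D-style annotation file; the field names (`bbox`, `model`, `voxel`, `mask`) and the paths below follow the original pix3d.json layout and are assumptions that may need adapting to your own data:

```python
# Rough sanity check over Pix3D-style annotations (not part of the meshrcnn codebase).
import json
import os

DATASET_ROOT = "datasets/mydataset"                      # hypothetical path
with open(os.path.join(DATASET_ROOT, "annotations.json")) as f:
    annotations = json.load(f)

for i, ann in enumerate(annotations):
    # Assuming boxes are stored as [x0, y0, x1, y1]; adapt if yours are [x, y, w, h].
    x0, y0, x1, y1 = ann["bbox"]
    if x1 <= x0 or y1 <= y0:
        print(f"[{i}] degenerate box: {ann['bbox']}")

    # Every instance should point to an existing mesh, voxel and mask file.
    for key in ("model", "voxel", "mask"):
        rel_path = ann.get(key)
        if not rel_path or not os.path.isfile(os.path.join(DATASET_ROOT, rel_path)):
            print(f"[{i}] missing {key}: {rel_path}")
```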

ShashwatNigam99 commented 4 years ago

We are following the normal training procedure and not training just for voxels. We will try out your suggestions and get back to you. We could also share our dataset so that you can point out what's wrong.

ShashwatNigam99 commented 4 years ago

We tried setting VIS_MINIBATCH=True, and that created a minibatch visualization folder in our meshrcnn/output folder (it was not created in /tmp/output). We tried to make sense of that output and have attached one visualization that came from our dataset for your reference.

[attached minibatch visualization: 1696_5]

ShashwatNigam99 commented 4 years ago

[two attached minibatch visualizations]

We tried running with VIS_MINIBATCH=True on the Pix3D data with the normal training procedure (meshrcnn_R50_FPN.yaml), and saw these outputs in the minibatch folder. Why does the same image appear repeatedly? Are these the ground truth annotations, or predictions by the network at that point of training (after a forward pass)?

gkioxari commented 4 years ago

You need to understand what we are visualizing here. The visualization is a way to debug your data and your processing. These visualizations look correct! You have your image input with the box overlaid (top: first and second image). Then we visualize the roi-cropped gt mask, the roi-cropped gt voxel and the roi-transformed gt mesh. The init mesh is the initial mesh as predicted by the voxel head, which at the beginning of training is of course noisy.

What is concerning in the visualization from your own dataset is that the roi-cropped gt voxel and the roi-transformed gt mesh are empty, which means that your data processing is not doing the right thing. You need to debug this to figure out where it goes wrong.
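One way to debug this, sketched under assumptions rather than taken from this repo: transform the ground-truth mesh with your R, T, project it with a simple pinhole camera, and check whether the projected points land inside the annotated 2D box rather than outside the image (which would explain empty roi-cropped targets). The intrinsics and sign conventions below are placeholders, so adjust them to whatever your data actually uses.

```python
# Hypothetical consistency check between the view-space mesh and the 2D annotation.
import numpy as np
from pytorch3d.io import load_obj

verts, _, _ = load_obj("model.obj")           # hypothetical path to a gt CAD model
verts = verts.numpy()

R = np.eye(3)                                 # replace with your rot_mat
t = np.array([0.0, 0.0, 2.0])                 # replace with your trans_mat
f, cx, cy = 1000.0, 480.0, 360.0              # placeholder intrinsics (pixels)

verts_view = verts @ R.T + t                  # canonical space -> view space
# Simple pinhole projection assuming the object sits along +z; flip signs if your
# convention (or Pix3D's) places the camera looking down -z.
x = f * verts_view[:, 0] / verts_view[:, 2] + cx
y = f * verts_view[:, 1] / verts_view[:, 2] + cy

print("projected extent:", x.min(), x.max(), y.min(), y.max())
# Compare this extent against the annotated bbox; no overlap would explain
# empty roi-cropped gt voxels/meshes in the minibatch visualization.
```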

ShashwatNigam99 commented 4 years ago
gkioxari commented 4 years ago

Here are the functions you should read and understand in an effort to draw a parallel to Pix3D:

The last two transform the mesh and voxel based on the roi selected at train time. These functions are used to construct the targets when computing the losses.

In the data preparation for Pix3D, the dataset provides the R, T that transform the CAD models from canonical space to image (or what we call view) space. View space is aligned with the image input and is what we use to transform the shapes with the selected rois.

Regarding the sampled vertices to visualize: when we visualize meshes, we sample points from the mesh surface and project them to the image plane. The size of the mesh doesn't matter, as you sample N points total from each mesh and the number of points sampled per face is proportional to the face area. So the output of the function should be N 3D points sampled from the mesh surface.
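A small numpy sketch of that kind of area-weighted surface sampling (an illustration of the idea, not the function used in this codebase):

```python
# Illustrative area-weighted sampling of N points from a triangle mesh surface.
# Faces are chosen with probability proportional to their area, then a uniform
# barycentric sample is drawn inside each chosen face.
import numpy as np

def sample_points_from_mesh(verts, faces, num_samples):
    v0, v1, v2 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    areas = 0.5 * np.linalg.norm(np.cross(v1 - v0, v2 - v0), axis=1)
    probs = areas / areas.sum()

    face_idx = np.random.choice(len(faces), size=num_samples, p=probs)
    u = np.random.rand(num_samples, 1)
    v = np.random.rand(num_samples, 1)
    flip = (u + v) > 1.0                      # reflect samples that fall outside the triangle
    u[flip], v[flip] = 1.0 - u[flip], 1.0 - v[flip]

    return (v0[face_idx]
            + u * (v1[face_idx] - v0[face_idx])
            + v * (v2[face_idx] - v0[face_idx]))
```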

gkioxari commented 4 years ago

Closing this! Reopen if you have another question.

mikeroberts3000 commented 3 years ago

I'll add a quick observation in case anyone else encounters a similar error message (i.e., loss_z_reg == nan).

I have occasionally encountered this same error when training Mesh R-CNN with my own custom dataset. When an instance is visible in an image, but the center of its 3D axis-aligned-in-camera-space bounding box is behind the camera, it will cause loss_z_reg == nan, because the function that computes this loss term will try to take the log of a negative number. In other words, training on instances where the 3D bounding box center is behind the camera is an unsupported case. You may need to explicitly filter out such instances when training on your own custom dataset.
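A hedged sketch of such a filter, computing the axis-aligned camera-space box center from the transformed vertices; the `rot_mat`/`trans_mat` field names are borrowed from the Pix3D format as an assumption, and `load_verts` is a hypothetical loader:

```python
# Hypothetical pre-filter: drop instances whose 3D (axis-aligned, camera-space)
# bounding box center lies behind the camera, since loss_z_reg takes a log of the depth.
import numpy as np

def bbox_center_is_in_front(verts_canonical, rot_mat, trans_mat):
    """verts_canonical: (V, 3) array of mesh vertices in canonical space."""
    verts_view = verts_canonical @ np.asarray(rot_mat).T + np.asarray(trans_mat)
    center = 0.5 * (verts_view.min(axis=0) + verts_view.max(axis=0))
    # Whether "in front of the camera" means positive or negative z depends on your
    # camera convention; flip this comparison if your data uses the opposite sign.
    return center[2] > 0

# instances = [inst for inst in instances
#              if bbox_center_is_in_front(load_verts(inst), inst["rot_mat"], inst["trans_mat"])]
```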