Nan values in the loss dict

purugupta99 commented 4 years ago

FloatingPointError: Loss became infinite or NaN at iteration=8! loss_dict = {'loss_cls': tensor(0.9525, device='cuda:1', grad_fn=<NllLossBackward>), 'loss_box_reg': tensor(0., device='cuda:1', grad_fn=<DivBackward0>), 'loss_z_reg': tensor(nan, device='cuda:1', grad_fn=<DivBackward0>), 'loss_mask': tensor(5.8197, device='cuda:1', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'loss_voxel': tensor(2.0721, device='cuda:1', grad_fn=<MulBackward0>), 'loss_chamfer': tensor(0., device='cuda:1', grad_fn=<MulBackward0>), 'loss_normals': tensor(0., device='cuda:1', grad_fn=<MulBackward0>), 'loss_edge': tensor(0., device='cuda:1', grad_fn=<MulBackward0>), 'loss_rpn_cls': tensor(0.3055, device='cuda:1', grad_fn=<MulBackward0>), 'loss_rpn_loc': tensor(0.0218, device='cuda:1', grad_fn=<MulBackward0>)}

I tried reducing the BASE_LR in config.yaml(as suggested in #36) but still face similar error. Can you suggest something I can try?

gkioxari commented 4 years ago

I presume you are deviating from the training recipe in the codebase. Could you give me more details on your recipe otherwise I can't help you in any way.

ShashwatNigam99 commented 4 years ago

Here are some details, we can provide you more. We are trying to train on our custom dataset. We are using the pix3d training pipeline and have modeled our data according to the pix3d dataset. We are trying to train using 2 GPUs and have reduced the IMS_PER_BATCH to 4 (from 16). We have tested this config on the actual pix3d dataset, and were able to train only with worse results.

But when we move on to train using the same config using our dataset, we face this error.

What could be the issues we should try and investigate?

You suggested reducing learning rate, but that didn't work for us.
Is the size of the dataset an issue?

gkioxari commented 4 years ago

So you are training on a your own dataset. I am not sure what is going on without having a reproducible test case so I can only speculate and give some suggestions on how to debug this based on my experience. (Side note on training on Pix3D with fewer GPUs: Yes, you should expect results to be worse when training with a smaller batch size. Reducing the learning rate could recover some points in performance but it is not expected to get you to full performance). Back to your issue

First you need to verify that your data is valid since you use your own dataset. You could do this by visualizing the minibatches (set the cfg.MODEL.VIS_MINIBATCH to True) which stores the training batches to /tmp/output. You might have some outlier data that cause the losses to spike.
Set your learning rate to something very very low and see if you get the same NaN issue.

My speculation is that something might be off in your data or when the MeshMapper transforms your data for the RoIs. If there is an outlier ground truth the loss can go very high and lead to NaN. This could be a bug in the data processing. I noticed that your losses in your error above all set to 0 for chamfer, normal and edge and only the voxel loss is positive. So presumably you are training only for voxels?

ShashwatNigam99 commented 4 years ago

We are following the norm training procedure and not training just for voxels. We will try out your suggestions get back to you. We can also share our dataset so that you could point out what's wrong?

ShashwatNigam99 commented 4 years ago

We tried putting VIS_MINIBATCH=TRUE and that made a minibatch visualization folder in our meshrcnn/output folder (didn't make in /tmp/output). We tried to make sense of that output and have attached one that came from our dataset for your reference.

Can you explain to us what 'Voxel', 'Mesh' and 'Init Mesh' mean?
Also, just to confirm, the 'Mask' shown on the right-top is the zoomed in mask part contained inside the bbox area?
Does it seem to be a problem with our bbox annotation?

1696_5

ShashwatNigam99 commented 4 years ago

75923fba-b5eb-4353-9e73-aa577c9ed8c4

823e8716-3bfd-47bb-8dbe-20abdb918673

We tried to run VIS_MINIBATCH=True on the Pix3D data, normal training procedure (meshrcnn_R50_FPN.yaml), and saw these outputs in the minibatch. Why does the same image appear repeatedly ? Are these the ground truth annotations or predictions by the network at that point of training(after a forward pass)?

gkioxari commented 4 years ago

You need to understand what we are visualizing here. So the visualization is a way to debug your data and your processing. These visualizations look correct! So you have your image input and the box overlaid (top: first and 2nd image). Then we visualize the roi-cropped gt mask, the roi-cropped gt voxel and the roi-transformed gt mesh. Init mesh is the initial mesh as predicted by the voxel head which at the beginning of training is noisy of course.

What is concerning in the visualization of your own dataset, is that the roi-cropped gt voxel and roi-transformed gt mesh are empty which means that your data processing is not doing the right thing. You need to debug this to figure this out.

ShashwatNigam99 commented 4 years ago

What we understand after looking at the pix3d dataset annotation is that the keep they assume the camera to be at origin and the object is rotated,translated by R and t; which we provide in the annotations. Is this correct? We have now tried to model our dataset in this way.
Sample vertices: Can you expain what's happening here? After fixing our camera parameters we are able to see some thing in our roi-cropped gt voxel and roi-transformed gt mesh, but very grainy. We also noticed that our objects are quite bigger than the pix3d objects. Is this because you are randomly sampling only a few number of vertices while our objects have more? Does the size of our objects affect training?

gkioxari commented 4 years ago

Here are the functions you should read and understand in an effort to draw a parallel to Pix3D:

The last two transform the mesh and voxel based on the roi selected at train time. These functions are used to construct the targets when computing the losses.

In the data preparation part of Pix3D, Pix3D provides the R,T that transforms the CAD models from canonical space to image (or what we call view) space. The latter is aligned with the image input and is what we use to transform the shapes with the selected rois.

Regarding the sampled vertices to visualize, when we visualize meshes easy we sample points from the mesh surface and project them to the image plane. The size of the mesh doesn't matter as you sample N points total from each mesh and the number of points sampled per face is proportional to the face area. So the output of the function should be N 3D points sampled from the mesh surface.

gkioxari commented 4 years ago

Closing this! Reopen if you have another question.

mikeroberts3000 commented 3 years ago

I'll add a quick observation in case anyone else encounters a similar error message (i.e., loss_z_reg == nan).

I have occasionally encountered this same error when training Mesh R-CNN with my own custom dataset. When an instance is visible in an image, but the center of its 3D axis-aligned-in-camera-space bounding box is behind the camera, it will cause loss_z_reg == nan, because the function that computes this loss term will try to take the log of a negative number. In other words, training on instances where the 3D bounding box center is behind the camera is an unsupported case. You may need to explicitly filter out such instances when training on your own custom dataset.

facebookresearch / meshrcnn

Nan values in the loss dict #51