NVlabs / tiny-cuda-nn

Lightning fast C++/CUDA neural network framework

HashGrid + MLP outputs nans #179

Closed · alvaro-budria closed this issue 1 year ago

alvaro-budria commented 1 year ago

I am using a hash grid + MLP scheme like this:

        self.mlp_sdf = tcnn.NetworkWithInputEncoding(
            n_input_dims=num_dim,
            n_output_dims=1 + self.geo_feat_dim,  # 1 SDF value + geo_feat_dim features; if geo_feat_dim = 15 is too small, increase it at the expense of falling back from FullyFusedMLP to CutlassMLP
            encoding_config={
                "otype": "HashGrid",
                "n_levels": n_levels,
                "n_features_per_level": 2,
                "log2_hashmap_size": log2_hashmap_size,
                "base_resolution": 16,
                "per_level_scale": per_level_scale,
            },
            network_config={
                "otype": "FullyFusedMLP",
                "activation": "ReLU",
                "output_activation": "None",
                "n_neurons": 64,
                "n_hidden_layers": 2,
            },
        )

This module is showing a strange behaviour: at some point it produces an output consisting exclusively of NaNs. The error seems to happen quite randomly, in the sense that it occurs after a different number of epochs each time.

I have not found a previous issue discussing this, so hopefully we can find a solution to this problem.

I am using Python 3.8, pytorch==1.11.0+cu113, CUDA 11.3, and tinycudann==1.6 compiled for compute capability 75 (although I'm running on an RTX 3090, which is sm_86).

Any thoughts?

EDIT:

To check which component is producing NaNs, I split the fused module into two separate blocks (hash grid and MLP) and inspected their respective outputs, roughly as in the sketch below. It seems that the hash grid is producing an output containing NaN values.
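
A minimal sketch of the split (my own illustration with placeholder hyperparameter values mirroring the config above, not the exact code from my project):

    # Hedged sketch (placeholder hyperparameters): split the fused module into a
    # separate encoding and network so each output can be checked for NaNs.
    import torch
    import tinycudann as tcnn

    encoding = tcnn.Encoding(
        n_input_dims=3,
        encoding_config={
            "otype": "HashGrid",
            "n_levels": 16,
            "n_features_per_level": 2,
            "log2_hashmap_size": 19,
            "base_resolution": 16,
            "per_level_scale": 1.5,
        },
    )
    network = tcnn.Network(
        n_input_dims=encoding.n_output_dims,
        n_output_dims=16,  # 1 SDF value + 15 geometric features
        network_config={
            "otype": "FullyFusedMLP",
            "activation": "ReLU",
            "output_activation": "None",
            "n_neurons": 64,
            "n_hidden_layers": 2,
        },
    )

    x = torch.rand(4096, 3, device="cuda")  # hash grid expects inputs in [0, 1]
    feats = encoding(x)
    assert not torch.isnan(feats).any(), "hash grid output contains NaNs"
    out = network(feats)
    assert not torch.isnan(out).any(), "MLP output contains NaNs"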

EDIT2:

I tried removing the hash grid, and it turns out the MLP is also producing NaNs sometimes... I am sure the input contains no NaNs, as I am checking it with assert not torch.isnan(x).any().

I also tried substituting a pure PyTorch MLP for the tinycudann one. The error then goes away, so it seems that tinycudann is the culprit.
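
A minimal sketch of such a drop-in PyTorch MLP, assuming the same layer sizes as the tcnn config above (not my exact replacement code):

    # Hedged sketch: a plain PyTorch MLP matching the tcnn config above
    # (2 hidden layers of 64 neurons, ReLU activations, no output activation).
    import torch.nn as nn

    def make_mlp(n_input_dims, n_output_dims, n_neurons=64, n_hidden_layers=2):
        layers = [nn.Linear(n_input_dims, n_neurons), nn.ReLU()]
        for _ in range(n_hidden_layers - 1):
            layers += [nn.Linear(n_neurons, n_neurons), nn.ReLU()]
        layers.append(nn.Linear(n_neurons, n_output_dims))
        return nn.Sequential(*layers)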

alvaro-budria commented 1 year ago

I'm closing this. It turns out that the occupancy grid I am using was filtering out all samples for some of the batches, so the input to the MLP was an empty tensor...
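
For anyone hitting the same thing, a simple guard along these lines avoids feeding an empty tensor to the network (a sketch with hypothetical names, not my actual training loop):

    # Hedged sketch (hypothetical names): skip the step when the occupancy grid
    # has filtered out every sample, instead of querying the MLP with an empty tensor.
    def query_sdf(mlp_sdf, sample_points, occupancy_mask):
        samples = sample_points[occupancy_mask]
        if samples.shape[0] == 0:
            return None  # caller should skip this batch
        return mlp_sdf(samples)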

rohana96 commented 1 year ago

Hi @alvaro-budria, thanks for the description. I'm facing a similar issue right now; could you please elaborate on how you went about debugging this? Specifically, what was the solution you implemented after you found the issue? Thanks!

alvaro-budria commented 1 year ago

Hello @rohana96, I honestly can't recall properly; it was a few months ago. I believe at the time I was running trials with a very small number of rays and samples per ray (because of some other VRAM issues I was dealing with), so the chance that some batch would have all of its samples filtered out was actually quite high. In the end I solved the memory problem and then increased the batch size (number of rays and samples per ray), which ended up solving the NaN problem as well.

alvaro-budria commented 1 year ago

Hi again @rohana96, I encountered this problem again. This time, however, the cause was numerical instability: one of the weight-norm layers in my native PyTorch MLP was introducing NaNs, which were then backpropagated into the tinycudann hash encoding. So at the next iteration, the output of the encoding was NaN as well.
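
One way to track down which layer first produces NaNs is to register forward hooks on every submodule (a sketch of the general approach, not my exact debugging code); torch.autograd.set_detect_anomaly(True) can also help locate the offending backward op.

    # Hedged sketch: raise as soon as any submodule's forward output contains NaNs,
    # so the first offending layer (e.g. a weight-norm layer) is reported by name.
    import torch

    def install_nan_hooks(model):
        def make_hook(name):
            def hook(module, inputs, output):
                if torch.is_tensor(output) and torch.isnan(output).any():
                    raise RuntimeError(f"NaNs first appeared in layer: {name}")
            return hook
        for name, module in model.named_modules():
            module.register_forward_hook(make_hook(name))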

To solve this you can try: