Closed: alvaro-budria closed this issue 2 years ago
I'm closing this. It turns out that the occupancy grid I am using was filtering out all samples for some of the batches, so the input to the MLP was an empty tensor...
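A minimal guard along those lines, with placeholder names (`mlp`, `samples`, and a boolean `occupied` mask from the occupancy grid):

```python
import torch

def forward_occupied(mlp: torch.nn.Module, samples: torch.Tensor, occupied: torch.Tensor):
    # `samples` holds per-ray sample positions and `occupied` is the boolean
    # mask produced by the occupancy grid (both names are placeholders).
    kept = samples[occupied]
    if kept.shape[0] == 0:
        # Every sample in this batch was culled; skip it instead of
        # feeding an empty tensor to the MLP.
        return None
    return mlp(kept)
```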
Hi @alvaro-budria, thanks for the description. I'm facing a similar issue right now. Could you please elaborate on how you went about debugging this? Specifically, what was the solution you implemented after you found the issue? Thanks!
Hello @rohana96, I honestly can't recall properly; it was a few months ago. I believe at that time I was doing some trials with a very reduced number of rays and samples per ray (because of some other VRAM issues I was dealing with), so the chance of some batch at some point having all samples filtered out was actually quite high. In the end I solved the memory problem and then increased the batch size / number of rays and samples per ray, which ended up solving the nan problem.
Hi again @rohana96, I encountered this problem again. This time, however, the cause was numerical instabilities. I found that one of the weight norm layers in my native PyTorch MLP was introducing `nan`s, which were then (back)propagated to the `tinycudann` hash encoding. So at the next iteration, the output of the encoding was `nan` as well.
To solve this you can try:
- adding a small epsilon such as `1e-6` to the operation that becomes unstable (in my case, the weight norm);
- replacing the loss with a small constant whenever it becomes `nan`: `if loss.isnan(): loss = 1e-6` (see the sketch below).
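A minimal sketch of that `nan`-loss guard inside a training step (`model`, `optimizer`, `x`, and `target` are placeholders for the actual pipeline):

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, x, target):
    # Placeholder training step; the loss function and data are illustrative.
    optimizer.zero_grad()
    loss = F.mse_loss(model(x), target)

    # Guard against a nan loss so it is never backpropagated into the
    # hash grid; replace it with a tiny constant, which effectively
    # skips the parameter update for this batch.
    if loss.isnan():
        loss = torch.full_like(loss, 1e-6, requires_grad=True)

    loss.backward()
    optimizer.step()
    return loss.detach()
```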
I am using a hash grid + MLP scheme (sketched below), which is showing a strange behaviour where this module produces an output consisting exclusively of `nan`s. The error seems to happen quite randomly, in the sense that it happens after a different number of epochs each time. I have not found a previous issue discussing this, so hopefully we can find a solution to this problem.
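For context, a minimal sketch of this kind of hash grid + MLP setup; every parameter below is illustrative rather than the actual configuration:

```python
import torch
import tinycudann as tcnn

# Hash grid encoding followed by a fused MLP; all config values are illustrative.
encoding = tcnn.Encoding(
    n_input_dims=3,
    encoding_config={
        "otype": "HashGrid",
        "n_levels": 16,
        "n_features_per_level": 2,
        "log2_hashmap_size": 19,
        "base_resolution": 16,
        "per_level_scale": 2.0,
    },
)
network = tcnn.Network(
    n_input_dims=encoding.n_output_dims,
    n_output_dims=1,
    network_config={
        "otype": "FullyFusedMLP",
        "activation": "ReLU",
        "output_activation": "None",
        "n_neurons": 64,
        "n_hidden_layers": 2,
    },
)
model = torch.nn.Sequential(encoding, network).cuda()

x = torch.rand(4096, 3, device="cuda")  # inputs expected in [0, 1]
y = model(x)
assert not torch.isnan(y).any()
```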
I am using Python 3.8, `pytorch==1.11.0+cu113`, CUDA 11.3, and `tinycudann==1.6` compiled with compute capability 75 (although I'm using an RTX 3090 with sm_86). Any thoughts?
EDIT:
To check which component is producing `nan`s, I separated the two modules into two blocks (hash grid and MLP) and looked into their respective outputs. It seems that the hash grid is producing an output containing `nan` values.
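One way to automate that per-module check is with forward hooks; assuming the separated blocks are the `encoding` and `network` modules from the sketch above:

```python
import torch

def nan_check(name):
    # Raise as soon as a module emits nan values, so the offending
    # component and the exact iteration are identified immediately.
    def hook(module, inputs, output):
        if torch.isnan(output).any():
            raise RuntimeError(f"{name} produced nan values")
    return hook

encoding.register_forward_hook(nan_check("hash grid"))
network.register_forward_hook(nan_check("mlp"))
```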
EDIT2:
I tried removing the hash grid, and it turns out the MLP is also producing `nan`s sometimes... I am sure the input contains no `nan`s, as I am checking it with `assert not torch.isnan(x).any()`. I also tried substituting the tinycudann MLP for a pure PyTorch one, and then the error goes away, so it seems that tinycudann is the culprit.