NVlabs / nvdiffrec

Official code for the CVPR 2022 (oral) paper "Extracting Triangular 3D Models, Materials, and Lighting From Images".

Why isn't the MLPTexture3D using the tcnn Network definition #154

Closed ktertikas closed 2 months ago

ktertikas commented 2 months ago

Hi @jmunkberg! First of all, thank you for the great work and codebase!

I have noticed that in the MLPTexture3D class you use the tcnn.Encoding class, which operates on float16 values, followed by a standard float32 MLP for the material property predictions (the _MLP class). I experimented with replacing the _MLP network with a tcnn.Network that has exactly the same number of parameters and operates in float16, but the results are significantly worse. Have you noticed similar behaviour in your experiments? Is this why you switched to a float32-precision MLP? Do you have any intuition into why this might be happening? I would have assumed that, since the hash encoding is already float16 precision, reducing the MLP to float16 as well would work.
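For reference, the substitution I tried looks roughly like this (a simplified sketch of my experiment, not the repo's code; the encoding config values are the MLPTexture3D defaults as far as I can tell, and the 9-channel output is just an example):

```python
import tinycudann as tcnn

# Hash-grid positional encoding, as in MLPTexture3D (tcnn runs this in float16).
encoder = tcnn.Encoding(n_input_dims=3, encoding_config={
    "otype": "HashGrid",
    "n_levels": 16,
    "n_features_per_level": 2,
    "log2_hashmap_size": 19,
    "base_resolution": 16,
})

# Fully fused float16 MLP standing in for the float32 _MLP
# (same width/depth as the defaults internal_dims=32, hidden=2).
net = tcnn.Network(
    n_input_dims=encoder.n_output_dims,
    n_output_dims=9,  # example: kd (3) + ks (3) + normal perturbation (3)
    network_config={
        "otype": "FullyFusedMLP",
        "activation": "ReLU",
        "output_activation": "None",
        "n_neurons": 32,
        "n_hidden_layers": 2,
    })

def sample(texc):
    # texc: normalized 3D texture coordinates in [0, 1], shape (N, 3)
    return net(encoder(texc))
```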

For example, this is the result after a few iterations (~200) when training on the lego dataset with the tcnn.Network: [attached image: img_dmtet_pass1_000002]

For reference, this is the result with the standard _MLP network after the same number of iterations: [attached image: img_dmtet_pass1_000002]

Best regards, Konstantinos

jmunkberg commented 2 months ago

Hello @ktertikas,

We used a standard float32 MLP mostly for compatibility reasons, so that the code runs on more servers. If you want to run both the encoder and the MLP through tcnn in float16, there may be some gradient-scaling issues to consider.

Please refer to the tcnn code base for the recommended gradient-scaling approach for that use case. Note also that we currently scale the gradients of the encoder, which may need to be removed or adapted if you run everything through tcnn.
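Schematically, the current scaling looks like this (a simplified sketch rather than the exact code in render/mlptexture.py; the scale constant and the 9-channel output here are illustrative):

```python
import torch
import tinycudann as tcnn

GRAD_SCALE = 128.0  # chosen large enough to lift fp16 gradients above underflow

encoder = tcnn.Encoding(n_input_dims=3, encoding_config={
    "otype": "HashGrid", "n_levels": 16, "n_features_per_level": 2,
    "log2_hashmap_size": 19, "base_resolution": 16})

# Standard float32 MLP head on top of the fp16 encoding.
mlp = torch.nn.Sequential(
    torch.nn.Linear(encoder.n_output_dims, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 9)).cuda()

# Gradients flowing back into the MLP input are scaled up by GRAD_SCALE, so the
# fp16 encoder backward pass sees values large enough to represent. The encoder
# parameter gradients end up scaled by the same factor, which Adam largely
# normalizes away; with plain SGD you would have to unscale them explicitly.
mlp.register_full_backward_hook(
    lambda module, grad_i, grad_o: (grad_i[0] * GRAD_SCALE,))

# The matching hook on the encoder divides the gradient w.r.t. its input back
# down, so everything upstream of the encoder sees unscaled gradients again.
encoder.register_full_backward_hook(
    lambda module, grad_i, grad_o: (grad_i[0] / GRAD_SCALE,))

def sample(texc):
    # texc in [0, 1]^3; cast the fp16 encoding up before the float32 layers.
    return mlp(encoder(texc).float())
```

If you replace the MLP with a tcnn.Network, these hooks no longer match the module boundaries and would need to be removed or adapted accordingly.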

For a general overview of gradient scaling and some best practices, please refer to https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html
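The standard PyTorch recipe from that guide looks roughly like this (a generic sketch with placeholder `model`, `optimizer`, `loader`, and `compute_loss`, not code from this repo):

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling for mixed precision

for x, target in loader:                       # placeholder data loader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():            # run the forward pass in fp16 where safe
        loss = compute_loss(model(x), target)  # placeholder model / loss
    scaler.scale(loss).backward()              # scale the loss so fp16 grads don't underflow
    scaler.step(optimizer)                     # unscales grads; skips the step on inf/nan
    scaler.update()                            # adapts the scale factor over time
```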

ktertikas commented 2 months ago

Thank you for the quick response, @jmunkberg! I have played around with gradient scaling a bit, but it probably needs to be done in a more principled way. In any case, thank you for the reference link!