SJoJoK / 3DGStream

[CVPR 2024 Highlight] Official repository for the paper "3DGStream: On-the-fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos".
https://sjojok.github.io/3dgstream
MIT License

NaNs Encountered with tcnn Model During NTC Warmup Test #8

shaune0000 closed this issue 3 months ago

shaune0000 commented 4 months ago

Hi there,

Thank you for the clear code and detailed steps provided. However, while running the test data through the NTC warmup section, I encountered an issue where all values returned by the tcnn model were NaNs starting from the second iteration. I am currently using the flame steak dataset for testing. Could I have missed something in the process?

SJoJoK commented 4 months ago

Hi, could you please provide a screenshot of the output/log? On my machine, directly running cache_warmup.ipynb without any modification works fine (the cache_F_4.json and point_cloud.ply are already in this repo). (screenshots attached)

shaune0000 commented 4 months ago

Thanks for reply.

I'm running on Windows with this env: Python 3.11, torch 2.3.0+cu118, and the latest version of tinycudann.

I got this while running.

(screenshot attached)

SJoJoK commented 4 months ago

My env:
- OS: Ubuntu 22.04
- GPU: RTX A6000
- Driver: 535.86.05
- CUDA: 11.8
- Python: 3.8
- PyTorch: 2.0.1+cu118
- tinycudann: 1.7
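For comparison, here is a minimal version-check sketch (assuming torch and tinycudann are importable; tinycudann may not expose a `__version__` attribute, hence the fallback):

```python
# Minimal sketch: print the versions that matter here so a local setup can be
# compared against the environment listed above.
import torch
import tinycudann as tcnn

print("PyTorch:", torch.__version__)               # expected: 2.0.1+cu118
print("CUDA (torch build):", torch.version.cuda)   # expected: 11.8
print("GPU:", torch.cuda.get_device_name(0))
print("tinycudann:", getattr(tcnn, "__version__", "unknown"))  # expected: 1.7
```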

I apologize for the inconvenience, but this issue might stem from the implementation of tinycudann or an environment mismatch. FYI, some researchers have successfully re-run the experiments with the pre-release code, and others have not encountered the NaN issue, though they have faced other problems (for instance, see https://github.com/SJoJoK/3DGStream/issues/7).

For debugging purposes, I recommend printing the inputs, outputs, and losses (loss_xyz, loss_rot, and loss_dummy). NaN outputs are often linked to NaN values in the losses or inputs, which disrupt gradient-based optimization, so identifying where the NaNs first appear can significantly simplify debugging. My personal practice is to print the sum of each tensor (e.g., masked_d_xyz.sum(), masked_d_rot.sum()) so that any NaN values show up immediately.
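A minimal sketch of this approach (the tensor and loss names in the usage comments follow the ones mentioned above; call report() inside the warmup loop after the forward pass):

```python
import torch

def report(name, t):
    # A single NaN anywhere in the tensor makes the sum NaN, so the sum is a
    # cheap first probe; torch.isnan gives the definitive answer.
    print(f"{name}: sum={t.sum().item():.6g}, has_nan={bool(torch.isnan(t).any())}")

# Example usage inside the warmup loop, after the forward pass:
#   report("masked_d_xyz", masked_d_xyz)
#   report("masked_d_rot", masked_d_rot)
#   report("loss_xyz", loss_xyz)
#   report("loss_rot", loss_rot)
#   report("loss_dummy", loss_dummy)
```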

shaune0000 commented 3 months ago

Thank you for your reply. I think the problem has become clear after conducting some tests.

On Windows with PyTorch 2.0.1, it is possible to train the NTC, but Gaussian rasterization fails with a DLL load error when trying to train the Gaussian model. That problem is resolved in PyTorch 2.3.0, but then, somehow, the tinycudann network does not work when training the NTC.
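If anyone hits the same thing, a quick sanity check like the sketch below (assuming the standard 3DGS extension module name diff_gaussian_rasterization, which may differ in this repo) can show which CUDA extension breaks under a given PyTorch build:

```python
import torch

# Check that the Gaussian rasterization extension loads at all (this is where
# the DLL load failure shows up on Windows + PyTorch 2.0.1).
try:
    import diff_gaussian_rasterization  # module name assumed from 3DGS
    print("diff_gaussian_rasterization: import OK")
except Exception as e:
    print("diff_gaussian_rasterization: import FAILED:", e)

# Check that tinycudann imports and that a small MLP returns finite values
# (this is where the NaNs show up on Windows + PyTorch 2.3.0).
try:
    import tinycudann as tcnn
    net = tcnn.Network(
        n_input_dims=3,
        n_output_dims=4,
        network_config={
            "otype": "FullyFusedMLP",
            "activation": "ReLU",
            "output_activation": "None",
            "n_neurons": 64,
            "n_hidden_layers": 2,
        },
    )
    y = net(torch.rand(128, 3, device="cuda"))
    print("tinycudann forward finite:", bool(torch.isfinite(y).all()))
except Exception as e:
    print("tinycudann: FAILED:", e)
```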

Therefore, sticking to the environment you mentioned should be fine.

SJoJoK commented 3 months ago

Glad to help:)