Known issue: textureless mesh extraction & bad intermediate checkpoints

+@mli0603. This is related to multiple reported issues (#62, #75, and potentially others).

There seems to be an issue with the .state_dict() method of torch.nn.Module classes, which could be a PyTorch bug. Specifically, the extracted state dict occasionally fails to match (a subset of) the module parameters, which leaves the saved checkpoints partially corrupted. When this happens in the final layers of the neural SDF/RGB networks, it can result in bad geometry (#75) or a uniformly gray color (sigmoid(0)=0.5) for the object (#62).
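One way to check for this kind of mismatch is to compare what .state_dict() returns against the live parameters and buffers right before saving. The snippet below is only a minimal sketch: the helper name find_state_dict_mismatches and the toy network are made up for illustration and are not part of the Neuralangelo code base.

```python
from typing import List

import torch
from torch import nn


def find_state_dict_mismatches(module: nn.Module) -> List[str]:
    """Return names whose state_dict() entry differs from the live tensor.

    Hypothetical helper for debugging; not part of Neuralangelo or PyTorch.
    """
    state = module.state_dict()
    live = dict(module.named_parameters())
    live.update(dict(module.named_buffers()))
    mismatched = []
    for name, tensor in live.items():
        if name not in state:
            # Non-persistent buffers are legitimately absent from the state dict.
            continue
        saved = state[name].detach().cpu()
        current = tensor.detach().cpu()
        # torch.equal also returns False if either tensor contains NaN.
        if saved.shape != current.shape or not torch.equal(saved, current):
            mismatched.append(name)
    return mismatched


# Toy stand-in for the last layers of an SDF MLP (assumed shapes).
toy = nn.Sequential(nn.Linear(8, 16), nn.Softplus(), nn.Linear(16, 1))
print(find_state_dict_mismatches(toy))
```

Calling this on the model just before each checkpoint save, and logging any non-empty result, should at least show whether the mismatch originates in .state_dict() itself or later in the saving path.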

This seems to be reproducible with the following command (using the pre-processed toy Lego example):

torchrun --nproc_per_node=1 train.py \
    --logdir=logs/debug/lego --show_pbar \
    --config=projects/neuralangelo/configs/custom/lego.yaml \
    --data.root=datasets/lego_ds2 \
    --max_iter=20000 --checkpoint.save_iter=1000 \
    --model.object.sdf.encoding.coarse2fine.step=200 \
    --model.object.sdf.encoding.hashgrid.dict_size=19 \
    --optim.sched.warm_up_end=200 \
    --optim.sched.two_steps=[12000,16000]

At iteration 2000, the checkpointed parameter module.neural_sdf.mlp.linear_sdf.weight would be corrupted.
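To inspect that parameter in a saved checkpoint, something like the following could help. It is a rough sketch under assumptions: describe_tensor is a hypothetical helper, the tensor shape in the demo is invented, and the checkpoint path and the "model" key in the commented usage are guesses about the checkpoint layout. The all-zero and non-finite checks are just two patterns worth ruling out, not a definition of what "corrupted" looks like here.

```python
import torch


def describe_tensor(name, state):
    """Summarize one state-dict entry so degenerate values stand out.

    Hypothetical helper for debugging; not part of Neuralangelo.
    """
    t = state[name].detach().float().cpu()
    return {
        "shape": tuple(t.shape),
        "all_zero": bool((t == 0).all()),
        "has_nonfinite": bool((~torch.isfinite(t)).any()),
        "abs_max": float(t.abs().max()) if t.numel() > 0 else 0.0,
    }


KEY = "module.neural_sdf.mlp.linear_sdf.weight"

# Demo on a fake state dict; the (1, 256) shape is an assumption.
fake_state = {KEY: torch.zeros(1, 256)}
print(describe_tensor(KEY, fake_state))

# Hypothetical usage on a real checkpoint (path and "model" key are assumptions):
# ckpt = torch.load("path/to/checkpoint.pt", map_location="cpu")
# print(describe_tensor(KEY, ckpt["model"]))
```

Comparing the summary for the iteration-1000 and iteration-2000 checkpoints should make a sudden change in this layer easy to spot.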
