ayaanzhaque / instruct-nerf2nerf

Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions (ICCV 2023)
https://instruct-nerf2nerf.github.io/
MIT License

NaN in the gt_patches for lpips #90

Closed · suvigy closed this issue 4 months ago

suvigy commented 5 months ago

Thanks for your work. I'm trying to use in2n with a trained nerfacto model, but I'm getting NaNs when using LPIPS.

  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1533, in _call_impl
    return forward_call(*args, **kwargs)
  File "/venv/lib/python3.8/site-packages/torchmetrics/metric.py", line 298, in forward
    self._forward_cache = self._forward_reduce_state_update(*args, **kwargs)
  File "/venv/lib/python3.8/site-packages/torchmetrics/metric.py", line 367, in _forward_reduce_state_update
    self.update(*args, **kwargs)
  File "/venv/lib/python3.8/site-packages/torchmetrics/metric.py", line 460, in wrapped_func
    update(*args, **kwargs)
  File "/venv/lib/python3.8/site-packages/torchmetrics/image/lpip.py", line 139, in update
    loss, total = _lpips_update(img1, img2, net=self.net, normalize=self.normalize)
  File "/venv/lib/python3.8/site-packages/torchmetrics/functional/image/lpips.py", line 381, in _lpips_update
    raise ValueError(
ValueError: Expected both input arguments to be normalized tensors with shape [N, 3, H, W]. Got input with shape torch.Size([16, 3, 32, 32]) and torch.Size([16, 3, 32, 32]) and values in range [tensor(-0.8435, device='cuda:0', grad_fn=<MinBackward1>), tensor(0.8389, device='cuda:0', grad_fn=<MaxBackward1>)] and [tensor(nan, device='cuda:0'), tensor(nan, device='cuda:0')] when all values are expected to be in the [-1, 1] range.

I checked a bit and saw that gt_patches sometimes contains NaNs at the beginning of training; the NaNs then spread, and after some steps all of gt_patches contains NaNs.

Could you give me a hint as to what the reason might be? I don't use the camera optimizer, but I do use masks. At first I thought it was because of the masks, but since the NaNs spread over time, I guess the reason is different. Image resolution is 480x320 (closest to 512x512).

ayaanzhaque commented 5 months ago

I noticed this sometimes occurred when training past 30k iterations for some reason. Can you try training the nerfacto model to just 20k iterations and then continuing with in2n from that checkpoint? For some reason that tends to fix the issue. Let me know if that does not work.
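
Roughly something like this (a sketch, not exact commands: the paths are placeholders, --max-num-iterations is the standard nerfstudio trainer option, and --load-dir should point at the nerfstudio_models folder of the 20k nerfacto run):

  # 1) Train nerfacto for only 20k iterations instead of the default 30k
  ns-train nerfacto --data <your data> --output-dir <output dir> --max-num-iterations 20000

  # 2) Continue with in2n from that run's checkpoint directory
  ns-train in2n --data <your data> --load-dir <nerfacto output>/nerfstudio_models --pipeline.prompt "<your prompt>" --pipeline.guidance-scale 7.5 --pipeline.image-guidance-scale 1.5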

suvigy commented 5 months ago

I tried it with 20k iterations, but it still gives me NaNs right at the beginning. I also tried lowering the LPIPS loss multiplier, but that still results in NaNs at the beginning (in the gt_patches).

This is how I run it. My scene does not have an object-centric camera path; it is a backward driving sequence. I turned off camera optimization because the COLMAP path was quite pixel-accurate, and turning it on just screws up the camera positions.

ns-train in2n --data <my transforms.json> --output-dir <output dir> --load-dir <trained nerfacto model dir> --pipeline.model.camera-optimizer.mode=off --pipeline.prompt "<my prompt>" --pipeline.guidance-scale 7.5 --pipeline.image-guidance-scale 1.5  nerfstudio-data --downscale-factor 4

I optionally tried --pipeline.model.lpips-loss-mult 0.4, but it still produces NaNs. I also tried a learning-rate warmup for the fields and a lower learning rate (--optimizers.fields.optimizer.lr=0.005 --optimizers.fields.scheduler.warmup-steps=1000), but that didn't help either.

ayaanzhaque commented 5 months ago

I see. Unfortunately it might just be a quirk of your specific dataset, as I do remember running into some of these issues. One solution would be to turn the LPIPS loss off entirely; the results should still be pretty decent.
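
One way to try that, reusing the multiplier flag you already used, would be something like the command below. This assumes that a zero multiplier effectively disables the LPIPS term; if the metric is still evaluated internally and still raises the same ValueError, the LPIPS call would need to be guarded or removed in the model code instead.

  ns-train in2n --data <my transforms.json> --output-dir <output dir> --load-dir <trained nerfacto model dir> --pipeline.model.camera-optimizer.mode=off --pipeline.model.lpips-loss-mult 0 --pipeline.prompt "<my prompt>" --pipeline.guidance-scale 7.5 --pipeline.image-guidance-scale 1.5 nerfstudio-data --downscale-factor 4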