Open y0un0 opened 2 years ago
Hi. Yes, this happens in the KL-divergence computation, when the mu or sigma term becomes NaN. One thing to try is lowering the weight of the latent loss term. Try it and let me know if it works.
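As background on why this fails: for a diagonal Gaussian posterior, the closed-form KL term against a standard normal involves exp(logvar), which can overflow to inf and then propagate NaN through mu and sigma via the gradients. A minimal NumPy sketch of a safeguarded KL (illustrative only, not the actual UCNet code; the clamp value is an assumption):

```python
import numpy as np

def kl_diag_gaussian(mu, logvar, clamp=10.0):
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ) per sample.
    # Clamping logvar is a common safeguard against exp() overflowing
    # to inf, which would turn the loss (and its gradients) into nan.
    logvar = np.clip(logvar, -clamp, clamp)
    var = np.exp(logvar)
    return 0.5 * np.sum(mu**2 + var - logvar - 1.0, axis=-1)

# An extreme logvar would overflow without the clamp; with it, KL stays finite.
kl = kl_diag_gaussian(np.zeros((10, 3)), np.full((10, 3), 1000.0))
print(np.all(np.isfinite(kl)))  # True
```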
Cheers, Jing
-- Jing Zhang, Ph.D. Student, College of Engineering and Computer Science, Australian National University.
OK, I will try. Thank you.
Do you have an idea of what value I should use for the latent weight? I tried --lat_weight=5.0, but unfortunately the error popped up again.
Try values in the range 1-5; 5 works for me.
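For reference, a hypothetical sketch of how a lat_weight in the 1-5 range scales the latent (KL) term in the total objective; the actual training code may combine the terms differently:

```python
# Hypothetical combination of the saliency loss and the latent (KL) loss;
# lat_weight is the knob discussed above.  A smaller lat_weight shrinks
# the KL term's contribution to the gradients, which can keep mu/logvar
# from diverging early in training.
def total_loss(sal_loss, latent_loss, lat_weight=5.0):
    return sal_loss + lat_weight * latent_loss

print(total_loss(1.0, 0.2, lat_weight=5.0))  # 2.0
```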
OK, I launched another experiment with --lat_weight=1.0; I'll keep you updated on the results.
It worked with lat_weight=1.0. I was wondering, can the batch size influence the value we should use for lat_weight? I'm asking because I used a batch size of 10 instead of 5.
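On the batch-size question: whether lat_weight needs rescaling depends on how each loss term is reduced over the batch. If both terms use a mean reduction, their ratio, and hence a good lat_weight, is roughly batch-size independent; with a sum reduction, doubling the batch doubles the KL term and shifts the balance. A small NumPy sketch (illustrative, not the repository's code):

```python
import numpy as np

def combined(seg_per_sample, kl_per_sample, lat_weight, reduction="mean"):
    # With 'mean' reduction the batch size cancels out of both terms,
    # so the same lat_weight balances them at batch size 5 or 10.
    # With 'sum' reduction the KL term grows with the batch, changing
    # the effective weighting.
    red = np.mean if reduction == "mean" else np.sum
    return red(seg_per_sample) + lat_weight * red(kl_per_sample)

seg, kl = np.array([0.5, 0.7]), np.array([0.1, 0.3])
small = combined(seg, kl, lat_weight=1.0)
large = combined(np.tile(seg, 5), np.tile(kl, 5), lat_weight=1.0)
print(np.isclose(small, large))  # True: mean reduction is batch-size invariant
```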
Hello,
Have you encountered this error during training ?
Traceback (most recent call last):
  File "train.py", line 136, in <module>
    pred_post, pred_prior, latent_loss, depth_pred_post, depth_pred_prior = generator.forward(images, depths, gts)
  File "/content/UCNet/model/ResNet_models.py", line 157, in forward
    self.posterior, muxy, logvarxy = self.xy_encoder(torch.cat((x, depth, y), 1))
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/UCNet/model/ResNet_models.py", line 125, in forward
    dist = Independent(Normal(loc=mu, scale=torch.exp(logvar)), 1)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributions/normal.py", line 50, in __init__
    super(Normal, self).__init__(batch_shape, validate_args=validate_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributions/distribution.py", line 56, in __init__
    f"Expected parameter {param} "
ValueError: Expected parameter loc (Tensor of shape (10, 3)) of distribution Normal(loc: torch.Size([10, 3]), scale: torch.Size([10, 3])) to satisfy the constraint Real(), but found invalid values:
tensor([[nan, nan, nan],
        [nan, nan, nan],
        [nan, nan, nan],
        [nan, nan, nan],
        [nan, nan, nan],
        [nan, nan, nan],
        [nan, nan, nan],
        [nan, nan, nan],
        [nan, nan, nan],
        [nan, nan, nan]], device='cuda:0', grad_fn=)