Closed · JonasFerreiraSilva closed this issue 6 months ago
Hi @JonasFerreiraSilva,
I think the NaN loss only happens occasionally during training. Perhaps you could skip those bad batches by zeroing out the gradient when the loss is NaN. You could do something like this:
loss_dict['cls_loss'] = torch.nan_to_num(loss_dict['cls_loss'])
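As a rough alternative, here is a minimal sketch of skipping the bad batch outright, assuming a plain PyTorch training loop; net, optimizer and dataloader are placeholder names, not anything from this repository:

import torch

for inputs, targets in dataloader:
    loss_dict = net(inputs, targets=targets)
    cls_loss = loss_dict['cls_loss']
    optimizer.zero_grad()
    # Skip the update entirely when the loss is NaN/Inf,
    # so no corrupted gradients ever reach the parameters.
    if not torch.isfinite(cls_loss):
        continue
    cls_loss.backward()
    optimizer.step()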
Cheers, Fred.
@JonasFerreiraSilva
Did you solve this problem? I faced the same error and modified my code at line 191 of utils.py according to @fredzzhang's comments, but the loss is 0.0000.
loss_dict = self._state.net(
    *self._state.inputs, targets=self._state.targets)
if loss_dict['cls_loss'].isnan():
    loss_dict['cls_loss'] = torch.nan_to_num(loss_dict['cls_loss'])
    print(f"The HOI loss is NaN for rank {self._rank}")
Hi @hellog2n,
If that doesn't solve the issue, I think you need to add that function somewhere earlier in the network. See this post for reference.
Since it's normally the spatial features that cause the issue, you can add it here.
# Compute spatial features
pairwise_spatial = compute_spatial_encodings(
    [boxes[x],], [boxes[y],], [image_sizes[i],]
)
pairwise_spatial = torch.nan_to_num(pairwise_spatial)
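If it is not obvious where the NaNs first appear, a quick way to narrow it down, using only standard PyTorch utilities (a debugging sketch, not part of the repository), is to enable anomaly detection and assert that the spatial features are finite:

import torch

# Make autograd raise an error pointing at the forward op whose backward
# pass produced NaN; this slows training, so enable it only while debugging.
torch.autograd.set_detect_anomaly(True)

# Right after computing the spatial features, check them before they
# propagate any further.
assert torch.isfinite(pairwise_spatial).all(), \
    "NaN/Inf in pairwise_spatial; inspect the corresponding boxes"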
@fredzzhang Thank you for your response! I solved it. :)
@hellog2n Can you share what exactly you did that solved this? Was it what @fredzzhang suggested?
@morrisalp Hi, did you solve the problem??
Hello, I have a problem when trying to use pvic with a “modified” DETR. I fine-tuned the DETR from its own repository; running with the base DETR works, but when I run with the fine-tuned DETR weights as follows:
It ends up failing with this error:
My problem seems similar to the last two comments on this issue here.
Do you have any suggestions on how I can solve this issue?
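In case it helps while this is open: this kind of failure with swapped-in DETR weights is often a key or shape mismatch in the checkpoint. Below is a sanity-check sketch, assuming the official DETR repository's convention of saving the weights under a 'model' key; the file names are placeholders:

import torch

base = torch.load("base_detr_checkpoint.pth", map_location="cpu")      # placeholder path
custom = torch.load("my_finetuned_detr.pth", map_location="cpu")       # placeholder path

# Official DETR checkpoints store the weights under the 'model' key;
# fall back to the raw dict if that key is absent.
base_sd = base.get("model", base)
custom_sd = custom.get("model", custom)

missing = set(base_sd) - set(custom_sd)
unexpected = set(custom_sd) - set(base_sd)
print("Keys missing from the fine-tuned checkpoint:", sorted(missing)[:10])
print("Unexpected keys in the fine-tuned checkpoint:", sorted(unexpected)[:10])

# Shape mismatches (e.g. a different number of classes) also break loading.
for k in set(base_sd) & set(custom_sd):
    if base_sd[k].shape != custom_sd[k].shape:
        print(f"Shape mismatch for {k}: {base_sd[k].shape} vs {custom_sd[k].shape}")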