fredzzhang / pvic

Official PyTorch implementation for ICCV2023 paper "Exploring Predicate Visual Context in Detecting Human-Object Interactions"
BSD 3-Clause "New" or "Revised" License

NaN when I use a DETR that I fine tuned #48

Closed JonasFerreiraSilva closed 1 month ago

JonasFerreiraSilva commented 2 months ago

Hello, I have a problem when trying to use PViC with a “modified” DETR. I fine-tuned the DETR that is found within its repository. Running with the base DETR works, but with the fine-tuned DETR weights, running as follows:

[image: unnamed]

It ends up failing and giving this error:

[image: Capturar2]

My problem looks similar to the one described in the last two comments on this issue here

Do you have any suggestions on how I can solve this issue?

fredzzhang commented 2 months ago

Hi @JonasFerreiraSilva,

I think the NaN loss only happens occasionally during training. Perhaps you could skip those bad batches by zeroing out the gradient when the loss is NaN. You could do something like this:

loss_dict['cls_loss'] = torch.nan_to_num(loss_dict['cls_loss'])
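
An alternative way to skip a bad batch is to check the loss before calling backward and simply not update the parameters. The following is a minimal sketch, not code from PViC: `training_step`, `model`, `optimizer` and `loss_fn` are hypothetical names for a generic PyTorch training loop.

```python
import torch
from torch import nn


def training_step(model, optimizer, inputs, targets, loss_fn):
    """Run one optimisation step; skip the update if the loss is NaN or Inf.

    Hypothetical helper for illustration only (not part of PViC).
    Returns the loss as a float, or None if the batch was skipped.
    """
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    if not torch.isfinite(loss):
        # Bad batch: no backward pass and no parameter update.
        return None
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Linear(2, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    bad_inputs = torch.tensor([[float("nan"), 1.0]])
    print(training_step(model, optimizer, bad_inputs, torch.zeros(1, 1), nn.MSELoss()))
```

Note that with `DistributedDataParallel`, every rank would have to make the same skip decision, since a rank that skips its backward pass while the others do not can stall gradient synchronisation.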

Cheers, Fred.

hellog2n commented 1 month ago

@JonasFerreiraSilva

Did you solve this problem? I hit the same error and modified line 191 of utils.py following @fredzzhang's comment, but now the loss is 0.0000.


        loss_dict = self._state.net(
            *self._state.inputs, targets=self._state.targets)
        if loss_dict['cls_loss'].isnan():
            loss_dict['cls_loss'] = torch.nan_to_num(loss_dict['cls_loss'])
            print(f"The HOI loss is NaN for rank {self._rank}")

fredzzhang commented 1 month ago

Hi @hellog2n,

If that doesn't solve the issue, I think you need to add that function somewhere earlier in the network. See this post for reference.
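
As a toy illustration of why the placement matters (a sketch with made-up tensors, not PViC code; `features` and `weight` are hypothetical): `torch.nan_to_num` replaces NaN with 0 by default, so applying it to a NaN loss yields a loss of exactly zero, whereas applying it to the offending intermediate tensor lets the remaining valid entries still contribute.

```python
import torch

# One NaN entry among otherwise valid "features" (hypothetical values).
features = torch.tensor([[0.5, float("nan")], [1.0, 2.0]])
weight = torch.ones(2, 1, requires_grad=True)

# (a) Sanitise only the final loss: the NaN has already spread through
#     the mean, so the whole loss is replaced by zero.
loss_late = torch.nan_to_num((features @ weight).mean())

# (b) Sanitise the intermediate tensor: only the bad entry becomes zero,
#     and the valid entries still produce a meaningful loss.
loss_early = (torch.nan_to_num(features) @ weight).mean()
loss_early.backward()

print(loss_late.item())   # 0.0
print(loss_early.item())  # 1.75
print(torch.isfinite(weight.grad).all().item())  # True
```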

Since it's normally the spatial features that cause the issue, you can add it here.

# Compute spatial features
pairwise_spatial = compute_spatial_encodings(
    [boxes[x],], [boxes[y],], [image_sizes[i],]
)
pairwise_spatial = torch.nan_to_num(pairwise_spatial)

hellog2n commented 1 month ago

@fredzzhang Thank you for your response! I solved it. :)

morrisalp commented 1 month ago

@hellog2n Can you share what exactly you did that solved this? Was it what @fredzzhang suggested?