Closed · JonasFerreiraSilva closed this issue 6 months ago
Hi @JonasFerreiraSilva,
I think the NaN loss only happens occasionally during training. Perhaps you could skip those bad batches by zeroing out the gradient when the loss is NaN. You could do something like this:
loss_dict['cls_loss'] = torch.nan_to_num(loss_dict['cls_loss'])
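As a rough alternative, here is a minimal sketch of skipping the bad batch outright, assuming a plain PyTorch training loop; net, optimizer and dataloader are placeholder names, not anything from this repository:

import torch

for inputs, targets in dataloader:
    loss_dict = net(inputs, targets=targets)
    cls_loss = loss_dict['cls_loss']
    optimizer.zero_grad()
    # Skip the update entirely when the loss is NaN/Inf,
    # so no corrupted gradients ever reach the parameters.
    if not torch.isfinite(cls_loss):
        continue
    cls_loss.backward()
    optimizer.step()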
Cheers, Fred.
@JonasFerreiraSilva
Did you solve this problem? I faced the same error and modified my code at line 191 of utils.py according to @fredzzhang's comments, but the loss is 0.0000.
loss_dict = self._state.net(
    *self._state.inputs, targets=self._state.targets)
if loss_dict['cls_loss'].isnan():
    loss_dict['cls_loss'] = torch.nan_to_num(loss_dict['cls_loss'])
    print(f"The HOI loss is NaN for rank {self._rank}")
Hi @hellog2n,
If that doesn't solve the issue, I think you need to add that function somewhere earlier in the network. See this post for reference.
Since it's normally the spatial features that cause the issue, you can add it here.
# Compute spatial features
pairwise_spatial = compute_spatial_encodings(
    [boxes[x],], [boxes[y],], [image_sizes[i],]
)
pairwise_spatial = torch.nan_to_num(pairwise_spatial)
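If it is not obvious where the NaNs first appear, a quick way to narrow it down, using only standard PyTorch utilities (a debugging sketch, not part of the repository), is to enable anomaly detection and assert that the spatial features are finite:

import torch

# Make autograd raise an error pointing at the forward op whose backward
# pass produced NaN; this slows training, so enable it only while debugging.
torch.autograd.set_detect_anomaly(True)

# Right after computing the spatial features, check them before they
# propagate any further.
assert torch.isfinite(pairwise_spatial).all(), \
    "NaN/Inf in pairwise_spatial; inspect the corresponding boxes"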
@fredzzhang Thank you for your response! I solved it. :)
@hellog2n Can you share what exactly you did that solved this? Was it what @fredzzhang suggested?
@morrisalp Hi, did you solve the problem??
Hello, I have a problem when trying to use pvic with a “modified” DETR. I fine-tuned the DETR from its own repository; running with the base DETR works, but when I run with the fine-tuned DETR weights as follows:
It ends up failing with this error:
My problem seems similar to the last two comments on this issue here.
Do you have any suggestions on how I can solve this issue?
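In case it helps while this is open: this kind of failure with swapped-in DETR weights is often a key or shape mismatch in the checkpoint. Below is a sanity-check sketch, assuming the official DETR repository's convention of saving the weights under a 'model' key; the file names are placeholders:

import torch

base = torch.load("base_detr_checkpoint.pth", map_location="cpu")      # placeholder path
custom = torch.load("my_finetuned_detr.pth", map_location="cpu")       # placeholder path

# Official DETR checkpoints store the weights under the 'model' key;
# fall back to the raw dict if that key is absent.
base_sd = base.get("model", base)
custom_sd = custom.get("model", custom)

missing = set(base_sd) - set(custom_sd)
unexpected = set(custom_sd) - set(base_sd)
print("Keys missing from the fine-tuned checkpoint:", sorted(missing)[:10])
print("Unexpected keys in the fine-tuned checkpoint:", sorted(unexpected)[:10])

# Shape mismatches (e.g. a different number of classes) also break loading.
for k in set(base_sd) & set(custom_sd):
    if base_sd[k].shape != custom_sd[k].shape:
        print(f"Shape mismatch for {k}: {base_sd[k].shape} vs {custom_sd[k].shape}")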