fredzzhang / pvic

[ICCV'23] Official PyTorch implementation for paper "Exploring Predicate Visual Context in Detecting Human-Object Interactions"
BSD 3-Clause "New" or "Revised" License

Training ERROR #53

Closed xiezouqu closed 1 month ago

xiezouqu commented 1 month ago

Hello, I really appreciate the work you have done. I encountered the following issue during training, and I would greatly appreciate your help in solving it.

WARNING: Collected results are empty. Return zero AP for class 597.
WARNING: Collected results are empty. Return zero AP for class 598.
WARNING: Collected results are empty. Return zero AP for class 599.
Epoch 0 => mAP: 0.0000, rare: 0.0000, none-rare: 0.0000.
Traceback (most recent call last):
  File "main.py", line 192, in <module>
    mp.spawn(main, nprocs=args.world_size, args=(args,))
  File "/data1/hujiajun/software/anaconda3/envs/pvic/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/data1/hujiajun/software/anaconda3/envs/pvic/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/data1/hujiajun/software/anaconda3/envs/pvic/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/data1/hujiajun/software/anaconda3/envs/pvic/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/data1/hujiajun/workstation/pvic/main.py", line 127, in main
    engine(args.epochs)
  File "/data1/hujiajun/workstation/pvic/pocket/pocket/core/distributed.py", line 139, in __call__
    self._on_each_iteration()
  File "/data1/hujiajun/workstation/pvic/utils.py", line 195, in _on_each_iteration
    raise ValueError(f"The HOI loss is NaN for rank {self._rank}")
ValueError: The HOI loss is NaN for rank 2

fredzzhang commented 1 month ago

Hi @xiezouqu,

The NaN loss is a known issue. Could you refer to this post and see if it solves the problem?

Fred.

xiezouqu commented 1 month ago

@fredzzhang Sure, I'll give it a try. Thank you very much.

xiezouqu commented 1 month ago

I have already tried adding "pairwise_spatial = torch.nan_to_num(pairwise_spatial)":

            # Compute spatial features
            pairwise_spatial = compute_spatial_encodings(
                [boxes[x],], [boxes[y],], [image_sizes[i],]
            )
            pairwise_spatial = torch.nan_to_num(pairwise_spatial)
            pairwise_spatial = self.spatial_head(pairwise_spatial)
            pairwise_spatial_reshaped = pairwise_spatial.reshape(n, n, -1)

I ran "DETR=base python main.py --pretrained checkpoints/detr-r50-hicodet.pth --output-dir outputs/pvic-detr-r50-hicodet --world-size 4" but still encountered the same error:

WARNING: Collected results are empty. Return zero AP for class 598.
WARNING: Collected results are empty. Return zero AP for class 599.
Epoch 0 =>      mAP: 0.0000, rare: 0.0000, none-rare: 0.0000.
Traceback (most recent call last):
  File "main.py", line 192, in <module>
    mp.spawn(main, nprocs=args.world_size, args=(args,))
  File "/data1/hujiajun/software/anaconda3/envs/pvic/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/data1/hujiajun/software/anaconda3/envs/pvic/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/data1/hujiajun/software/anaconda3/envs/pvic/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/data1/hujiajun/software/anaconda3/envs/pvic/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/data1/hujiajun/workstation/pvic/main.py", line 127, in main
    engine(args.epochs)
  File "/data1/hujiajun/workstation/pvic/pocket/pocket/core/distributed.py", line 139, in __call__
    self._on_each_iteration()
  File "/data1/hujiajun/workstation/pvic/utils.py", line 195, in _on_each_iteration
    raise ValueError(f"The HOI loss is NaN for rank {self._rank}")
ValueError: The HOI loss is NaN for rank 2

xiezouqu commented 1 month ago

@fredzzhang

fredzzhang commented 1 month ago

Based on the log you posted, the model is returning 0 mAP at initialisation, which suggests the pre-trained model probably wasn't loaded. Did you check whether the pre-trained detector weights actually exist at the path checkpoints/detr-r50-hicodet.pth?
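
A quick way to rule out a missing file before launching training is to check the path explicitly. The following is a minimal sketch using only the standard library; `check_checkpoint` is a hypothetical helper written for illustration and is not part of the PViC code base. It only confirms that the file exists and is non-empty, not that its contents are valid weights.

```python
from pathlib import Path


def check_checkpoint(path: str) -> Path:
    """Return the resolved checkpoint path, or raise if it is missing or empty.

    Hypothetical helper for illustration; not part of the PViC repository.
    """
    ckpt = Path(path).expanduser().resolve()
    if not ckpt.is_file():
        raise FileNotFoundError(f"No checkpoint found at {ckpt}")
    if ckpt.stat().st_size == 0:
        raise ValueError(f"Checkpoint at {ckpt} is empty")
    return ckpt
```

Calling `check_checkpoint("checkpoints/detr-r50-hicodet.pth")` from the repository root should raise immediately if the weights were never downloaded or were saved under a different name.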

xiezouqu commented 1 month ago

You're right, it is as you said. Thank you for your response. @fredzzhang

caochong12 commented 1 month ago

Hello sir, I have made the following changes based on your proposed amendments:

    if loss_dict['cls_loss'].isnan():
        loss_dict['cls_loss'] = torch.nan_to_num(loss_dict['cls_loss'])
    if loss_dict['cls_loss'].isnan():
        raise ValueError(f"The HOI loss is NaN for rank {self._rank}")

        pairwise_spatial = compute_spatial_encodings(
            [boxes[x], ], [boxes[y], ], [image_sizes[i], ]
        )
        pairwise_spatial = torch.nan_to_num(pairwise_spatial)  # added per the GitHub comment thread
        pairwise_spatial = self.spatial_head(pairwise_spatial)
        pairwise_spatial_reshaped = pairwise_spatial.reshape(n, n, -1)

However, there were cases where the loss was 0 during training, like the following:

Namespace(alpha=0.5, aux_loss=True, backbone='swin_large', batch_size=16, bbox_loss_coef=5, box_score_thresh=0.05, cache=False, clip_max_norm=0.1, cls_loss_coef=2, data_root='./hicodet', dataset='hicodet', dec_layers=6, dec_n_points=4, detector='advanced', device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, drop_path_rate=0.5, dropout=0.0, enc_layers=6, enc_n_points=4, epochs=30, eval=False, focal_alpha=0.25, gamma=0.1, giou_loss_coef=2, hidden_dim=256, kv_src='C5', look_forward_twice=True, lr_backbone=0.0, lr_drop=20, lr_drop_factor=0.2, lr_head=0.0001, mask_loss_coef=1, masks=False, max_instances=15, min_instances=3, mixed_selection=True, nheads=8, num_feature_levels=4, num_queries_one2many=1500, num_queries_one2one=900, num_workers=2, output_dir='outputs/pvic-h-defm-detr-swinL-hicodet', partitions=['train2015', 'test2015'], port='1234', position_embedding='sine', position_embedding_scale=6.283185307179586, pretrained='checkpoints/h-defm-detr-swinL-dp0-mqs-lft-iter-2stg-hicodet.pth', pretrained_backbone_path=None, print_interval=100, raw_lambda=1.7, repr_dim=384, resume='', sanity=False, seed=140, set_cost_bbox=5, set_cost_class=2, set_cost_giou=2, topk=100, triplet_dec_layers=2, triplet_enc_layers=1, two_stage=True, use_checkpoint=True, use_wandb=False, weight_decay=0.0001, with_box_refine=True, world_size=2)
topk for eval: 100
Rank 1: Load weights for the object detector from checkpoints/h-defm-detr-swinL-dp0-mqs-lft-iter-2stg-hicodet.pth
topk for eval: 100
Rank 0: Load weights for the object detector from checkpoints/h-defm-detr-swinL-dp0-mqs-lft-iter-2stg-hicodet.pth
=> Rank 0: PViC randomly initialised.
=> Rank 1: PViC randomly initialised.
Epoch 0 => mAP: 0.0000, rare: 0.0000, none-rare: 0.0000.
Epoch [1/30], Iter. [0100/2352], Loss: inf, Time[Data/Iter.]: [3.37s/148.60s]
Epoch [1/30], Iter. [0200/2352], Loss: 0.0000, Time[Data/Iter.]: [0.09s/149.54s]
Epoch [1/30], Iter. [0300/2352], Loss: 0.0000, Time[Data/Iter.]: [0.09s/147.43s]

The mAP at Epoch 0 is 0 here because I commented out the relevant test code. Thank you very much for your valuable advice, @fredzzhang.
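
For reference, the loss-level guard shown in the first snippet of this comment can be reproduced in isolation. The sketch below assumes only that the loss is a scalar tensor; the variable names are illustrative. It shows that once `torch.nan_to_num` has run, a NaN loss has become 0.0, so the second `isnan()` check can never trigger. That would be consistent with the `Loss: 0.0000` entries in the log, although the log alone does not confirm it.

```python
import torch

# Stand-in for loss_dict['cls_loss'] when the forward pass produced NaN
loss = torch.tensor(float("nan"))

# Same order of operations as the snippet above: sanitise first, check afterwards
if loss.isnan():
    loss = torch.nan_to_num(loss)  # NaN is replaced by 0.0 by default

# The later NaN check is now unreachable: the value is already finite
still_nan = bool(loss.isnan())

# By default, +inf is replaced by the largest finite value of the dtype
patched_inf = torch.nan_to_num(torch.tensor(float("inf")))
inf_is_finite = bool(torch.isfinite(patched_inf))

print(loss.item(), still_nan, inf_is_finite)
```

Patching the loss this way hides the NaN from the check without removing whatever produced it further upstream.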

caochong12 commented 1 month ago

Hello sir, I would like to know what modifications I should make to get my training to work properly. Thank you for taking the time to answer my questions despite your busy schedule. @fredzzhang

fredzzhang commented 1 month ago

Hi @caochong12,

You need to filter out the NaN values a bit earlier in the model. Refer to the post.

Fred.
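
To illustrate why the position of the filter matters, here is a small sketch under stated assumptions: `head` is a stand-in linear layer with made-up sizes, not one of PViC's modules, and the exact location to patch in the real model is the one described in the referenced post, which is not reproduced here. A single NaN in the input contaminates every output of the layer, whereas sanitising the input before the layer keeps both the output and the weight gradients finite.

```python
import torch

torch.manual_seed(0)

# Stand-in for a feature head; the sizes are arbitrary and purely illustrative
head = torch.nn.Linear(4, 2)

# One NaN among otherwise valid features
features = torch.tensor([[0.5, float("nan"), 1.0, 2.0]])

# Unfiltered: the NaN propagates to every output unit of the layer
dirty_out = head(features)

# Filtered before the layer: the forward pass stays finite
clean_out = head(torch.nan_to_num(features))

# Gradients computed from the clean output are finite as well
clean_out.sum().backward()
grad_finite = bool(torch.isfinite(head.weight.grad).all())

print(dirty_out.isnan().all().item(), torch.isfinite(clean_out).all().item(), grad_finite)
```

Replacing NaN only at the loss leaves the contaminated intermediate activations in place, which is why a filter at that point does not restore normal training.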

caochong12 commented 1 month ago

Thank you very much for your reply. I will try to modify the code the way you described. Thank you again for your work and suggestions, @fredzzhang.