ASMIftekhar / VSGNet

VSGNet: Spatial Attention Network for Detecting Human Object Interactions Using Graph Convolutions.
MIT License

Why slice out the first object feature in the Graph Convolutional Branch? Thanks for any helpful reply #6

Closed truetone2022 closed 4 years ago

truetone2022 commented 4 years ago

This operation in script_hico/model.py, inside `for batch_num, l in enumerate(pairs_info):`, does the slicing:

        # (N, 1024)
        people_this_batch = people_t[start_p:start_p + int(l[0])]
        # N (a scalar, not a tensor)
        no_peo = len(people_this_batch)
        # (M-1, 1024): the [1:] drops the first object feature
        objects_this_batch = objects_only[start_o:start_o + int(l[1])][1:]
        # (1024,): the first object feature itself
        no_objects_this_batch = objects_only[start_o:start_o + int(l[1])][0]
        # M-1 (a scalar)
        no_obj = len(objects_this_batch)
        # (N*M, 1)
        interaction_prob_this_batch = interaction_prob[start_c:start_c + int(l[1]) * int(l[0])]
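For readers following along, here is a minimal standalone sketch (hypothetical shapes, NumPy in place of PyTorch tensors) of what that `[1:]` slice does to the object features:

```python
import numpy as np

# Hypothetical mini-batch: one image with N=2 people and M=3 detected objects.
pairs_info = [(2, 3)]
people_t = np.random.rand(2, 1024)           # person features
objects_only = np.random.rand(3, 1024)       # object features
interaction_prob = np.random.rand(2 * 3, 1)  # one score per person-object pair

start_p = start_o = start_c = 0
l = pairs_info[0]

people_this_batch = people_t[start_p:start_p + int(l[0])]
# The [1:] drops the first object feature entirely:
objects_this_batch = objects_only[start_o:start_o + int(l[1])][1:]

print(people_this_batch.shape)   # (2, 1024)
print(objects_this_batch.shape)  # (2, 1024) -- only M-1 = 2 objects survive

# Yet the interaction scores still cover all N*M = 6 pairs:
print(interaction_prob.shape)    # (6, 1)
```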
ASMIftekhar commented 4 years ago

Hello, thanks for this excellent catch. For the V-COCO dataset the first object is always [0,0,0,0], which represents the no-object case; V-COCO requires the no-object case to be detected as [0,0,0,0] (For details). There is no point in including the no-object case in the graph structure, which is why the slicing is there. The HICO-DET dataset does not have this restriction, so no slicing is needed there. While cleaning up the repo I forgot to notice this part of the code; it forces the network to ignore the first object in the graph structure. I pushed a quick fix for the issue and will clean it up properly when I get the chance.
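Not the pushed patch itself, but a small sketch of what the fix amounts to (hypothetical variable names, NumPy stand-ins for tensors):

```python
import numpy as np

objects_only = np.random.rand(3, 1024)  # M=3 object features for one image
start_o, num_objects = 0, 3

# V-COCO: index 0 is the [0,0,0,0] no-object placeholder, so it is skipped.
vcoco_objects = objects_only[start_o:start_o + num_objects][1:]  # (M-1, 1024)

# HICO-DET has no such placeholder, so all M objects belong in the graph:
hico_objects = objects_only[start_o:start_o + num_objects]      # (M, 1024)

assert vcoco_objects.shape == (2, 1024)
assert hico_objects.shape == (3, 1024)
```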

ASMIftekhar commented 4 years ago

Just to add: intuitively, with this fix, retraining the network should give slightly better results than those reported in the paper. I just ran inference with the model reported in the paper and got similar results: Full 19.79, Rare 16.19, Non-Rare 20.87. I will report the new numbers once I retrain the whole network with the bug fixed.

truetone2022 commented 4 years ago

Thanks for your very helpful reply! After reading your great work, I have a few questions about VSGNet:

  1. Why not use pose information to improve the performance of the model? Is pose information really helpful for the HOI task?
  2. Are more complicated message-passing models like TreeLSTM or Gated GCN helpful for the HOI task?
ASMIftekhar commented 4 years ago

Thanks for your interest in our work.
1) We tried adding pose estimation as a 3rd channel in our spatial map, but the small improvement in the results did not justify the large overhead. Other works have used pose in different ways, though. Personally, I have some reservations about adding more computationally expensive backends: most existing methods already run an object detector offline, and adding pose makes the pipeline even more complicated. But yes, pose information should help on the HOI task.
2) I am not familiar with TreeLSTM, but a Gated GCN can certainly help.
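Regarding point 1, a hedged sketch of the pose-as-third-channel idea, assuming a 64x64 two-channel spatial map (binary person and object box masks) and normalized keypoint coordinates; the helper names are hypothetical, not the authors' code:

```python
import numpy as np

S = 64  # spatial map resolution (assumed)

def box_channel(box, size=S):
    """Rasterize a normalized [x1, y1, x2, y2] box into a binary mask."""
    m = np.zeros((size, size), dtype=np.float32)
    x1, y1, x2, y2 = (np.array(box) * size).astype(int)
    m[y1:y2, x1:x2] = 1.0
    return m

def pose_channel(keypoints, size=S):
    """Mark each normalized (x, y) keypoint on an otherwise empty mask."""
    m = np.zeros((size, size), dtype=np.float32)
    for x, y in keypoints:
        m[min(int(y * size), size - 1), min(int(x * size), size - 1)] = 1.0
    return m

person_box = [0.1, 0.1, 0.5, 0.9]
object_box = [0.4, 0.3, 0.8, 0.7]
keypoints = [(0.3, 0.2), (0.25, 0.5), (0.35, 0.5)]  # e.g. head, two hands

# Stack person mask, object mask, and pose mask into a 3-channel map:
spatial_map = np.stack([box_channel(person_box),
                        box_channel(object_box),
                        pose_channel(keypoints)])  # (3, 64, 64)
```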

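For point 2, a toy NumPy sketch of one gated message-passing step, a simplification of a Gated GCN in which each neighbor message is scaled element-wise by a sigmoid gate (all weight matrices and sizes hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
N, D = 4, 8                      # 4 nodes, feature dim 8 (toy sizes)
H = rng.standard_normal((N, D))  # node features
A = np.ones((N, N)) - np.eye(N)  # fully connected graph, no self-loops

W_msg = rng.standard_normal((D, D)) * 0.1   # message transform (hypothetical)
W_gate = rng.standard_normal((D, D)) * 0.1  # gate transform (hypothetical)

# Gated aggregation: each neighbor's message is modulated element-wise by a
# sigmoid gate computed from that neighbor's features, then mean-aggregated.
messages = H @ W_msg         # (N, D)
gates = sigmoid(H @ W_gate)  # (N, D), values in (0, 1)
H_new = A @ (gates * messages) / A.sum(1, keepdims=True)

print(H_new.shape)  # (4, 8)
```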
Chuckie-He commented 2 years ago

@ASMIftekhar Hello, I have recently been running some experiments on your project. Which model did you use to extract the pose information you mentioned? Please help, thank you.

ASMIftekhar commented 2 years ago

AlphaPose: https://github.com/MVIG-SJTU/AlphaPose