MCG-NJU / TRACE

[ICCV 2021] Target Adaptive Context Aggregation for Video Scene Graph Generation

Performance very different to Action Genome baselines #6

Open zyong812 opened 2 years ago

zyong812 commented 2 years ago

Thanks for sharing the nice work!

But I find the performance reported in the paper is very different from the methods in "Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs" and "Detecting Human-Object Relationships in Videos". What causes this?

tyshiwo1 commented 2 years ago

Actually, when we began this project, we could not reproduce the performance reported in "Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs". Our model consistently outperformed theirs by a large margin.

This is not an isolated case. You may refer to the paper "Spatial-Temporal Transformer for Dynamic Scene Graph Generation", which reports a similar phenomenon.

As for "Detecting Human-Object Relationships in Videos", I haven't read it yet. I'll reply if I find some clues.

zyong812 commented 2 years ago

OK, thanks for replying.

tyshiwo1 commented 2 years ago

I can think of one possible cause. AG is actually an HOI dataset, where the subject of every relationship is a person. However, the standard SGG metrics such as PredCls and SGCls enumerate all possible object pairs (e.g. <shoe, bed> in a scene containing a person, a shoe, and a bed).

Our setting follows the RelDN repo for the AG dataset, which restricts the subject to be the person; we do not apply this restriction for VidVRD.
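To make the difference concrete, here is a minimal sketch (names and the three-object scene are illustrative, not taken from the TRACE codebase) of how candidate pair enumeration differs between standard SGG evaluation and the person-subject HOI-style setting:

```python
# Illustrative sketch: candidate-pair enumeration in SGG vs. HOI-style evaluation.
# The object list and variable names are hypothetical examples.
from itertools import permutations

objects = ["person", "shoe", "bed"]

# Standard SGG metrics (PredCls/SGCls): enumerate all ordered object pairs,
# so non-human pairs such as <shoe, bed> are also candidates.
sgg_pairs = list(permutations(objects, 2))

# HOI-style setting (subject restricted to the person, as described above):
# only pairs whose subject is "person" remain.
hoi_pairs = [(s, o) for s, o in sgg_pairs if s == "person"]

print(sgg_pairs)  # 6 ordered pairs, including ('shoe', 'bed')
print(hoi_pairs)  # 2 pairs: ('person', 'shoe') and ('person', 'bed')
```

With fewer (and easier) candidate pairs, recall-style metrics computed under the person-subject restriction are not directly comparable to those computed over all pairs, which may explain part of the gap.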