zyong812 opened this issue 3 years ago
Actually, when we began this project, we could not reproduce the performance reported in "Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs". Our model always outperformed theirs by a large margin.
This is not an isolated case. You may refer to the paper "Spatial-Temporal Transformer for Dynamic Scene Graph Generation", where a similar phenomenon is reported.
As for "detecting human-object relationships in videos", I haven't read it yet. I'll reply if I find some clues.
OK, thanks for replying.
I can think of one possible explanation. AG is actually a human-object interaction (HOI) dataset. However, SGG metrics such as PredCls and SGCls enumerate all possible object pairs (e.g., <shoe, bed> in a scene containing a person, a shoe, and a bed).
In contrast, our setting follows the RelDN repo for the AG dataset, which restricts the subject to be the person; we do not apply this restriction for VidVRD. The sketch below illustrates the difference.
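Here is a minimal, self-contained sketch of what that restriction means for the candidate pairs a metric scores. This is not code from either repo; the object names are hypothetical, taken from the example above.

```python
# Illustrative only: SGG-style metrics (PredCls/SGCls) enumerate every
# ordered object pair as a relation candidate, while an HOI-style setting
# (as in the RelDN repo for AG) keeps only pairs whose subject is the person.
from itertools import permutations

objects = ["person", "shoe", "bed"]

# SGG-style: all ordered pairs, including ones like <shoe, bed>
sgg_pairs = list(permutations(objects, 2))

# HOI-style: only person-subject pairs survive
hoi_pairs = [(s, o) for s, o in sgg_pairs if s == "person"]

print(len(sgg_pairs), sgg_pairs)  # 6 candidates, includes ('shoe', 'bed')
print(len(hoi_pairs), hoi_pairs)  # 2 candidates: ('person', 'shoe'), ('person', 'bed')
```

With fewer (and easier) candidate pairs, recall-style metrics computed under the person-subject setting are not directly comparable to ones computed over all pairs, which could account for part of the gap.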
Thanks for sharing this nice work!
However, I find that the performance presented in the paper differs greatly from that of the methods in "Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs" and "Detecting Human-Object Relationships in Videos". What causes this?