[Question] SGDet vs SGCls (VG)

sharifza commented 4 years ago

I have a question. I don't understand why, on Visual Genome, SGDet shows such a small improvement over Neural Motifs whereas SGCls shows a much larger one. Isn't the only difference in the region proposal network?

sharifza commented 4 years ago

Now I understand that your reported numbers are in fact not comparable to Neural Motifs. I consider this some sort of [unintended?] mistake in reporting the results.

In NM (and most of the previous works), SGCls is defined as a setting where bounding boxes are given but edges are not, and we evaluate the quality of "detected" and "classified" edges. In your work, you have changed the definition of SGCls to a setting where both bounding boxes and edges are given, and the goal is only to evaluate the quality of "classifying" edges. While I understand your motivation behind this change (given the name "Scene Graph Classification"), putting both under the same title in the table will mislead the community.

bknyaz commented 4 years ago

@sharifza if you could share the code fixing the evaluation of the models in this repo, it would be great! I still see they rank triplets here https://github.com/NVIDIA/ContrastiveLosses4VRD/blob/master/lib/datasets_rel/task_evaluation_vg_and_vrd.py#L84, so I'm not sure where exactly their evaluation goes wrong.

sharifza commented 3 years ago

@bknyaz I avoided using this repository for my research. No one responded to my complaint for a year. The evaluation issue mentioned above affects the heart of this paper's contribution and calls the validity of its results into question. There are other repositories that I recommend you take a look at: Neural Motifs [PyTorch 0.3], Depth-VRD (Neural Motifs [PyTorch > 1.0]), and the recent benchmark by @kaihuatang. Kaihua also pointed out this issue here (Two Common Misunderstandings in SGG Metrics).

sigeek commented 3 years ago

The main problem is that the evaluation for VRD and VG is done in the same file even though the metrics are slightly different. The metrics used in VRD are the following:

predicate detection (PredDet): predicate prediction given a pair of localized objects (both bounding boxes and labels);
phrase detection (PhrDet): locate the phrase (subject, predicate, object) in the image with a unique bounding box;
relationship detection (RelDet): define the triplets (subject, predicate, object) with a pair of bounding boxes.

The metrics used in VG are:

predicate classification (PredCls): predict the relationships (edges) among object pairs given a set of ground-truth bounding boxes and labels;
phrase classification (PhrCls) or scene graph classification (SGCls): predict the triplets (subject, predicate, object) (edges and labels) given a set of localized objects;
scene graph generation (SGGen) or scene graph detection (SGDet): predict the bounding boxes and the triplets in the image; an object is considered correct if it has at least 0.5 IoU overlap with the ground-truth bounding box.
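
As a rough illustration of the 0.5 IoU criterion used in SGDet, here is a minimal sketch. The function names `iou` and `is_box_match` are hypothetical and do not come from this repository, and boxes are assumed to be `(x1, y1, x2, y2)` tuples in continuous coordinates (some codebases add 1 to widths and heights for pixel-inclusive boxes).

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_box_match(pred_box, gt_box, thresh=0.5):
    """A predicted box counts as correct if IoU with the GT box is at least thresh."""
    return iou(pred_box, gt_box) >= thresh
```

In SGDet a predicted triplet would only count as a hit if both its subject and object boxes pass this check against the matching ground-truth boxes, in addition to the labels being correct.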

In PredDet, the (subject, object) pairs are given, as pointed out in this issue, whereas in PredCls and SGCls they are not. This is the problem with this implementation.
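
To see why being given the pairs matters, here is a toy sketch of Recall@K computed two ways: ranking predictions over all candidate pairs, and ranking only over the ground-truth pairs. This is not the evaluation code from this repository or from Neural Motifs; `recall_at_k`, `demo_scores`, and `demo_gt` are made-up names, and the scoring is simplified (real evaluators typically also multiply in object scores).

```python
def recall_at_k(scores, gt_triplets, k, candidate_pairs=None):
    """Fraction of GT triplets found among the top-k ranked predictions.

    scores: {(subj, obj): {predicate: score}} over object indices.
    candidate_pairs: if given, only these (subj, obj) pairs are ranked,
    which mimics a setting where the pairs are provided to the model.
    """
    ranked = []
    for (subj, obj), pred_scores in scores.items():
        if candidate_pairs is not None and (subj, obj) not in candidate_pairs:
            continue
        for predicate, score in pred_scores.items():
            ranked.append((score, (subj, predicate, obj)))
    ranked.sort(key=lambda item: item[0], reverse=True)
    top_k = {triplet for _, triplet in ranked[:k]}
    hits = sum(1 for triplet in gt_triplets if triplet in top_k)
    return hits / len(gt_triplets)


# Toy image with three objects (0, 1, 2) and a single annotated relation.
demo_gt = {(0, "on", 1)}
demo_scores = {
    (0, 1): {"on": 0.6},
    (1, 0): {"under": 0.9},  # plausible but unannotated, outranks the GT edge
    (0, 2): {"near": 0.8},
    (2, 0): {"near": 0.7},
}

recall_all_pairs = recall_at_k(demo_scores, demo_gt, k=1)
recall_gt_pairs = recall_at_k(demo_scores, demo_gt, k=1, candidate_pairs={(0, 1)})
```

In this toy case the same predictions give a recall of 0 when all pairs compete for the top-K slots and a recall of 1 when only the given pair is ranked, so numbers from the two settings are not comparable.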

Hope this helps! 👍

NVIDIA / ContrastiveLosses4VRD
