Closed: Howardkhh closed this issue 3 weeks ago
There are three papers linked to this repo, so it is hard to know which one you are referring to. But I will still try to answer your question.
In this repo, what 2D methods call "detection" is done using other approaches. In the work proposed by Wald 2020, the detection uses the ground truth segmentation, and the scene graph estimation predicts the class of each object and the type of connection between each pair of objects. My two methods both involve estimating the segmentation of an object, which may split it into several segments. For example, an incremental segmentation method is used in SceneGraphFusion. That is why the sentence "Moreover, since different segmentation methods may result in different numbers of segments, we map all predictions on estimated segmentation back to ground truth." exists. My latest work uses bounding box overlap, similar to 2D methods, to determine whether an object is correctly detected. However, over-segmentation still occurs there, so the sentence applies to it as well.
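In case it helps, here is a minimal sketch of what mapping predictions on an estimated segmentation back to the ground truth could look like (the data layout, function names, and majority-vote rule are illustrative, not the actual repo code): each estimated segment is assigned to the ground-truth instance it overlaps the most, and the per-instance class is taken by majority vote over its segments.

```python
from collections import Counter

def map_predictions_to_gt(seg_to_gt_overlap, seg_predictions):
    """Map per-segment class predictions back to ground-truth instances.

    seg_to_gt_overlap: {seg_id: {gt_id: n_overlapping_points}}
    seg_predictions:   {seg_id: predicted_class}
    Returns {gt_id: predicted_class}, chosen by majority vote over the
    segments assigned to each ground-truth instance.
    """
    votes = {}  # gt_id -> Counter over predicted classes
    for seg_id, overlaps in seg_to_gt_overlap.items():
        if not overlaps:
            continue  # segment touches no ground-truth instance
        # assign the segment to the GT instance it overlaps the most
        gt_id = max(overlaps, key=overlaps.get)
        votes.setdefault(gt_id, Counter())[seg_predictions[seg_id]] += 1
    return {gt_id: c.most_common(1)[0][0] for gt_id, c in votes.items()}

# toy example: two segments over-segmenting the same chair
overlap = {0: {"chair1": 900}, 1: {"chair1": 300, "table1": 10}}
preds = {0: "chair", 1: "chair"}
print(map_predictions_to_gt(overlap, preds))  # {'chair1': 'chair'}
```

This way, methods producing different numbers of segments can all be scored against the same set of ground-truth objects.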
Since we use the recall metric, the evaluation measures how many objects and their predicates can be detected and predicted correctly. I can't remember whether I also reported precision in the paper, but you can compute that value using this repo as well.
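To make the recall/precision distinction concrete, here is a small sketch over (subject, predicate, object) triplets (the triplet representation is illustrative; it assumes predictions have already been mapped to ground-truth instance ids as described above):

```python
def recall_and_precision(gt_triplets, pred_triplets):
    """Recall and precision over (subject, predicate, object) triplets.

    gt_triplets, pred_triplets: sets of hashable triplets whose entities
    are already expressed in ground-truth instance ids.
    """
    hits = len(gt_triplets & pred_triplets)  # correctly predicted triplets
    recall = hits / len(gt_triplets) if gt_triplets else 0.0
    precision = hits / len(pred_triplets) if pred_triplets else 0.0
    return recall, precision

gt = {("chair", "standing on", "floor"), ("table", "near", "chair")}
pred = {("chair", "standing on", "floor"), ("chair", "near", "wall")}
print(recall_and_precision(gt, pred))  # (0.5, 0.5)
```

Recall asks "how many ground-truth triplets did we find?", while precision asks "how many of our predictions were right?", so both can be computed from the same matched sets.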
Thank you for your response, and I apologize for the confusion. The paper I was referring to is "Incremental 3D Semantic Scene Graph Prediction from RGB Sequences," which I will refer to as "JointSSG" later on.
Based on your explanation, does the sentence "Moreover, since different segmentation methods may result in different numbers of segments, we map all predictions on estimated segmentation back to ground truth" correspond to the statement "In the case of multiple segments corresponding to the same object instance, we add the same part relationship between all of them, as shown in Fig. 4." from Section 5 (Data Generation) of the SceneGraphFusion paper?
Additionally, is the discrepancy in performance reported between SceneGraphFusion and JointSSG due to the different methods used to match estimated segments with ground truth objects? In Section 5 (Data Generation) of the SceneGraphFusion paper, the area of intersection between estimated segments and ground truth objects is used as the matching criterion for detection. However, as you mentioned, JointSSG uses bounding box overlap for this purpose.
SceneGraphFusion:
JointSSG:
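For concreteness, this is the kind of bounding-box overlap criterion I mean (a minimal sketch assuming axis-aligned boxes; the actual matching code may use oriented boxes or a different overlap score):

```python
def volume(box):
    """Volume of an axis-aligned box given as ((xmin, ymin, zmin), (xmax, ymax, zmax))."""
    (x0, y0, z0), (x1, y1, z1) = box
    return (x1 - x0) * (y1 - y0) * (z1 - z0)

def aabb_iou(a, b):
    """IoU of two axis-aligned 3D bounding boxes."""
    inter = 1.0
    for d in range(3):
        # overlap length along dimension d; no overlap means IoU is 0
        side = min(a[1][d], b[1][d]) - max(a[0][d], b[0][d])
        if side <= 0:
            return 0.0
        inter *= side
    return inter / (volume(a) + volume(b) - inter)

box_a = ((0, 0, 0), (1, 1, 1))
box_b = ((0.5, 0, 0), (1.5, 1, 1))
print(aabb_iou(box_a, box_a))            # 1.0
print(round(aabb_iou(box_a, box_b), 3))  # 0.333
```

Under the usual 2D convention, a predicted box would count as a detection when its IoU with a ground-truth box exceeds some threshold such as 0.5.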
Those two sentences do not describe the same thing. "Moreover, since different segmentation methods may result in different numbers of segments, we map all predictions on estimated segmentation back to ground truth" is about how we evaluate methods that use different segmentation approaches: to compare them with each other, we map all predictions back to the ground truth segmentation and evaluate the prediction results there.
"In the case of multiple segments corresponding to the same object instance, we add the same part relationship between all of them, as shown in Fig. 4." is about how we estimate "instances" from our over-segmentation approaches: we predict which segments belong to the "same part".
The above two are applied in different evaluations. The first sentence concerns evaluating the class prediction on the ground truth objects; the second concerns estimating the instance (panoptic) segmentation.
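The instance-estimation step can be sketched with union-find: any two segments linked by a predicted "same part" relation are merged into one instance (the function and variable names are illustrative, not the repo's actual implementation):

```python
def merge_same_part(num_segments, same_part_pairs):
    """Group over-segmented pieces into instances with union-find.

    same_part_pairs: iterable of (seg_a, seg_b) pairs predicted as 'same part'.
    Returns a list of segment-id groups, one group per estimated instance.
    """
    parent = list(range(num_segments))

    def find(x):
        # find the root of x with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in same_part_pairs:
        parent[find(a)] = find(b)  # union the two groups

    groups = {}
    for seg in range(num_segments):
        groups.setdefault(find(seg), []).append(seg)
    return list(groups.values())

# segments 0, 1, 2 are predicted to form one object; segment 3 stands alone
print(merge_same_part(4, [(0, 1), (1, 2)]))  # [[0, 1, 2], [3]]
```

Each resulting group is then treated as one object instance when computing the instance (panoptic) segmentation metrics.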
The inconsistency in the model performance mainly comes from the training data. For 3DSSG, the numbers are close since it is trained on the ground truth data. SGFN, however, is trained on a dense segmentation method that is not guaranteed to produce exactly the same segmentation.
I get it now. Thank you very much for your support!
Thank you for the great work! I have some questions regarding the evaluation metrics used in the paper and the code. In Table 1 of the paper, three recall metrics are reported: Relationship, Object, and Predicate. I want to confirm whether my understanding of their definitions is correct.
Moreover, I am curious about the definition of localized objects. For example, in 2D scene graph generation, an object is localized (or detected) if its IoU with a ground truth object is greater than 0.5. Did you also use the IoU of oriented bounding boxes, or the Average Overlap Score?
The other question regards the "Evaluation Metric" paragraph in Section 4.2 of the paper. I want to know the meaning of the sentence: "Moreover, since different segmentation methods may result in different numbers of segments, we map all predictions on estimated segmentation back to ground truth." Can you point out the part in the code that maps all predictions on estimated segmentation back to ground truth? Thank you very much.