bowen-upenn / scene_graph_commonsense

This is the official implementation of the paper "Enhancing Scene Graph Generation with Hierarchical Relationships and Commonsense Knowledge" in PyTorch.
https://arxiv.org/abs/2311.12889

A few questions about the approach #2

Closed: Maelic closed this issue 8 months ago

Maelic commented 8 months ago

Dear authors,

Thank you for sharing this amazing work. After reading your paper, I have a few questions regarding the approach:

  1. How did you classify the different relations into the 3 categories (i.e. geometric, semantic, and possessive), taking into account the polysemic usage of predicates (such as "on", which could be used in all 3 categories)?
  2. Did you test the alignment of GPT3.5's commonsense validation with human knowledge (i.e. asking annotators to validate the predictions of GPT3.5 on a subset of the data)?
  3. What is the impact of adding the depth maps on the 3 categories (i.e. it would make sense that the geometric category is more impacted by it)?
  4. From my experiments, the VG dataset is also highly unbalanced across relation categories (i.e. it has a vast majority of geometric annotations); does this impact the training of the Bayesian heads?

I know these are a lot of complex questions, but I'm looking forward to opening a discussion on any of them :)

Best

bowen-upenn commented 8 months ago

Hi Maelic,

Thank you for your interest in our work!

1. How did you classify the different relations into the 3 categories (i.e. geometric, semantic, and possessive), taking into account the polysemic usage of predicates (such as "on", which could be used in all 3 categories)?

There are two ways we can classify relations. First, we follow the definitions provided in Neural Motifs to categorize them into geometric, possessive, and semantic relations. We ensure that parent classes are mutually exclusive, so "on" is only included in the geometric relations.

In Visual Genome:

Geometric relations

```
0: 'above', 1: 'across', 2: 'against', 3: 'along', 4: 'and', 5: 'at', 6: 'behind', 7: 'between', 8: 'in', 9: 'in front of', 10: 'near', 11: 'on', 12: 'on back of', 13: 'over', 14: 'under'
```

Possessive relations

```
15: 'belonging to', 16: 'for', 17: 'from', 18: 'has', 19: 'made of', 20: 'of', 21: 'part of', 22: 'to', 23: 'wearing'
```

Semantic relations

```
24: 'wears', 25: 'with', 26: 'attached to', 27: 'carrying', 28: 'covered in', 29: 'covering', 30: 'eating', 31: 'flying in', 32: 'growing on', 33: 'hanging from', 34: 'holding', 35: 'laying on', 36: 'looking at', 37: 'lying on', 38: 'mounted on', 39: 'painted on', 40: 'parked on', 41: 'playing', 42: 'riding', 43: 'says', 44: 'sitting on', 45: 'standing on', 46: 'using', 47: 'walking in', 48: 'walking on', 49: 'watching'
```
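
For reference, here is a minimal sketch of how this manual grouping could be represented in code; the variable names below are illustrative placeholders, not necessarily the ones used in this repository.

```python
# Illustrative mapping from super-category to predicate class indices,
# following the Neural-Motifs-style grouping listed above.
SUPER_CATEGORIES = {
    "geometric":  list(range(0, 15)),   # 'above' ... 'under'
    "possessive": list(range(15, 24)),  # 'belonging to' ... 'wearing'
    "semantic":   list(range(24, 50)),  # 'wears' ... 'watching'
}

# Inverse lookup: predicate index -> super-category name.
PREDICATE_TO_SUPCAT = {
    idx: supcat for supcat, indices in SUPER_CATEGORIES.items() for idx in indices
}

assert PREDICATE_TO_SUPCAT[11] == "geometric"  # 'on' belongs only to the geometric head
```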

Secondly, we can adjust the framework to use unsupervised clustering, such as k-means on pretrained embedding spaces, to construct the relation hierarchy. To switch between these approaches, set the dataset.supcat_clustering parameter in configs.yaml to motif, gpt2, bert, or clip, where motif refers to the manual grouping described above. Table 1 below shows results using the GPT-2, BERT, and CLIP text embedding spaces. CLIP performs comparably to the manual grouping with even higher mR@k scores, so we suggest that CLIP could be used to generalize the relation hierarchy to other datasets you might be working with, without the need for manual clustering.

| Methods | R@20 | R@50 | R@100 | mR@20 | mR@50 | mR@100 |
|---|---|---|---|---|---|---|
| Manual | 61.1 | 73.6 | 78.1 | 14.4 | 20.6 | 23.7 |
| CLIP-Text | 61.6 | 72.7 | 76.8 | 17.5 | 25.9 | 30.0 |
| GPT-2 | 61.6 | 69.9 | 72.0 | 16.9 | 25.0 | 29.0 |
| BERT | 61.5 | 69.7 | 72.5 | 16.2 | 23.0 | 27.1 |

Note that this table compares different clustering spaces for the relation hierarchy for illustration purposes only; the reported performance does not involve the commonsense validation pipeline. We will update our paper to include this study soon.
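
For illustration, here is a minimal sketch of clustering predicate names in the CLIP text-embedding space with k-means, assuming the openai/CLIP package and scikit-learn are installed; the actual prompt template and clustering settings behind the clip option of dataset.supcat_clustering may differ.

```python
import clip
import torch
from sklearn.cluster import KMeans

# A few predicate names for illustration; in practice, all 50 relation classes are used.
predicates = ["above", "behind", "on", "wearing", "has", "holding", "riding", "eating"]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Encode each predicate name with the CLIP text encoder and L2-normalize.
tokens = clip.tokenize(predicates).to(device)
with torch.no_grad():
    embeddings = model.encode_text(tokens).float()
embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)

# Cluster the embeddings into three super-categories.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings.cpu().numpy())
for predicate, label in zip(predicates, labels):
    print(f"{predicate}: cluster {label}")
```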

2. Did you test the alignment of GPT3.5's commonsense validation with human knowledge (i.e. asking annotators to validate the predictions of GPT3.5 on a subset of the data)?

During an earlier stage of our experiments, we found that GPT3.5 has a remarkable capability to filter out predictions that violate commonsense in scene graphs, provided the prompts are properly engineered. This is why we decided to use it as the commonsense validator in our paper. In Figure 3 of the paper, all black arrows passed GPT3.5's validation, and many blue arrows (incorrect predictions) from the second row disappear in the third row once GPT3.5 is applied. Unfortunately, as a small academic group, we don't have the resources to recruit human validators, especially at this dataset scale (the model generates far more predicted edges than there are images in the dataset).
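
For illustration, here is a minimal sketch of what such a validation query might look like with the OpenAI Python client; the prompt and post-processing here are simplified placeholders, not the engineered prompt from the paper.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def violates_commonsense(subject: str, predicate: str, obj: str) -> bool:
    """Ask GPT-3.5 whether a predicted triplet is plausible in everyday scenes.

    The prompt below is a simplified placeholder, not the prompt used in the paper.
    """
    prompt = (
        f"In a typical real-world image, does the relationship "
        f"'{subject} {predicate} {obj}' make sense? Answer with 'yes' or 'no' only."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip().lower()
    return answer.startswith("no")

# Example: 'hair wearing hat' would likely be flagged, while 'person wearing hat' is kept.
```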

3. What is the impact of adding the depth maps on the 3 categories (i.e. it would make sense that the geometric category is more impacted by it)?

Yes, we agree that depth maps should help with the prediction, particularly for understanding geometric relationships. There is an ablation study on the depth maps in the paper; see Table 4, row "ours w/o [d]", for the results. We have copied this sub-table here for your convenience.

| Ablation | R@20 | R@50 | R@100 | mR@20 | mR@50 | mR@100 |
|---|---|---|---|---|---|---|
| ours (final) | 64.2 | 75.5 | 79.1 | 17.5 | 23.9 | 26.6 |
| ours w/o depth maps | 62.5 | 74.2 | 78.5 | 15.5 | 21.6 | 24.1 |

4. From my experiments, the VG dataset is also highly unbalanced across relation categories (i.e. it has a vast majority of geometric annotations); does this impact the training of the Bayesian heads?

The annotations in Visual Genome are known to contain a significant amount of human bias. That is why the top-one predictions from the other two heads (possessive and semantic) in the Bayesian head can help construct a more comprehensive scene graph, adding information from different perspectives.

The training of the Bayesian head involves separate cross-entropy losses, one per head. Following the index order mentioned above, the training set contains the following relation counts.

```python
relation_count = [ 47342,   1996,   3092,   3624,   3477,   9903,  41363,   3411, 251756,
                   13715,  96589, 712432,   1914,   9317,  22596,   3288,   9145,   2945,
                  277943,   2312, 146339,   2065,   2517, 136099,  15457,  66425,  10191,
                    5213,   2312,   3806,   4688,   1973,   1853,   9894,  42722,   3739,
                    3083,   1869,   2253,   3095,   2721,   3810,   8856,   2241,  18643,
                   14185,   1925,   1740,   4613,   3490]
```

In our train_test.py, we calculate the weights of the cross-entropy losses as follows to mitigate the class imbalance problem.

```python
class_weight = 1 - relation_count / torch.sum(relation_count)
```
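
To make the connection to the per-head losses concrete, here is a minimal sketch of how these counts can be turned into weighted cross-entropy losses, one per super-category. The split of the weight vector across the three heads is an assumption based on the index order listed earlier in this thread, and the variable names may not match train_test.py exactly.

```python
import torch
import torch.nn as nn

# relation_count: the 50-element list shown above, converted to a float tensor.
relation_count = torch.tensor(relation_count, dtype=torch.float)

# Frequent classes such as 'on' (index 11) receive slightly smaller weights.
class_weight = 1 - relation_count / torch.sum(relation_count)

# Illustrative split of the weight vector across the three heads:
# geometric 0-14, possessive 15-23, semantic 24-49.
geometric_criterion  = nn.CrossEntropyLoss(weight=class_weight[0:15])
possessive_criterion = nn.CrossEntropyLoss(weight=class_weight[15:24])
semantic_criterion   = nn.CrossEntropyLoss(weight=class_weight[24:50])
```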

To further reduce the class imbalance problem, there are two directions ahead. First, we have combined our method with existing works that specialize in solving the long-tailed problem in scene graphs; see scenegraph_benchmark. Experimental results show that we can raise their SOTA mR@k scores, further reducing the long-tailed problem, while simultaneously achieving even higher R@k scores. We hope such an integration helps with the unbalanced annotations, as reflected in the mR@k metrics. We will emphasize this table in a revised version of our paper.

| Methods | R@20 | R@50 | R@100 | mR@20 | mR@50 | mR@100 |
|---|---|---|---|---|---|---|
| Motifs+TDE | 33.6 | 46.2 | 51.4 | 18.5 | 25.5 | 29.1 |
| Motifs+TDE+Ours | 39.7 | 56.9 | 66.7 | 20.1 | 28.8 | 34.9 |
| VCTree+TDE | 36.2 | 47.2 | 51.6 | 18.4 | 25.4 | 28.7 |
| VCTree+TDE+Ours | 39.6 | 56.9 | 66.6 | 19.6 | 28.6 | 35.2 |
| Motifs+NICE | - | 55.1 | 57.2 | - | 29.9 | 32.3 |
| Motifs+NICE+Ours | 43.1 | 58.2 | 65.4 | 22.6 | 33.1 | 39.8 |
| Motifs+IETrans | - | 48.6 | 50.5 | - | 35.8 | 39.1 |
| Motifs+IETrans+Ours | 47.9 | 60.4 | 66.4 | 26.4 | 38.0 | 44.1 |

Secondly, we are working on a zero-shot approach that generates local scene graphs and carries out downstream tasks such as visual question answering by leveraging large vision-language models. This approach aims to eliminate human bias from the training annotations, so please stay tuned for updates!

Maelic commented 8 months ago

Thank you very much for your comprehensive answers!

1. Regarding the distribution of relation categories, my point is that in a noisy dataset like VG it is not very efficient to categorize relations based on the predicate alone, as each predicate has many different usages. For instance, "on" can be used in geometric relations such as "cup on table" but also semantic ones such as "person on laptop". I tried different strategies for this problem, such as mapping every triplet with ConceptNet (see one of my latest works at the SG2RL workshop of ICCV 2023). Lately, I have fine-tuned GPT3.5 to classify entire triplets (not just predicates) in VG into relation types, and here is what I got:

[Figure: per-predicate distribution of relation types in VG, obtained from the fine-tuned GPT3.5 triplet classification]

(here I split the possessive category into part-whole and attribute because the classification by Zellers et al. did not make a lot of sense to me)

In this plot, we can see that around half of the predicates are indeed used in the dataset with one specific meaning, but the other half have mixed interpretations, especially general predicates such as "has", "on", etc., as explained beforehand. This is why I am currently working on a new approach that learns the mapping of the entire triplet to a specific category, instead of a hard-coded classification, which in my opinion limits the ability of models to comprehend the complex nature of natural language.
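
As a rough illustration of such a triplet-level query (the model name below is a placeholder, since in practice a fine-tuned checkpoint would be used, and the prompt is simplified):

```python
from openai import OpenAI

client = OpenAI()
CATEGORIES = ["geometric", "semantic", "part-whole", "attribute"]

def classify_triplet(subject: str, predicate: str, obj: str) -> str:
    """Classify a full (subject, predicate, object) triplet, not just its predicate."""
    prompt = (
        f"Classify the relationship '{subject} {predicate} {obj}' into exactly one of "
        f"these categories: {', '.join(CATEGORIES)}. Answer with the category name only."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder for a fine-tuned model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

# "cup on table" should come back as geometric, while "person on laptop" is semantic.
```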

3. I was referring in my question to an ablation study for each category independently, such as:

- ours (geometric): ...
- ours (geometric) w/o depth maps: ...
- ours (semantic): ...
- ours (semantic) w/o depth maps: ...
- ours (possessive): ...
- ours (possessive) w/o depth maps: ...

This would allow comparing the delta between categories; intuitively, the geometric one should benefit more from the addition of the depth maps.

Anyway, I would be more than happy to discuss these issues more in-depth directly! Here is my LinkedIn: www.linkedin.com/in/maëlic-neau-b55a52149 and my email: neau0001@flinders.edu.au.

Best

bowen-upenn commented 8 months ago

This plot is really intriguing and provides more fine-grained insights into these relation types. I enjoyed reading through it. Thanks for sharing it with me!

I find that many works in the scene graph literature are motivated by the strong human bias in the Visual Genome annotations; therefore, an open-ended solution would be an interesting future direction for generating scene graphs, especially task-oriented local scene graphs. I hope to release a preliminary work in the next two weeks.

Regarding the detailed ablation study for each category, I agree with your intuition and agree that it would be helpful to show. I am a bit busy with another project this week, but I will get back to you as soon as I have the results, probably next week.

You can also reach me at bwjiang@seas.upenn.edu, and I am happy to discuss them with you more in-depth!