Yuqifan1117 / CaCao

This is the official repository for the paper "Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World" (Accepted by ICCV 2023)

performance reproduction? #7

Closed ZHUXUHAN closed 10 months ago

ZHUXUHAN commented 10 months ago

I just mapped the open-world classes to base and novel classes as:

    novel_classes = [311, 400, 348, 3, 128, 542, 321, 299, 149, 555, 260, 9, 104, 13, 331]
    base_classes = [2, 7, 37, 39, 42, 45, 50, 56, 70, 72, 134, 136, 153, 158, 164, 171, 173, 286, 301, 314, 318, 320, 328, 343, 360, 393, 448, 478, 513, 535, 559, 570, 571, 572, 581]

and trained the baseline model (Motifs) on the VG+CaCao dataset. But the baseline's Recall on the base classes is about 10 points lower than your reported results, and the Recall on the novel classes is 0. Can you provide some suggestions for reproducing the performance?

Yuqifan1117 commented 10 months ago

Directly mapping the predicates and retraining the baseline (Motifs) model in this way requires learning 588 predicate features, which holds back the base classes. To maintain the performance of the base classes, during training we mapped the VG+CaCao dataset onto the 50 target classes (1-50, of which 15 are novel classes) and trained the baseline model on that (base classes plus the extra unseen classes, 50 in total, not 588). (R@50: 0.1748) Besides, check whether you are training incorrectly: if you directly train the Motifs model instead of using contrastive learning, there is no gradient flowing to the novel classes, so their recall is 0. (R@50: 0.1122)
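The remapping described above, from the open-world predicate ids down to a compact 50-class target space, can be sketched roughly as follows. The id lists are the ones posted in this thread; the variable names are illustrative, not the repository's actual code:

```python
# Sketch: remap open-world predicate ids to a compact 1..50 target space.
# Id lists come from this thread; names are illustrative.
novel_classes = [311, 400, 348, 3, 128, 542, 321, 299, 149, 555, 260, 9, 104, 13, 331]
base_classes = [2, 7, 37, 39, 42, 45, 50, 56, 70, 72, 134, 136, 153, 158, 164,
                171, 173, 286, 301, 314, 318, 320, 328, 343, 360, 393, 448, 478,
                513, 535, 559, 570, 571, 572, 581]

all_classes = novel_classes + base_classes            # 15 novel + 35 base = 50
predicate_map = {old_id: new_id + 1                   # reserve 0 for background
                 for new_id, old_id in enumerate(all_classes)}
```

With such a map, every annotated triplet whose predicate falls in the 50 kept ids receives a compact label in 1..50, and everything else is dropped, so the baseline only ever learns 50 predicate classes rather than 588.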

ZHUXUHAN commented 10 months ago

Thank you for your patient answer. Another question: if some unseen classes are introduced during training, can they still be called unseen or novel classes?

Yuqifan1117 commented 10 months ago

Our main purpose is to verify the additional benefit of the extra data from CaCao. We have only narrowed down the scope of the CaCao pseudo-predicates; the ground truth of the novel classes remains invisible (this is our uniform setting).

ZHUXUHAN commented 10 months ago

My understanding is that the labels of these novel-class training samples come from pseudo-labels, not from the ground truth, so they can still be called unseen. Is that right?

Yuqifan1117 commented 10 months ago

Yes (they can still be called unseen), but sampling from pseudo-labels may reveal information about the target categories, so it is probably easier than a strictly unseen setting.

ZHUXUHAN commented 10 months ago

Yes, this is a weaker notion of unseen compared to the traditional open-vocabulary or zero-shot settings. I am running your dataset at the moment and hope the results of this latest experiment are in line with expectations. Thank you for your answer.

ZHUXUHAN commented 10 months ago

I just met another problem: is the test dataset the original VG dataset (with 50 predicates) or your provided VG+CaCao dataset (with the same 50 predicates as VG)?

Yuqifan1117 commented 10 months ago

It should be the latter. Have you solved the problem of performance reproduction?

ZHUXUHAN commented 10 months ago

OK. The training will be completed later today, but it is not finished yet.

The base recall is 11.3, the novel recall is 6.9, and the overall recall is 9.9. Are these results reasonable?

Do you use Motifs' frequency-bias trick?

Also, I find that the number of mapped test images is 2,262. Is that right?

Yuqifan1117 commented 10 months ago

These results might be reasonable (they reflect the improvement brought by CaCao). However, the final performance is affected by the quality of the predicate mapping. Besides, we consider the influence of frequency (but did not use any information from the unseen ground truth), so higher performance can be achieved. Finally, we did not apply additional filtering to the test images; we only divided the predicate categories for testing (which should give more than 2,262 images).
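The "influence of frequency" mentioned here is, in the Motifs line of work, usually an additive log-frequency bias on the predicate logits computed from training-set statistics only. A minimal pure-Python sketch, illustrative rather than the repository's code (Motifs itself implements this as an embedding lookup over object-class pairs):

```python
import math
from collections import Counter

def build_frequency_bias(triplets, num_pred, eps=1e-3):
    """Turn (subject class, object class, predicate) counts from the training
    split into additive log-probability biases, in the spirit of the Motifs
    frequency-bias trick. `eps` smooths unseen combinations."""
    counts, pair_totals = Counter(), Counter()
    for subj, obj, pred in triplets:
        counts[(subj, obj, pred)] += 1
        pair_totals[(subj, obj)] += 1

    def bias(subj, obj, pred):
        num = counts[(subj, obj, pred)] + eps
        den = pair_totals[(subj, obj)] + eps * num_pred
        return math.log(num / den)  # added to the model's predicate logits

    return bias
```

Since only training statistics are counted, no unseen ground truth leaks in, consistent with the reply above.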

ZHUXUHAN commented 10 months ago

> These results might be reasonable (reflect the improved performance of CaCao). However, the final performance is affected by the quality of the mapping predicates. Besides, we consider the influence of frequency (but did not use any information from the unseen ground truth). Therefore, higher performance can be achieved. Finally, we didn't make additional filtering on the test images, only dividing the predicate category for the test (looks like more than 2,262).

Another question: do you only consider the labeled pairs, and ignore the other pairs (roughly N*N pairs for N objects)?

Yuqifan1117 commented 10 months ago

We consider all N*N object pairs, but all the objects are known in PredCls.
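Since PredCls provides the ground-truth boxes and object labels, enumerating the candidate pairs is straightforward. A minimal sketch (self-pairs are typically excluded in practice, giving N*(N-1) ordered pairs rather than the full N*N):

```python
def candidate_pairs(object_ids):
    """All ordered (subject, object) pairs among the N given objects,
    excluding self-pairs: N * (N - 1) candidates for N objects."""
    return [(s, o) for s in object_ids for o in object_ids if s != o]
```

For 3 objects this yields 6 ordered candidate pairs, each of which is then scored against the predicate classes.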

ZHUXUHAN commented 10 months ago

> consider total N*N object pairs, but the total objects are known in PREDCLS.

My reproduced Motifs performance is Top 100: Base 0.1267, Novel 0.0813, All 0.1138, which is worse than your provided baseline. I think my test images may not be right; only 2,262 images remain after filtering, which is a small amount.

Yuqifan1117 commented 10 months ago

You might consider the influence of frequency as well as checking your test images; that may give better results.

ZHUXUHAN commented 10 months ago

> You might consider the influence of frequency, maybe better.

Yes, I already use that trick.

ZHUXUHAN commented 10 months ago

    novel_classes = [311, 400, 348, 3, 128, 542, 321, 299, 149, 555, 260, 9, 104, 13, 331]
    base_classes = [2, 7, 37, 39, 42, 45, 50, 56, 70, 72, 134, 136, 153, 158, 164, 171, 173, 286, 301, 314, 318, 320, 328, 343, 360, 393, 448, 478, 513, 535, 559, 570, 571, 572, 581]
    all_classes = novel_classes + base_classes  # must be built before the map
    all_classes_map = {v: i + 1 for i, v in enumerate(all_classes)}

    for i, relations in enumerate(original_relationships):
        new_relation = []
        for rels in relations:
            if rels[2] in all_classes:
                new_relation.append(np.array([[rels[0], rels[1], all_classes_map[rels[2]]]]))

This is my filtering code. If an image's annotations do not contain any of the wanted classes, the image is filtered out.

Yuqifan1117 commented 10 months ago

In general, test images are not extended and should not contain additional categories of predicates.

ZHUXUHAN commented 10 months ago

> the test dataset is original vg dataset (with 50 predicates) or your provided VG+Cacao dataset (with 50 predicates as vg).

But the VG+CaCao test dataset contains other categories of predicates (not only 50 categories); maybe you mean the original VG test dataset?

Yuqifan1117 commented 10 months ago

I remember we filtered unneeded triplets at the predicate level rather than at the image level during inference. I hope this helps.
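Predicate-level filtering of this kind can be sketched roughly as follows, assuming triplets are stored as (subject, object, predicate) index tuples. This is an illustrative snippet, not the repository's actual code:

```python
def filter_triplets(relations, kept_predicates):
    """Predicate-level filtering: drop only the triplets whose predicate is
    outside the target vocabulary, keeping the image itself. An image is
    effectively excluded only when none of its triplets survive."""
    return [(s, o, p) for (s, o, p) in relations if p in kept_predicates]
```

By contrast, image-level filtering would drop a whole image whenever its annotations fell outside the kept predicates, which is what shrinks the test split.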

ZHUXUHAN commented 10 months ago

> I remember we filtered unneeded triplets in predicate-level instead of in image-level during inference, hoping to help you.

I think we do the same: if an image's annotations contain at least one wanted predicate, the image is used; if they contain none, it is not used.

ZHUXUHAN commented 10 months ago

> I remember we filtered unneeded triplets in predicate-level instead of in image-level during inference, hoping to help you.

> i think we do the same, if an image's annotaions have more than one wanted predicates, it will be used, if it has no one wanted predicates, it will not be used.

Is the test dataset "./open-world/VG-SGG-zs-random-EXPANDED-with-attri.h5"?

Yuqifan1117 commented 10 months ago

> but the VG+Cacao test dataset contains other categories of predicates (not only 50 categories), maybe you mean the original vg test dataset?

Sorry, I did evaluate on the VG+CaCao dataset, but the number of valid images (those with predicates in the target predicate classes) is more than 2,262, so it should be more than a small amount. We found that the corresponding dictionary in 'zs-random' seems to be contaminated, so I have updated the indices of the base classes and novel classes as follows:

    base_classes = [2, 7, 16, 18, 21, 23, 26, 32, 45, 47, 109, 111, 128, 132, 137, 143, 145, 182, 197, 210, 213, 215, 223, 237, 254, 284, 335, 362, 396, 415, 439, 450, 451, 452, 454]
    novel_classes = [207, 291, 242, 3, 103, 422, 216, 195, 124, 435, 156, 9, 79, 13, 226]

I will check and update the 'idx_to_predicate' and 'predicate_to_idx' information in the future.

ZHUXUHAN commented 10 months ago

> base_classes = [2, 7, 16, 18, 21, 23, 26, 32, 45, 47, 109, 111, 128, 132, 137, 143, 145, 182, 197, 210, 213, 215, 223, 237, 254, 284, 335, 362, 396, 415, 439, 450, 451, 452, 454] novel_classes = [207, 291, 242, 3, 103, 422, 216, 195, 124, 435, 156, 9, 79, 13, 226]

OK, I will retrain the baseline model with this mapping.

ZHUXUHAN commented 10 months ago

Hi, my Motifs baseline's recall is about 20 on the base classes and about 10 on the novel classes, so the novel recall still seems a little low. Did you upload the Epic model? It seems to be much better than the baseline. By the way, you said earlier that you use N*N pairs, but according to the paper, Epic performs prompt learning at the triplet level. I think the amount of computation would be huge on some samples, because the number of pairs can be very large. How do you solve this problem?

Yuqifan1117 commented 10 months ago

It seems reasonable; the slightly lower novel recall could be caused by the quality of the CaCao triplet mapping. Together with follow-up work and other collaborators, we will sort out this part of the code later. Thank you for your understanding.

> for some samples, the number of pairs is huge, how do you solve this problem.

For training, we use the labeled pairs, while at inference we perform prompt learning on pseudo-labeled pairs and remove some candidate samples by confidence (the overhead is acceptable). By the way, we apply the triplet level only to image-aware prompts and the predicate level to text-aware prompts, so the overhead also stays acceptable.
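The confidence-based candidate removal mentioned here can be sketched as follows. This is a hypothetical illustration of cutting the candidate pairs down to a fixed budget before the more expensive triplet-level prompting, not the authors' actual code, and `top_k` is an assumed knob:

```python
def prune_candidates(pairs, scores, top_k=64):
    """Keep only the top_k most confident candidate pairs so the expensive
    triplet-level prompting runs on a fixed budget. `pairs` and `scores`
    are parallel lists; higher score means higher confidence."""
    order = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
    return [pairs[i] for i in order[:top_k]]
```

For N detected objects this caps the prompting cost at top_k candidates instead of up to N*(N-1) pairs.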