JacobYuan7 / RLIPv2

[ICCV 2023] RLIPv2: Fast Scaling of Relational Language-Image Pre-training

Is the model capable of detecting open-vocabulary objects, like Grounding DINO does? #2

aixiaodewugege opened this issue 1 year ago

aixiaodewugege commented 1 year ago

Thanks for your brilliant work!

I'm wondering if the model can detect arbitrary objects, the way Grounding DINO does?

JacobYuan7 commented 1 year ago

@aixiaodewugege
Hi, many thanks for your interest in my work. Yes, you've grasped the concept accurately. Due to the annotation style of Visual Genome and the pseudo-labelled Objects365, it has such ability. Nevertheless, I have to confess that it is not as capable as Grounding DINO, primarily due to the dataset's scale and the nature of the pseudo-labelled annotations.

aixiaodewugege commented 1 year ago

Thanks for your reply!

Have you considered the possibility that HOI could improve accuracy on action recognition benchmarks like Kinetics-400, given that you report strong zero-shot performance?

JacobYuan7 commented 1 year ago

@aixiaodewugege My answer is yes. It is definitely reasonable to expect that introducing extra cues from a relation detection model can boost action recognition performance. However, if I were you, I would start from a fine-tuned model (Swin-L, perhaps), since the HICO-DET dataset covers a wide range of object classes and verb classes.
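
A rough, hypothetical sketch of what such "extra cues" could look like in practice is below. Nothing in it is RLIPv2 API: the late-fusion head, the tensor shapes, and the 117-verb setting (HICO-DET's verb count) are placeholders you would replace with outputs from a real HOI detector and a real video backbone.

```python
# Hypothetical late-fusion sketch: combine per-frame HOI relation scores
# with clip-level features from a video action-recognition model.
# Neither "model" is real here; the random tensors below stand in for
# outputs of an actual HOI detector and video backbone.
import torch
import torch.nn as nn

class LateFusionActionHead(nn.Module):
    def __init__(self, num_actions, video_dim=768, num_relations=117):
        super().__init__()
        # Verb/relation scores (e.g. HICO-DET's 117 verbs) are pooled over
        # frames and detection queries, projected, and concatenated with
        # the video feature before classification.
        self.rel_proj = nn.Linear(num_relations, 256)
        self.classifier = nn.Linear(video_dim + 256, num_actions)

    def forward(self, video_feat, rel_scores):
        # video_feat: (B, video_dim)            clip-level video feature
        # rel_scores: (B, T, Q, num_relations)  per-frame, per-query verb scores
        rel_cue = rel_scores.amax(dim=2).mean(dim=1)  # max over queries, mean over frames
        fused = torch.cat([video_feat, self.rel_proj(rel_cue)], dim=-1)
        return self.classifier(fused)

# Toy usage with random tensors standing in for real model outputs.
head = LateFusionActionHead(num_actions=400)  # e.g. Kinetics-400
logits = head(torch.randn(2, 768), torch.rand(2, 8, 64, 117))
print(logits.shape)  # torch.Size([2, 400])
```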

aixiaodewugege commented 1 year ago

Thanks!

Do you have any suggestions on how I can integrate a video-based action recognition model with an image-based HOI model? Should I use the same image encoder, like mPLUG?

JacobYuan7 commented 1 year ago

@aixiaodewugege I have doubts about using a shared image encoder, since detection backbones usually need to be fine-tuned together with the detection head. It is only viable if the backbone of the action recognition model and the backbone of the HOI detection model are trained jointly.

aixiaodewugege commented 1 year ago

Thanks!

I'm new to HOI (Human-Object Interaction) and action recognition. The two tasks use different datasets. I'm curious why there haven't been attempts to combine them, given that doing so seems beneficial for both.

JacobYuan7 commented 1 year ago

@aixiaodewugege It is definitely worth trying. But I suspect that if the goal is to publish a research paper, the novelty of simply combining the two would need careful assessment.

aixiaodewugege commented 1 year ago

Thanks for your patience! It is really helpful!

I have a few questions about your method.

  1. Does the label sequence mean the caption of the image? If so, does that mean that when running inference on an image, we first have to use BLIP to generate its caption? Or does it mean the candidate relations that I care about; for example, if I only want to detect the two verbs 'eat' and 'drink', should I put those words into the label sequence?

  2. RLIPv2 is reused to tag relations. How is the first version of RLIPv2 trained? Is it trained on the VG dataset, since VG already has relation annotations?

JacobYuan7 commented 1 year ago

@aixiaodewugege

  1. With respect to the label sequence, it is indeed a sequence of labels rather than a whole caption; you can refer to RLIPv1 for a clearer illustration. This sequence can therefore be dataset-specific rather than image-specific. For example, when we fine-tune on HICO-DET, the text labels in the label sequence are identical for all images: all possible object texts and verb texts in HICO-DET. Coming back to your question, your second reading is correct: if you only want to detect 'eat' and 'drink', you put those words into the label sequence (see the sketch after this list).
  2. I am not sure what you mean by the first version. If you are asking how R-Tagger is trained, I would recommend reading Sec. 4.2.2.
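
To make the first point concrete, here is a minimal, hypothetical sketch of how a dataset-specific label sequence could be assembled at inference time. The variable names and the commented-out model call are placeholders, not the actual RLIPv2 interface, which may differ.

```python
# Illustrative sketch only: building a dataset-specific label sequence.
# The commented-out forward call below is a hypothetical placeholder,
# not the real RLIPv2 API.

# All object classes the application cares about.
object_texts = ["person", "cup", "bottle", "bicycle", "dog"]

# Restrict the verbs to only the relations of interest.
verb_texts = ["eat", "drink"]

# The same sequence is reused for every image (dataset-specific, not
# image-specific), so no captioner such as BLIP is needed at inference time.
label_sequence = object_texts + verb_texts
print(label_sequence)

# Hypothetical inference call: the detector matches its queries against the
# encoded object/verb texts instead of a fixed classifier head.
# outputs = model(image, obj_texts=object_texts, verb_texts=verb_texts)
```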