Closed — aixiaodewugege closed this issue 1 year ago
@aixiaodewugege
Thanks for your reply~~
Do you know of any work that focuses on grounding open vocabulary detection?
@aixiaodewugege You can refer to link.
Thanks~~ But I am looking for a better one, since Grounding DINO has only released a tiny version, and its performance is not satisfactory.
@aixiaodewugege As demonstrated in our paper, the ckpt of ONE-PEACE fine-tuned on RefCOCOg exhibits some capabilities in grounding open vocabulary detection. For better performance, it's advisable to collect more grounding datasets to train ONE-PEACE. I think this will yield a strong model for grounding open vocabulary detection.
Thanks!
By the way, I'm quite intrigued as to how the model identifies Tony Tony Chopper. I mean, where does it acquire such knowledge from?
And are you planning to share an inference script, rather than just the evaluation one?
I think it acquires this knowledge from pretraining. The pre-training datasets used by ONE-PEACE may contain a large number of anime images. ONE-PEACE implicitly learns to associate the anime characters (text) with their corresponding regions in the images during pretraining. Fine-tuning on the grounding datasets simply instructs the model on how to "output" the corresponding regions.
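To make "outputting regions" concrete: a common scheme in sequence-to-sequence grounding models is to discretize box coordinates into location tokens that the decoder can emit like ordinary text. The sketch below illustrates the idea only; the bin count, token format, and helper name are assumptions, not ONE-PEACE's actual configuration.

```python
# Illustrative sketch: turning a pixel-space bounding box into discrete
# "location tokens", as done in some seq2seq grounding models.
# num_bins=1000 and the <bin_N> token format are assumptions for illustration.

def box_to_tokens(box, img_w, img_h, num_bins=1000):
    """Map a pixel-space (x0, y0, x1, y1) box to location tokens."""
    x0, y0, x1, y1 = box

    def quantize(v, size):
        # Normalize to [0, 1], then bucket into num_bins discrete bins.
        return min(int(v / size * num_bins), num_bins - 1)

    return [f"<bin_{quantize(x0, img_w)}>",
            f"<bin_{quantize(y0, img_h)}>",
            f"<bin_{quantize(x1, img_w)}>",
            f"<bin_{quantize(y1, img_h)}>"]

# Example: a box in a 640x480 image.
print(box_to_tokens((64, 48, 320, 240), 640, 480))
# → ['<bin_100>', '<bin_100>', '<bin_500>', '<bin_500>']
```

During fine-tuning the model simply learns to predict these tokens after the text query, so grounding becomes another text-generation task.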
I'm considering providing a Colab notebook to reproduce the cases in the paper, but I'm uncertain about when it will be ready. Maybe next week, I guess.
@aixiaodewugege Hi, we have provided the visual grounding API here. The results of our API are even better than what was reported in the paper, as it is capable of accurately locating Brook. Have fun :)
@aixiaodewugege Hi, we recently evaluated ONE-PEACE on VGGSound using both vision and audio information, and we achieved a score of 68.2, a new SOTA on this dataset. We hope this information is helpful to you.
Hi, good to hear that! I think VGGSound is a dataset where sound plays an important role in determining the label. How about Kinetics400? Do you think audio will improve the results there? Additionally, have you considered replacing the language adapter with a pretrained LLM?
@aixiaodewugege
Hi, thanks for your brilliant work!
I am curious why you don't combine the vision and audio representations for the video classification task, since you have already obtained both of them~~
Also, can ONE-PEACE be used for zero-shot detection or open vocabulary detection?