OFA-Sys / ONE-PEACE

A general representation model across vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Apache License 2.0

Why not use vision and audio together for video classification? #19

Closed aixiaodewugege closed 1 year ago

aixiaodewugege commented 1 year ago

Hi, thanks for your brilliant work!

I am curious why you don't combine the vision and audio representations for the video classification task, since you have already obtained them~~

Also, can ONE-PEACE be used for zero-shot detection or open-vocabulary detection?

logicwong commented 1 year ago

@aixiaodewugege

  1. Certainly, we could combine the vision and audio representations for the video classification task, but due to time and resource constraints, we have not yet attempted this in this version (see the fusion sketch below).
  2. I think ONE-PEACE can be used for zero-shot detection and open-vocabulary detection. @simonJJJ, can you give some advice?
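For illustration only, here is a minimal late-fusion sketch of that idea, assuming clip-level vision and audio embeddings have already been extracted with ONE-PEACE. The embedding dimensions and class count are placeholders, not values confirmed by the authors.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Toy late-fusion head over pre-extracted vision and audio clip embeddings."""

    def __init__(self, vision_dim=1536, audio_dim=1536, num_classes=400):
        super().__init__()
        # Simple linear classifier over the concatenated clip-level embeddings.
        self.classifier = nn.Linear(vision_dim + audio_dim, num_classes)

    def forward(self, vision_emb, audio_emb):
        # vision_emb: (batch, vision_dim), audio_emb: (batch, audio_dim),
        # assumed to be clip-level features produced by the respective encoders.
        fused = torch.cat([vision_emb, audio_emb], dim=-1)
        return self.classifier(fused)

# Toy usage with random tensors standing in for real ONE-PEACE features.
head = LateFusionClassifier()
logits = head(torch.randn(2, 1536), torch.randn(2, 1536))
print(logits.shape)  # torch.Size([2, 400])
```

More elaborate fusion (e.g. cross-attention between the two streams) is possible, but a concatenation head like this is the simplest baseline to try first.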
aixiaodewugege commented 1 year ago

Thanks for your reply~~

Do you know of any work that focuses on grounding open-vocabulary detection?

simonJJJ commented 1 year ago

@aixiaodewugege You can refer to link.

aixiaodewugege commented 1 year ago

> @aixiaodewugege You can refer to link.

Thanks~~ But I am looking for a better one, since grounding_dino has only released a tiny version, and its performance is not satisfactory.

logicwong commented 1 year ago

@aixiaodewugege As demonstrated in our paper, the ONE-PEACE checkpoint fine-tuned on RefCOCOg already exhibits some capability in grounding open-vocabulary detection. For better performance, it's advisable to collect more grounding datasets to train ONE-PEACE; I think this would yield a strong model for grounding open-vocabulary detection.

aixiaodewugege commented 1 year ago

Thanks!

By the way, I'm quite intrigued as to how the model identifies Tony Tony Chopper. I mean, where does it acquire such knowledge from?

And are you planning to share an inference script, rather than only the evaluation one?

logicwong commented 1 year ago

I think it acquires this knowledge from pretraining. The pretraining datasets used by ONE-PEACE may contain a large number of anime images, so ONE-PEACE implicitly learns to associate anime characters (text) with their corresponding regions in the images during pretraining. Fine-tuning on the grounding datasets simply teaches the model how to "output" the corresponding regions.

I am considering providing a Colab notebook to reproduce the cases in the paper, but I'm uncertain when it will be ready. Maybe next week, I guess.

logicwong commented 1 year ago

@aixiaodewugege Hi, we have provided the visual grounding API here. The results of our API are even better than those reported in the paper, as it is capable of accurately locating Brook. Have fun :)

logicwong commented 1 year ago

@aixiaodewugege Hi, we recently evaluated ONE-PEACE on VGGSound using both vision and audio information, and we achieved a score of 68.2, a new SOTA on this dataset. We hope this information is helpful to you.

aixiaodewugege commented 1 year ago

Hi, good to hear that! I think VGGSound is a dataset where sound plays an important role in determining the label. How about Kinetics400? Do you think audio will improve the results there? Additionally, have you considered replacing the language adapter with a pretrained LLM?

logicwong commented 1 year ago

@aixiaodewugege

  1. I think audio can improve the results on Kinetics400 too.
  2. How about directly transforming a pretrained LLM into ONE-PEACE? We could simply add some adapters and FFNs to the existing LLM and apply the inter-/intra-modal contrastive losses, using the newly released powerful LLM Qwen (see the toy sketch below).
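To make that concrete, here is a toy sketch of the idea, not ONE-PEACE or Qwen code: a small frozen Transformer encoder stands in for the pretrained LLM, lightweight modality adapters feed it, and a symmetric InfoNCE loss plays the role of the inter-modal contrastive objective. All module names and dimensions are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdapterOnLLM(nn.Module):
    """Frozen shared backbone (stand-in for a pretrained LLM) with per-modality adapters."""

    def __init__(self, llm_dim=512, proj_dim=256):
        super().__init__()
        # Stand-in for the pretrained LLM; in practice this would be a frozen Qwen-style model.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        for p in self.backbone.parameters():
            p.requires_grad = False  # keep the "LLM" frozen; only adapters/head are trained
        # Lightweight adapters mapping raw modality features into the backbone's space.
        self.vision_adapter = nn.Linear(768, llm_dim)   # e.g. image patch features
        self.text_adapter = nn.Linear(300, llm_dim)     # e.g. token embeddings
        self.head = nn.Linear(llm_dim, proj_dim)        # shared projection for the CL loss

    def encode(self, feats, adapter):
        hidden = self.backbone(adapter(feats))          # (batch, seq, llm_dim)
        return F.normalize(self.head(hidden.mean(dim=1)), dim=-1)

    def forward(self, vision_feats, text_feats, temperature=0.07):
        v = self.encode(vision_feats, self.vision_adapter)
        t = self.encode(text_feats, self.text_adapter)
        logits = v @ t.t() / temperature                # inter-modal similarity matrix
        labels = torch.arange(v.size(0))
        # Symmetric InfoNCE-style contrastive loss between the two modalities.
        return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

model = AdapterOnLLM()
loss = model(torch.randn(4, 16, 768), torch.randn(4, 12, 300))
print(loss.item())
```

The intra-modal objective and denoising losses used in ONE-PEACE would be added on top of this; the sketch only shows how adapters could bolt new modalities onto a frozen language backbone.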