OFA-Sys / ONE-PEACE

A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Apache License 2.0

Fine-grained alignment between audio and other modalities #10

Closed Ming-er closed 1 year ago

Ming-er commented 1 year ago

Hi, thanks for your work, really nice one! I notice that the fine-grained alignment between vision and language can be verified by conducting experiments on corresponding tasks (such as referring image segmentation). However, I think the fine-grained alignment between audio and text might not be validated by the downstream tasks you chose, since AQA, audio classification, and audio-text retrieval are not sensitive to temporal order or temporal location. So, are there any results for low-level audio tasks such as sound event detection or audio grounding?

logicwong commented 1 year ago

@Ming-er Thank you for your suggestion. We haven't conducted experiments on sound event detection or audio grounding, and we didn't explore these tasks in our audio experiments. Could you provide links to the sound event detection and audio grounding datasets?

Ming-er commented 1 year ago

For SED, you could refer to https://github.com/DCASE-REPO/DESED_task, and for TAG, to https://github.com/wsntxxn/TextToAudioGrounding

logicwong commented 1 year ago

> For SED, you could refer to https://github.com/DCASE-REPO/DESED_task, and for TAG, to https://github.com/wsntxxn/TextToAudioGrounding

Thanks a lot! I will make time to conduct experiments on these two datasets.

logicwong commented 1 year ago

Apologies for my delayed response. Over the past few months, I have been occupied with other projects and haven't had sufficient time to test these datasets thoroughly. I would greatly appreciate it if you could help me test them and provide the scores.

logicwong commented 1 year ago

This issue is temporarily closed. If any relevant results become available later, we will update them in the repository.