facebookresearch / Ego4d

Ego4d dataset repository. Download the dataset, visualize, extract features & example usage of the dataset
https://ego4d-data.org/docs/
MIT License

Which benchmarks are audiovisual benchmarks? #81

Closed Hou9612 closed 2 years ago

Hou9612 commented 2 years ago

Thanks for this wonderful work!
The paper mentions that the "Audio-Visual Diarization" and "Talking to me" benchmarks both have audio and visual inputs. So I want to know: do all the other benchmarks have NO audio inputs? If not, which other benchmarks are audiovisual benchmarks?

miguelmartin75 commented 2 years ago

AV is the only benchmark with explicit annotations on top of the audio data (i.e. annotations corresponding to people speaking).

All benchmarks have some videos with audio (and some without). However, the baseline implementations do not utilize the audio inputs (outside of AV, of course). The audio inputs may be beneficial for some benchmarks, such as Episodic Memory (likely Moments or NLQ), but that is for someone to experiment with.
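If it helps, one quick way to check which of your downloaded videos actually contain an audio stream is to probe the files with ffprobe. This is just a rough sketch; the `v1/full_scale` directory below assumes the default CLI download layout, so adjust the path to your setup.

```python
# Minimal sketch: list which downloaded Ego4D videos contain an audio stream.
# Assumes videos live under <root>/v1/full_scale/*.mp4 (adjust to your layout)
# and that ffprobe (from ffmpeg) is available on PATH.
import subprocess
from pathlib import Path

def has_audio(video_path: Path) -> bool:
    """Return True if ffprobe reports at least one audio stream."""
    out = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-select_streams", "a",
            "-show_entries", "stream=codec_type",
            "-of", "csv=p=0",
            str(video_path),
        ],
        capture_output=True, text=True, check=True,
    ).stdout
    return "audio" in out

root = Path("~/ego4d_data/v1/full_scale").expanduser()  # hypothetical download root
for mp4 in sorted(root.glob("*.mp4")):
    print(mp4.name, "audio" if has_audio(mp4) else "no audio")
```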

Also, in case you haven't found it, I would recommend this page of the documentation: https://ego4d-data.org/docs/benchmarks/AV-diarization/#task-definition

Hope that answers your question :)

Hou9612 commented 2 years ago

@miguelmartin75 Thanks for your kind reply! Can I interpret the annotations as ground truth?

miguelmartin75 commented 2 years ago

Of course. Note that, just as with any other dataset, there may be noise in the ground truth (e.g. human error), and, as documented in the paper, the AV annotations do not cover the entire dataset.

I would recommend reading the annotation guidelines for the AV annotations so that you can use the data more effectively. See here: https://ego4d-data.org/docs/data/annotation-guidelines/#audio-visual-diarization--social-avs
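If you want to get a feel for how the labels are organized before building on them, a quick inspection of the annotation JSON like the sketch below can help. Note that the `v1/annotations/av_train.json` path is an assumption based on the usual CLI download layout; check the docs for the exact file names on your side.

```python
# Minimal sketch: peek at the AV diarization annotation JSON to see how the
# ground-truth labels are nested before writing any training/eval code.
# The path below is an assumption based on the default CLI download layout.
import json
from pathlib import Path

ann_path = Path("~/ego4d_data/v1/annotations/av_train.json").expanduser()
with ann_path.open() as f:
    av = json.load(f)

# Print the top-level keys and one example entry so the nesting is visible.
print("top-level keys:", list(av.keys()))
videos = av.get("videos", [])
print("num annotated videos:", len(videos))
if videos:
    print("per-video keys:", list(videos[0].keys()))
```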

Hou9612 commented 2 years ago

@miguelmartin75 Got it. Thanks very much!