KAIST-AILab / SyncVSR

SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization (Interspeech 2024)
https://www.isca-archive.org/interspeech_2024/ahn24_interspeech.pdf
MIT License

Issues related to face detection #17

Closed davidingram123 closed 1 hour ago

davidingram123 commented 1 hour ago

[Image: screenshot 2024-10-19 201926]

Sorry to bother you, I have another question. The image above shows the contents of the .pkl file corresponding to "LRW/lipread_mp4/ANSWER/train/ANSWER_00845.mp4". It’s clear that it has extracted the wrong face; the correct face should be the person a bit more to the right. Is this normal? Is it an isolated case? I randomly opened one and found it to be incorrect. Is the .pkl file corresponding to ANSWER/train/ANSWER_00845.mp4 that you extracted also showing the same issue?

snoop2head commented 1 hour ago

Good point! These edge cases aren't very common, but they do happen. I can't check my ANSWER_00845.pkl file at the moment, but I recall encountering a few similar instances in the training dataset.

This is why I chose Mediapipe over other face detectors: it includes a face-tracking feature. Setting max_num_faces = 1 significantly reduces these errors compared to other face detectors.

You can see the relevant code here:
https://github.com/KAIST-AILab/SyncVSR/blob/db5e50e9677c815169c0587c17a52f20a50bd7d8/LRW/video/src/preprocess_roi.py#L17-L22

However, there's room for improvement. For instance, setting max_num_faces = 2 and adding a script that selects the centrally positioned face when multiple faces are detected could be beneficial. Currently, in the code block below, we simply select the first face that Mediapipe returns.

https://github.com/KAIST-AILab/SyncVSR/blob/db5e50e9677c815169c0587c17a52f20a50bd7d8/LRW/video/src/preprocess_roi.py#L36-L39
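
The central-face selection suggested above could look roughly like the sketch below. It is not code from the repository: the function names `face_center` and `select_central_face` are hypothetical, and it assumes each detected face is given as a list of normalized (x, y) landmark coordinates in [0, 1], with the frame center at (0.5, 0.5).

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]  # normalized (x, y), each assumed to lie in [0, 1]


def face_center(landmarks: Sequence[Point]) -> Point:
    """Return the centroid of one face's landmarks."""
    xs = [p[0] for p in landmarks]
    ys = [p[1] for p in landmarks]
    return sum(xs) / len(xs), sum(ys) / len(ys)


def select_central_face(faces: Sequence[Sequence[Point]]) -> int:
    """Return the index of the face whose centroid is closest to the frame center.

    Hypothetical helper: `faces` holds one landmark list per detected face.
    """
    if not faces:
        raise ValueError("no faces detected")

    def squared_distance_to_center(landmarks: Sequence[Point]) -> float:
        cx, cy = face_center(landmarks)
        return (cx - 0.5) ** 2 + (cy - 0.5) ** 2

    # Pick the face with the smallest squared distance to (0.5, 0.5).
    return min(range(len(faces)), key=lambda i: squared_distance_to_center(faces[i]))
```

With Mediapipe FaceMesh, each entry of `results.multi_face_landmarks` exposes a `landmark` list with normalized `x` and `y` fields, so the input could be built as `[[(lm.x, lm.y) for lm in face.landmark] for face in results.multi_face_landmarks]`. Whether the central face is the speaker in every LRW clip is an assumption that would need checking on the data.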

Despite these occasional misdetections, our model's reported performance is still reproducible.

davidingram123 commented 1 hour ago

Thank you, I understand.