facebookresearch / muavic

MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation

Incorrect mouth crops #23

Open ahaliassos opened 1 month ago

ahaliassos commented 1 month ago

Hi,

Congrats on your work!

I have run the scripts to download and extract the clips, but when I tried to inspect some clips I noticed that they don't depict the mouth, for example for muavic/es/video/test/jej8qlzlAGw/jej8qlzlAGw_0018.mp4. I wonder if the face detector / alignment failed for many examples.

Could you please point me to a path (in the form of e.g., muavic/es/video/test/jej8qlzlAGw/jej8qlzlAGw_0018.mp4) of a cropped video that is centered around the mouth so that I can check that it's also the case for me locally? Because I have been looking at videos and I can't find one that is centered around the mouth and I'm wondering if I did something wrong.

Many thanks!

longkhanh-fam commented 1 month ago

I'm currently seeing the same problem for the ar and de languages. The mtedx folder still contains the original videos, but the output folders contain only black-screen videos instead of mouth crops.

Have you fixed it?

roudimit commented 1 month ago

Hey @ahaliassos, I wrote up some suggestions and put the cropped video / landmarks here for debugging

sungnyun commented 1 month ago

I'm facing the same problem: all the cropped videos are just black screens, while the original videos are fine. Is the metadata broken?

+) The landmark metadata @roudimit provided is the same as mine, so I think the problem comes from the cropping step.

sungnyun commented 1 month ago

OK, I think I found the cause: my YouTube downloads were somehow set to 360p, so the pixel coordinates in the metadata fell outside the video resolution.

It's worth checking whether your downloaded videos are 1080p, which is what the metadata should be based on.
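For anyone who wants to confirm this on their own download, here is a minimal sketch of the check. It assumes the landmarks for a frame are available as a list of (x, y) pixel coordinates; that layout, the function name, and the sample points are hypothetical and are not taken from the MuAViC metadata format.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]


def out_of_frame_points(landmarks: Iterable[Point], width: int, height: int) -> List[Point]:
    """Return the landmark points that fall outside a width x height frame."""
    return [
        (x, y)
        for x, y in landmarks
        if not (0 <= x < width and 0 <= y < height)
    ]


# Hypothetical mouth landmarks, as if annotated on a 1920x1080 frame.
mouth = [(900.0, 700.0), (1000.0, 720.0)]

# Against a 640x360 download, both points lie outside the frame.
print(out_of_frame_points(mouth, 640, 360))    # [(900.0, 700.0), (1000.0, 720.0)]

# Against a 1920x1080 download, every point is inside the frame.
print(out_of_frame_points(mouth, 1920, 1080))  # []
```

If most landmarks come back as out of frame, a resolution mismatch like the one described above is a likely explanation for the black crops.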

longkhanh-fam commented 1 month ago

@sungnyun you're right; however, not all of the videos are 1080p. For example, the video in de/video/test/r2tvb4-i4EE is only 720p. The 720p video does work, but the quality isn't the best. When I compared my output for this 720p video with the video @roudimit provided, the results weren't identical. This may affect the fairness of later benchmarking. Do you have any ideas?

roudimit commented 1 month ago

Nice debugging! I checked ffprobe mtedx/video/de/test/r2tvb4-i4EE.mp4 and the output was:

  Duration: 00:11:10.52, start: 0.000000, bitrate: 1386 kb/s
    Stream #0:0(und): Video: h264 (Main) (avc1 / 0x31637661), yuv420p, 1280x720 [SAR 1:1 DAR 16:9], 1257 kb/s, 25 fps, 25 tbr, 90k tbn, 50 tbc (default)

So I guess my video is also in 720p. In general, I'm guessing some are 1080p and some are lower resolution. @longkhanh-fam what does your cropped video look like?
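To check the resolution of many downloaded videos without reading the full ffprobe banner, something like the sketch below may help. The helper names are made up for illustration; the ffprobe options used (`-v error`, `-select_streams`, `-show_entries`, `-of csv=p=0`) are standard, and the script assumes ffprobe is on the PATH.

```python
import subprocess
from typing import List, Tuple


def ffprobe_resolution_cmd(path: str) -> List[str]:
    """Build an ffprobe command that prints 'width,height' for the first video stream."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "v:0",
        "-show_entries", "stream=width,height",
        "-of", "csv=p=0",
        path,
    ]


def parse_resolution(output: str) -> Tuple[int, int]:
    """Parse ffprobe's 'width,height' line into a pair of integers."""
    width, height = output.strip().splitlines()[0].split(",")[:2]
    return int(width), int(height)


def video_resolution(path: str) -> Tuple[int, int]:
    """Run ffprobe on a file and return (width, height). Requires ffprobe."""
    result = subprocess.run(
        ffprobe_resolution_cmd(path),
        capture_output=True, text=True, check=True,
    )
    return parse_resolution(result.stdout)
```

For the file above, `video_resolution("mtedx/video/de/test/r2tvb4-i4EE.mp4")` should report 1280 by 720, matching the stream line in the ffprobe output.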

longkhanh-fam commented 4 weeks ago

Sorry for my late reply. I've uploaded my cropped videos here. Comparing our videos, mine clearly includes the eyes and nose, whereas yours doesn't. I also checked the video statistics using ffprobe, and they match. The discrepancy could be due to my code or to the video quality that was available for download; I'll check.

  Duration: 00:11:10.56, start: 0.000000, bitrate: 1111 kb/s
    Stream #0:0(und): Video: h264 (Main) (avc1 / 0x31637661), yuv420p(tv, bt709), 1280x720 [SAR 1:1 DAR 16:9], 1109 kb/s, 25 fps, 25 tbr, 90k tbn, 50 tbc (default)

Additionally, could you please double-check the remaining video? The specific path is muavic/de/video/test/pR_8SsedSLI/pR_8SsedSLI_0000.mp4. I noticed that the mouth isn't fully captured in the crop. Have you encountered a similar issue?

Sorry to bother you with this, and thank you for your help!

roudimit commented 3 weeks ago

Your Dropbox link doesn't work; can you check the sharing settings? I noticed your bitrate is lower, so maybe your video was downloaded from YouTube at lower quality.

Here's 'muavic/de/video/test/pR_8SsedSLI/pR_8SsedSLI_0000.mp4'. As you can see, most of the video doesn't show a speaker, and the final part isn't cropped on the person's face. https://github.com/user-attachments/assets/e28ab1dd-98b2-44bc-9115-9d7d5cad9d8c

FYI, it's a known issue that many of the multilingual mTEDx videos don't have the speaker visible. "Visual Speech Recognition for Multiple Languages" proposed filtering the mTEDx videos to keep only those where the speaker is visible, which leaves much less data.