astorfi / lip-reading-deeplearning

:unlock: Lip Reading - Cross Audio-Visual Recognition using 3D Architectures
Apache License 2.0

Complete noob questions - 1) model purpose? 2) pre-trained weights? 3) other languages? #10

Closed taewookim closed 6 years ago

taewookim commented 6 years ago

Excuse my complete noob-ness.

1) Is the model trying to determine whether the video (i.e., the shape of the lips) and the audio are synced?

2) Are there any pre-trained weights I can download and run?

3) Assuming my Q1 is correct, has anyone tested whether this model can accurately detect audio/video synchronization in non-English languages?

astorfi commented 6 years ago

@taewookim

  1. Yes, ideally the method should be able to do so.
  2. No. Unfortunately, due to data privacy constraints, the trained weights have not been released. The dataset itself is public, though: it is the BBC-Oxford 'Lip Reading in the Wild' (LRW) dataset.
  3. A similar model, without 3D convolutions or online pair selection, was proposed and implemented in the paper "Out of time: automated lip sync in the wild". We compared our method with that work, but our experiments did not extend to non-English languages.

taewookim commented 6 years ago

Thank you @astorfi. Regarding Q3: have you ever run the model on videos where the speakers are speaking a non-English language? The model doesn't have to be super accurate, but I was wondering whether it is 'good enough' to detect audio spoofing in videos of non-English speakers.

Suppose a spoofer were attempting to bypass a system that uses face and speech recognition. He would hold up a video that contains the victim's face and voice, recorded on, say, an iPad. He would be hiding from the detection camera (to defeat facial recognition) and would use his own voice, not the voice from the iPad (to defeat the speech recognition system).

A simple solution might be to just look at the time offset of the words and compare it with the time offset of when the lips move. Of course, this isn't perfect, but it's at least somewhere to start. Any idea what part of your code I could modify to detect this?
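
To make the idea concrete, here is a minimal sketch of that offset comparison. It is not taken from this repository, whose model learns audio-visual embeddings rather than comparing raw timings. It assumes two per-frame signals have already been extracted and sampled at the same frame rate: `audio_energy` (e.g. speech loudness per frame) and `mouth_opening` (e.g. lip distance per frame). Those names, `estimate_offset`, and `max_lag` are all hypothetical, and how the signals are extracted is not shown.

```python
from math import sqrt


def normalize(signal):
    """Return a zero-mean, unit-variance copy (all zeros if the input is constant)."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    std = sqrt(sum(c * c for c in centered) / n)
    return [c / std for c in centered] if std > 0 else [0.0] * n


def estimate_offset(audio_energy, mouth_opening, max_lag):
    """Find the lag (in frames) that best aligns the audio with the lip signal.

    Returns (best_lag, best_score). A positive best_lag means the audio
    trails the lip movement by that many frames. best_score is the mean
    product of the normalized signals at that lag (higher = better match).
    """
    a = normalize(audio_energy)
    v = normalize(mouth_opening)
    n = min(len(a), len(v))
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        total, count = 0.0, 0
        for t in range(n):
            u = t + lag  # audio frame paired with video frame t
            if 0 <= u < n:
                total += v[t] * a[u]
                count += 1
        if count == 0:
            continue  # no overlap at this lag
        score = total / count
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score
```

A large absolute lag, or a low score at every lag, would hint that the voice does not belong to the moving lips. Any thresholds for "large" or "low" would have to be tuned on real data, and this kind of brute-force cross-correlation is only a starting point, not a robust spoofing detector.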

astorfi commented 6 years ago

No, I personally did not run it on a non-English dataset, but the paper I mentioned (Out of time: automated lip sync in the wild) did. As for the question you are asking, unfortunately I am not an expert.

taewookim commented 6 years ago

Thank you.