Rudrabha / Wav2Lip

This repository contains the code for "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild", published at ACM Multimedia 2020. For an HD commercial model, please try out Sync Labs: https://synclabs.so

About HD video #249

Closed. 15458wew closed this issue 3 years ago.

15458wew commented 3 years ago

Hello, thank you for open-sourcing such good code. I am trying to modify the code and experiment with AVSpeech to generate higher-definition videos. I have run into a few problems and would like to discuss them with you.

  1. In your introduction, you said that many videos downloaded from the Internet need to be re-aligned. I don't know whether AVSpeech needs to be aligned as well. Could you give me some suggestions if it does?
  2. I tried to download the highest-definition AVSpeech videos, but the resolution of the detected faces is still not very high. I am not sure whether I can generate a higher-definition face from them.
  3. I noticed that many AVSpeech videos contain multiple faces and audio from multiple speakers. Having read your code, I think it cannot handle this situation. Could you give me some suggestions?

Thank you

Rudrabha commented 3 years ago
  1. AVSpeech videos should be re-aligned using the code in [this](https://github.com/joonson/syncnet_python) repo.
  2. Please let us know if you find a higher-resolution dataset than AVSpeech. The AVSpeech videos were not collected specifically for lip-sync, so a dataset that specifically collects multi-speaker data in 4K would be useful for us.
  3. AVSpeech provides coordinates of the active speaker. We use those, followed by another round of face detection, to minimize the amount of background. Our code only takes a single face at a time along with its corresponding audio; the architecture accepts one frame (with a face) at a time.
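
For reference, here is a rough, untested sketch of that single-face cropping step, assuming the `face_detection` (S3FD-based) module bundled with this repo and roughly following what `preprocess.py` does; the batch size, device string, and function name are illustrative only.

```python
import cv2
import numpy as np
import face_detection  # S3FD-based detector bundled with this repo

# One detector instance; 'cuda' assumes a GPU is available, otherwise use 'cpu'.
fa = face_detection.FaceAlignment(face_detection.LandmarksType._2D,
                                  flip_input=False, device='cuda')

def crop_faces(video_path, batch_size=16):
    """Return one tight face crop per frame, roughly mirroring preprocess.py."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()

    crops = []
    for i in range(0, len(frames), batch_size):
        batch = frames[i:i + batch_size]
        preds = fa.get_detections_for_batch(np.asarray(batch))
        for frame, rect in zip(batch, preds):
            if rect is None:      # no face detected in this frame
                continue
            x1, y1, x2, y2 = rect
            crops.append(frame[y1:y2, x1:x2])
    return crops
```
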
15458wew commented 3 years ago

Thanks for your reply. If I find a high-definition video dataset, I will share it with you. If I use AVSpeech for training, do I need to make changes to the current code? In addition, I noticed that the LRS2 dataset has a lot of videos with side-facing speakers. S3FD only extracts an approximate bounding box of the face and does not frontalize side faces. Will side-face videos affect the network?

alex-uspenskyi commented 3 years ago

Hey there! Any success with the AVSpeech dataset or any other HD dataset? I'm trying to adapt AVSpeech but maybe there is a better way.

PMA65 commented 3 years ago

> 1. AVSpeech videos should be re-aligned using the code in [this](https://github.com/joonson/syncnet_python) repo.
>
> 2. Please let us know if you find a higher-resolution dataset than AVSpeech. The AVSpeech videos were not collected specifically for lip-sync, so a dataset that specifically collects multi-speaker data in 4K would be useful for us.
>
> 3. AVSpeech provides coordinates of the active speaker. We use those, followed by another round of face detection, to minimize the amount of background. Our code only takes a single face at a time along with its corresponding audio; the architecture accepts one frame (with a face) at a time.

@Rudrabha Could you please explain what alignment you mean? Could you be more specific about which part of the code in the repo you linked does the alignment? Thanks

PMA65 commented 3 years ago

@Rudrabha Is this the alignment problem you mean? I found this AVSpeech downloader repo: https://github.com/changil/avspeech-downloader. It mentions the following known issue:

> **Known issue**
>
> FFmpeg uses keyframe seeking when stream copying, which happens with `faster=2`. When a cut does not start from a keyframe, which happens most of the time, it cuts the video at the closest preceding keyframe and sets a negative start time to compensate for it. Thus, any subsequent tools that take the cut video clips as input should take the start time into account. Most video players do, but if you process video clips programmatically, chances are you need to handle this yourself and discard the first part of both the audio and video streams accordingly.
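
One untested way to account for this in Python (assuming the decoded frames from OpenCV and the audio extracted by FFmpeg both start at the preceding keyframe): read the container `start_time` with ffprobe and, if it is negative, drop that much pre-roll from both streams before feeding them to the preprocessing code. The function names, temp wav path, and 16 kHz sample rate below are just illustrative.

```python
import subprocess
import cv2
import librosa

def container_start_time(path):
    """Read the container start_time via ffprobe (negative for keyframe-seeked cuts)."""
    out = subprocess.check_output([
        "ffprobe", "-v", "error",
        "-show_entries", "format=start_time",
        "-of", "default=noprint_wrappers=1:nokey=1",
        path,
    ])
    return float(out.strip())

def load_trimmed(path, sr=16000, tmp_wav="temp.wav"):
    """Decode frames and audio, discarding the pre-roll before the intended start."""
    skip = max(0.0, -container_start_time(path))  # seconds of pre-roll to drop

    # Video: read all frames, then drop the leading skip * fps frames.
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()
    frames = frames[int(round(skip * fps)):]

    # Audio: dump to wav with ffmpeg, load it, then drop the leading skip * sr samples.
    subprocess.check_call(["ffmpeg", "-y", "-loglevel", "error",
                           "-i", path, "-ar", str(sr), tmp_wav])
    wav, _ = librosa.load(tmp_wav, sr=sr)
    wav = wav[int(round(skip * sr)):]

    return frames, wav, fps
```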