Closed aretius closed 4 years ago
You are trying to load 1800 frames (maybe even HD?) at once. You might need to modify the code to load frames on the fly, or split your audio and video into several chunks and process them independently.
@prajwalkr Yes, they are HD 🤣 Could you please point me in the right direction in the code? What should I change, and how? That would be really helpful.
The existing code loads all the video frames, performs face detection on all of them, and yields batches of faces and audio segments. Instead of changing the code, I would suggest writing a separate script of your own that splits your long audio and video into one- to two-second chunks, syncs each chunk, generates the corresponding short videos, and combines them again afterwards. You can do this with FFmpeg and OpenCV.
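The split-then-recombine approach can be sketched as below. The chunk length, file names, and ffmpeg flags here are illustrative assumptions, not part of the LipGAN codebase.

```python
import subprocess

def chunk_spans(total_seconds, chunk_seconds=2.0):
    """Return (start, duration) pairs covering the whole clip."""
    spans = []
    t = 0.0
    while t < total_seconds:
        spans.append((t, min(chunk_seconds, total_seconds - t)))
        t += chunk_seconds
    return spans

def split_media(path, total_seconds, chunk_seconds=2.0, out_tpl="chunk_%03d.mp4"):
    """Cut `path` into short segments with ffmpeg (stream copy, no re-encode).
    Each chunk can then be lip-synced independently and concatenated later."""
    for i, (start, dur) in enumerate(chunk_spans(total_seconds, chunk_seconds)):
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(start), "-t", str(dur),
             "-i", path, "-c", "copy", out_tpl % i],
            check=True)
```

Note that stream copy (`-c copy`) cuts on keyframes only; for frame-accurate chunk boundaries you would re-encode instead of copying.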
Got it, thanks! Also, when trying Use Case 1, is it possible to sync the audio with the new video?
Yes, Use Case 1 performs lip-syncing on a video to match any new audio.
Sorry for not being clear! I mean, once the resultant video is generated, does it also have the new audio file (which we feed as input) embedded in it?
Yes, it does. It is saved as result_voice.avi. The one without an audio overlay is saved as result.avi.
Great, thanks! So I reduced the frames to 30 and lowered the resolution, so the video file is now around 5 MB, and it seems to work. However, the output video is rotated by 90 degrees, so the lips are not aligned properly (my guess is that it gets read in as a rotated video). Any help with that?
I do not think the LipGAN code rotates the video. Please inspect the video in OpenCV separately and check.
Yes, figured that out; it seems to be a problem with mobile videos. I tried to test it on a video of mine (attached: result.zip); you can observe the video tearing at times. The results you showcased were amazing! I was wondering if there are any tips on what types of videos (resolution, FPS, category, etc.) work best with the model?
You can increase the lower padding so that the face crop fully covers the face, including the chin area. Refer to this line for more info.
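To make the padding idea concrete, here is a minimal sketch of extending the bottom edge of a detected face box. The function and the `pady2` value are hypothetical illustrations, not the repo's actual variables; tune the padding per video.

```python
def pad_face_box(y1, y2, x1, x2, frame_h, frame_w, pady2=20):
    """Extend the bottom of the detected face box by `pady2` pixels so the
    crop covers the chin, clamping to the frame height."""
    return y1, min(frame_h, y2 + pady2), x1, x2
```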
Hey @prajwalkr! Is it possible to highlight the face-crop area in each frame so that I can debug it more easily? Also, does the face crop need to cover the full face, or just the mouth area?
This line contains the coordinates of the face. If you want, you can draw a rectangle using OpenCV; the rectangle must be drawn on the variable f.
@Rudrabha @prajwalkr Hey guys, thanks for the prompt replies, much appreciated. I have attached an example that seems to be way out of sync. To provide more context:
I used Google's TTS service with the English Male WaveNet voice. The stats of my audio file seem to differ from those in audio_hparams.py (the sampling_rate, for example). Do you suggest using a different TTS service, one closer to the training distribution, or should I just change the parameters in some files?
Do you have any suggestions on how to fix the model for such videos? Basically, the speaker doesn't seem to pause at the right moments (the pauses are 1-2 seconds long), and the lip sync of individual words doesn't look right.
@prajwalkr Thanks for the info regarding "in the wild" videos with pauses. Just a small note regarding audio files: the parameters in audio_hparams.py don't match my audio file. Should I convert the audio to match the given parameters, or just leave it as is?
It does not matter, as the code resamples all audio files to the same sampling rate and so on.
Hey @prajwalkr, thanks for sharing the codebase and models for such a wonderful project, much appreciated! I have successfully replicated Use Case 2 (lip-syncing a still image) on both CPU and Google Colab.
However, for Use Case 1 (lip-syncing a video), both my CPU and Google Colab kill the process because it consumes a lot of RAM. I have a short text, about 75 characters long, and an input video of 30 seconds at 60 FPS. How do I get it to work here?