Closed aretius closed 4 years ago
You are trying to load 1800 frames (maybe even HD?) at once. You might need to modify the code to load frames on the fly, or split your audio and video into several chunks and process them independently.
@prajwalkr Yes, they are HD 🤣 Could you please point me in the right direction in the code? What should I change, and how? That would be really helpful.
The existing code loads all the video frames, performs face detection on all of them, and yields batches of faces and audio segments. Instead of changing the code, I would suggest writing a separate script of your own that splits your long audio and video into one- to two-second chunks, syncs each chunk, generates the corresponding short videos, and combines them again afterwards. You can do this with FFmpeg and OpenCV.
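The split-then-recombine approach can be sketched as below. The chunk length, file names, and ffmpeg flags here are illustrative assumptions, not part of the LipGAN codebase.

```python
import subprocess

def chunk_spans(total_seconds, chunk_seconds=2.0):
    """Return (start, duration) pairs covering the whole clip."""
    spans = []
    t = 0.0
    while t < total_seconds:
        spans.append((t, min(chunk_seconds, total_seconds - t)))
        t += chunk_seconds
    return spans

def split_media(path, total_seconds, chunk_seconds=2.0, out_tpl="chunk_%03d.mp4"):
    """Cut `path` into short segments with ffmpeg (stream copy, no re-encode).
    Each chunk can then be lip-synced independently and concatenated later."""
    for i, (start, dur) in enumerate(chunk_spans(total_seconds, chunk_seconds)):
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(start), "-t", str(dur),
             "-i", path, "-c", "copy", out_tpl % i],
            check=True)
```

Note that stream copy (`-c copy`) cuts on keyframes only; for frame-accurate chunk boundaries you would re-encode instead of copying.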
Got it, thanks! Also, when trying Use Case 1, is it possible to sync the audio with the new video?
Yes, Use Case 1 performs lip-syncing on a video to match any new audio.
Sorry for not being clear! I mean, once the resultant video is generated, does it also have the new audio file (which we feed as input) embedded in it?
Yes, it does. It is saved as result_voice.avi. The one without an audio overlay is saved as result.avi.
Great, thanks! So I reduced the frames to 30 and lowered the resolution, so the video file is now around 5 MB, and it seems to work. However, the output video is rotated by 90 degrees, so the lips are not aligned properly (my guess is that it gets read in as a rotated video). Any help with that?
I do not think the LipGAN code rotates the video. Please inspect the video in OpenCV separately and check.
Yes, figured that out; it seems to be a problem with mobile videos. I tried to test it on a video of mine (attached: result.zip); you can observe the video tearing at times. The results you showcased were amazing! I was wondering if there are any tips on what types of videos (resolution, FPS, category, etc.) work best with the model?
You can increase the lower padding so that the face crop fully covers the face, including the chin area. Refer to this line for more info.
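To make the padding idea concrete, here is a minimal sketch of extending the bottom edge of a detected face box. The function and the `pady2` value are hypothetical illustrations, not the repo's actual variables; tune the padding per video.

```python
def pad_face_box(y1, y2, x1, x2, frame_h, frame_w, pady2=20):
    """Extend the bottom of the detected face box by `pady2` pixels so the
    crop covers the chin, clamping to the frame height."""
    return y1, min(frame_h, y2 + pady2), x1, x2
```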
Hey @prajwalkr! Is it possible to highlight the face-crop area in each frame so that I can debug it more easily? Also, does the face crop need to cover the full face, or just the mouth area?
This line contains the coordinates of the face. If you want, you can draw a rectangle using OpenCV; the rectangle must be drawn on the variable f.
@Rudrabha @prajwalkr Hey guys, thanks for the prompt replies, much appreciated. I have attached an example that seems to be way out of sync. To provide more context:
I used Google's TTS service with the English Male WaveNet voice. The stats of my audio file seem to differ from those in audio_hparams.py (the sampling_rate, for example). Do you suggest using a different TTS service, one closer to the training distribution, or should I just change the parameters in some files?
Do you have any suggestions on how to fix the model for such videos? Basically, the speaker doesn't seem to pause at the right moments (the pauses are 1-2 seconds long), and the lip sync of individual words doesn't look right.
@prajwalkr Thanks for the info regarding "in the wild" videos with pauses. Just a small note regarding audio files: the parameters in audio_hparams.py don't match my audio file. Should I convert the audio to match the given parameters, or just leave it as is?
It does not matter, as the code resamples all audio files to the same sampling rate and so on.
Hey @prajwalkr, thanks for sharing the codebase and models for such a wonderful project, much appreciated! I have successfully replicated Use Case 2 (lip-syncing a still image) on both CPU and Google Colab.
However, for Use Case 1 (lip-syncing a video), both my CPU and Google Colab kill the process because it consumes a lot of RAM. I have a short text, about 75 characters long, and an input video of 30 seconds at 60 FPS. How do I get it to work here?