DinoMan / speech-driven-animation


Absurd Output with CREMA-D pre-trained checkpoint #39

Closed avisekiit closed 4 years ago

avisekiit commented 4 years ago

Dear Authors, Thanks for open-sourcing your work. I ran into a strange issue. If I run the image.bmp file (which, I guess, should not be related to any particular dataset) with a .wav file from the CREMA-D dataset, the visual quality is very poor. You can see an attached frame with this thread. [Screenshot: 2020-02-08 at 1:38:30 AM]

I made sure to load the crema.dat file before running. Is there any extra step to satisfactorily run your CREMA-D pre-trained model?

Thanks, Avisek

DinoMan commented 4 years ago

The image is from GRID, so it will work with the GRID model and the LRW model (which is not publicly available). Since the models for the small datasets are trained on few speakers, they do not generalize well to subjects from other datasets. If you want, you can use the GRID model with the example.

avisekiit commented 4 years ago

Dear @DinoMan

Thanks for your reply. I also tried one of the subjects from CREMA-D. For example, I used dlib to detect and crop out a face from a video frame of subject 1057 and then resized it to 128 (height) x 96 (width). But the output still seems weird. Is any normalization/frontalization required after I detect and crop out the face with dlib?

Below is the input image to the network.

[Attached image: ref_1057_cropped]

DinoMan commented 4 years ago

The library will align and crop the face automatically, so there is no need to do this; just pass in any image that contains the face. If this is the image you are using, I can tell it is very different from the alignment I use.
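A minimal sketch of what this means in practice, assuming the `sda.VideoAnimator` / `save_video` API shown in the repository README; the file paths here are hypothetical:

```python
import sda

# Pass the full, uncropped frame directly; the library handles face detection,
# alignment, and cropping internally.
va = sda.VideoAnimator(gpu=-1, model_path="crema")
vid, aud = va("full_frame_1057.bmp", "example/1057.wav")  # hypothetical paths
va.save_video(vid, aud, "generated_1057.mp4")
```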

avisekiit commented 4 years ago

OK. So you are suggesting that I can just feed in the entire frame as it is, without worrying about face detection myself? Let me try it that way. Also, below is the bare-minimum code I am using to get some visual results. Does it look OK to you?

```python
import sda

subject = "1057"
va = sda.VideoAnimator(gpu=-1, model_path="crema")
vid, aud = va("crema_refframes/ref_{}_cropped.bmp".format(subject),
              "example/{}.wav".format(subject))
print("Shape:", vid.shape)
va.save_video(vid, aud, "generated_{}.mp4".format(subject))
```

DinoMan commented 4 years ago

Yes, the code looks fine to me. It will download a face-alignment library and use it to align the face; it should work just like the example in the documentation.

avisekiit commented 4 years ago

Thanks for your quick responses. I managed to get a more decent video this time. Does it look OK to you?

Anchor frame link: https://drive.google.com/open?id=1AUrhT5PsRWJd-XOxA64Eeje7OVAaJ78e

Audio Link: https://drive.google.com/open?id=1QAejDExnD0ZmZTpnDYG6lFP-4rLsQJOM

Generated Video Link: https://drive.google.com/open?id=1pZN09bDGBty4HJe60MVj6ivgtj-K6WwH

DinoMan commented 4 years ago

Yeah, that is a plausible result. Some of the videos won't be very good; starting from a frame with a closed mouth can sometimes give a slightly better result.

avisekiit commented 4 years ago

Ok. Thanks for all the help. Closing this issue now. Once again, congrats on the awesome work!!!

avisekiit commented 4 years ago

Dear @DinoMan Just re-opening this thread to confirm some observations. I concatenated the sentences of a given subject into one compact .wav file and then fed a frame of that same person to the network. For the first 4-5 seconds the visual quality is OK, but after that the visual sequence is completely out of sync. Do you think the model will not work if we just stitch the audio files together? Here are two output video links. Keenly looking forward to your expert thoughts on this:

Video 1 Link: https://drive.google.com/open?id=1Ma5RqCTbPIghaK8ee2mWB5k2KHiYjibo

Video 2 Link: https://drive.google.com/open?id=1VY5bhT5cfdTBxaX-KwMdRn1zOc2eVJ08

DinoMan commented 4 years ago

OK, I think I have an explanation. The model does tend to generate some artifacts the longer the sequence lasts, and in your case you are using very long sequences. This deterioration in performance could be partly because the model sees 3-second clips during training, not minutes. Although the model has been designed to deal with variable lengths, GRUs are perhaps not great in practice at extrapolating that far into the future (which I have also seen in other applications).

On a more practical note, a solution could be to generate multiple smaller clips but use the last generated frame as the starting point for the next generation. In this case, you would likely get a smooth result.
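A rough sketch of that chaining idea, assuming the animator accepts file paths for both the image and the audio (as in the README) and returns the video as a frames-first uint8 array; the clip length, temporary file names, and frame handling are assumptions and may need adapting to the library's actual output format:

```python
import numpy as np
import scipy.io.wavfile as wav
from PIL import Image
import sda

CLIP_SECONDS = 3  # roughly the clip length the model sees during training
va = sda.VideoAnimator(gpu=-1, model_path="crema")

rate, audio = wav.read("example/1057.wav")          # hypothetical long audio file
samples_per_clip = CLIP_SECONDS * rate

ref_image = "crema_refframes/ref_1057_cropped.bmp"  # initial reference frame
all_frames = []

for start in range(0, len(audio), samples_per_clip):
    # Write the current audio chunk to a temporary wav file and animate it.
    wav.write("chunk.wav", rate, audio[start:start + samples_per_clip])
    vid, aud = va(ref_image, "chunk.wav")
    all_frames.append(np.asarray(vid))

    # Seed the next chunk with the last generated frame (assumes H x W x 3 uint8;
    # transpose/rescale here if the library returns a different layout).
    last_frame = np.asarray(vid[-1], dtype=np.uint8)
    Image.fromarray(last_frame).save("seed.bmp")
    ref_image = "seed.bmp"

video = np.concatenate(all_frames, axis=0)
print("Total generated frames:", video.shape[0])
```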

avisekiit commented 4 years ago

@DinoMan Thanks for your reply. So, in your paper, when you report the metrics, do you run the models on each individual sentence and then average the performance across all videos?

DinoMan commented 4 years ago

Yes.

avisekiit commented 4 years ago

Thanks again for all your help. Closing the issue now. Cheers!!!