DinoMan / speech-driven-animation


How to handle LRW dataset where speakers move significantly #22

Closed pcgreat closed 4 years ago

pcgreat commented 5 years ago

When handling datasets like GRID, where speakers barely move, it's easy to align the facial landmarks to fixed points during preprocessing (e.g. based on the landmarks of the first frame of the video). However, in datasets like LRW the speakers move significantly while talking, so aligning based on the first frame is meaningless. Aligning frame by frame is not a good choice either, as it makes the video jitter too much. So I wonder how you did the preprocessing for the LRW dataset?
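For context, the frame-by-frame alignment described above typically means estimating a similarity transform (scale, rotation, translation) from each frame's landmarks to a fixed template; any per-frame landmark noise then leaks into the warp, which is the source of the jitter. This is not the repository's code, just a generic least-squares (Umeyama) sketch:

```python
import numpy as np

def similarity_transform(src, dst):
    """Estimate a similarity transform mapping src landmarks (N, 2)
    onto dst landmarks (N, 2) by least squares (Umeyama's method).
    Returns (A, t) such that dst ~= src @ A.T + t."""
    src_mean = src.mean(axis=0)
    dst_mean = dst.mean(axis=0)
    src_c = src - src_mean
    dst_c = dst - dst_mean
    # Cross-covariance between centred point sets.
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)
    # Guard against reflections (keep det(R) = +1).
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, d])
    R = U @ D @ Vt
    scale = np.trace(np.diag(S) @ D) / src_c.var(axis=0).sum()
    t = dst_mean - scale * R @ src_mean
    return scale * R, t
```

Estimating this transform against the first frame's landmarks gives the GRID-style alignment; estimating it per frame against a canonical template gives the jittery LRW-style alignment the question describes.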

DinoMan commented 4 years ago

I did the preprocessing in the same way as for the other datasets (I use the face-processor library available on my GitHub). You are correct that LRW has jitter due to the way it was aligned, and unfortunately there is no easy way of removing it. I have tried to stabilise the videos, but this only helps a little. Because LRW has this jitter, the model will end up modelling it and you will see it (to some extent) in the generated videos. In the end, I'm afraid you have to live with a little jitter for the LRW model.
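One common way to attempt the stabilisation mentioned above is to smooth the landmark trajectories temporally before computing the per-frame alignment, so high-frequency detection noise doesn't drive the warp. This is not the face-processor implementation; `smooth_landmarks` and `window` are hypothetical names in a minimal moving-average sketch:

```python
import numpy as np

def smooth_landmarks(landmarks, window=5):
    """Temporally smooth a (T, N, 2) landmark sequence with a centred
    moving average to damp frame-to-frame jitter before alignment.
    An odd `window` keeps the average centred on each frame."""
    T = len(landmarks)
    half = window // 2
    out = np.empty_like(landmarks, dtype=float)
    for t in range(T):
        # Clamp the window at the sequence boundaries.
        lo, hi = max(0, t - half), min(T, t + half + 1)
        out[t] = landmarks[lo:hi].mean(axis=0)
    return out
```

As the reply notes, this only helps a little: smoothing removes detector noise but also lags genuine head motion, and any residual jitter in the training videos is still learned by the model.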