OpenTalker / video-retalking

[SIGGRAPH Asia 2022] VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild
https://opentalker.github.io/video-retalking/
Apache License 2.0

sieve improvements #214

cparello opened this issue 4 months ago

cparello commented 4 months ago

https://www.sievedata.com/blog/fast-high-quality-ai-lipsyncing

Can the improvements made by sieve be done here?

savleharshad commented 4 months ago

They did some optimizations. How can we get an idea of what optimizations we can make to improve the model?

cparello commented 3 months ago

They explain it all in the doc, and the models can be updated from 512 to 1024 or 2048. That part I already did, but I have not had the chance to attempt the Sieve code improvements.

jryebread commented 2 months ago

Where do you see that they explained what changes they made? It says they would open source the results, but I don't see anything :(

cparello commented 2 months ago

Our Improvements

To improve this, we've introduced a series of optimizations on the original repository that greatly improve speed and performance.

The first optimization is smartly cropping around the face of the target speaker to avoid unnecessarily processing most of the video. Along with the ML models, there are a lot of computer vision operations like warping, inverse transforms, etc. in Video Retalking that are expensive to perform on the entire frame. We quickly identify the target speaker using batched RetinaFace, a very lightweight face detector. In many scenarios there are multiple faces, or even multiple predictions of the same face, so we isolate the largest face and, for now, treat it as the target speaker. Then we crop the video around the union of all detections of that face. This allows us to process a much smaller subsection of the video, which speeds up inference by up to 4x, especially on videos where the face is small and doesn't move much. In addition, establishing the target speaker crop allows us to enhance only that part of the video, rather than potentially generating artifacts around other sections of the frame.
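A rough sketch of that union-crop idea (not Sieve's actual code; `detect_faces` is a placeholder for the batched RetinaFace call and must return pixel-coordinate boxes):

```python
import numpy as np

def largest_face(boxes):
    """Pick the detection with the largest area; boxes are (x1, y1, x2, y2)."""
    if not boxes:
        return None
    areas = [(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes]
    return boxes[int(np.argmax(areas))]

def union_crop(frames, detect_faces, margin=0.1):
    """Crop every frame to the union of the per-frame largest-face boxes."""
    h, w = frames[0].shape[:2]
    x1, y1, x2, y2 = w, h, 0, 0
    for frame in frames:
        box = largest_face(detect_faces(frame))
        if box is None:
            continue  # frames with no detection do not shrink the crop
        x1, y1 = min(x1, box[0]), min(y1, box[1])
        x2, y2 = max(x2, box[2]), max(y2, box[3])
    if x2 <= x1 or y2 <= y1:
        return frames, (0, 0, w, h)  # no faces at all: leave frames untouched
    # pad the union box by a small margin, clamped to the frame borders
    dx, dy = int((x2 - x1) * margin), int((y2 - y1) * margin)
    x1, y1 = max(0, int(x1) - dx), max(0, int(y1) - dy)
    x2, y2 = min(w, int(x2) + dx), min(h, int(y2) + dy)
    return [f[y1:y2, x1:x2] for f in frames], (x1, y1, x2, y2)
```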

Second, we added batching to the stabilization step, making this step much faster when combined with the cropping above. We also removed enhancement of the stabilized video, as we found that its inclusion did not affect quality after we performed the cropping above.
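Batching a per-frame step looks roughly like this (illustrative only; `model` stands in for whatever stabilization network the pipeline uses and `frames` is a list of CHW float tensors):

```python
import torch

def run_batched(frames, model, batch_size=16, device="cuda"):
    """Apply a per-frame network to chunks of frames instead of one at a time."""
    outputs = []
    with torch.no_grad():
        for i in range(0, len(frames), batch_size):
            batch = torch.stack(frames[i:i + batch_size]).to(device)
            outputs.extend(out.cpu() for out in model(batch))
    return outputs
```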

When detecting facial landmarks, the original repository reinitialized the keypoint extractor multiple times and performed duplicate computations of landmarks during multiple steps of the process due to input resizing. We initialize the keypoint extractor once, and allow landmarks calculated before to be resized and reused during facial alignment. On low resolution inputs where the face is really small, we bypass parts of the alignment that actually made the output look worse, as the feature detection was much less accurate.
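A minimal sketch of the reuse idea (the `build_fn` constructor is a placeholder, not the repo's actual keypoint extractor API):

```python
import numpy as np

_kp_extractor = None

def get_keypoint_extractor(build_fn):
    """Build the landmark extractor once and reuse it across all steps."""
    global _kp_extractor
    if _kp_extractor is None:
        _kp_extractor = build_fn()
    return _kp_extractor

def rescale_landmarks(landmarks, src_size, dst_size):
    """Reuse landmarks computed at one resolution by scaling them to another
    instead of re-detecting them on the resized frame.

    landmarks: (N, 2) array of (x, y) points
    src_size, dst_size: (width, height) of the original and resized frames
    """
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    return np.asarray(landmarks, dtype=np.float32) * np.array([sx, sy], dtype=np.float32)
```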

Finally, we made the code more durable to edge cases where no faces are detected (by ignoring these frames), more than one face is detected (by detecting the largest face), or there is lots of movement from the speaker (by being smart about cropping).
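The first two edge cases can be handled with a small per-frame selection pass like the sketch below (again hypothetical; `detect_faces` is the same placeholder as above):

```python
def per_frame_target_box(frames, detect_faces):
    """Return one box (or None) per frame in a way that tolerates edge cases.

    - several detections -> keep the largest face as the target speaker
    - no detection       -> None, so the caller can copy the frame through
                            without trying to lip-sync it
    """
    boxes = []
    for frame in frames:
        dets = detect_faces(frame)
        if not dets:
            boxes.append(None)
            continue
        boxes.append(max(dets, key=lambda b: (b[2] - b[0]) * (b[3] - b[1])))
    return boxes
```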

In addition, we’ve optimized CPU and GPU memory usage throughout the code so that it fits on an L4 GPU with 8 vCPUs and 32 GB RAM, making it very cost-effective. We also added a low-resolution and low-fps option that allows up to an additional 4x speedup in scenarios where speed matters more than quality.
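A low-resolution / low-fps preprocessing step can be approximated with plain ffmpeg before running the pipeline (the 480p and 25 fps defaults here are illustrative, not the values Sieve uses):

```python
import subprocess

def downscale_for_speed(src_path, dst_path, max_height=480, fps=25):
    """Resize and drop frames up front, trading output quality for speed."""
    vf = f"scale=-2:{max_height},fps={fps}"
    subprocess.run(
        ["ffmpeg", "-y", "-i", src_path, "-vf", vf, "-c:a", "copy", dst_path],
        check=True,
    )
```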

wlcdbb commented 1 week ago

I just used Sieve and output a video. It does not improve quality much, although it's faster.