TMElyralab / MuseTalk

MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting

Potential Test-time Performance Improvement #143

Closed xiankgx closed 2 weeks ago

xiankgx commented 1 month ago

Right now it seems that, for inference, you take the current frame as the reference from which to predict a different lip shape, which can be suboptimal.

Perhaps what we could do instead is first encode the audio features (keys) for all the faces of the original video (values). Then, during inference, we take the audio feature of the current frame (query) and run a nearest-neighbor search: using the current frame's audio feature, we retrieve the most similar face and use that as the reference, as sketched below.
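A minimal sketch of that retrieval step, assuming per-frame audio features (e.g. Whisper embeddings) and aligned face crops have already been extracted offline; the array names and helper functions below are hypothetical, not part of the MuseTalk codebase:

```python
import numpy as np

# Assumed inputs (precomputed once, offline):
#   audio_feats: (N, D) array, one audio feature vector per source-video frame
#   face_frames: list of N aligned face crops from the source video

def build_reference_index(audio_feats: np.ndarray) -> np.ndarray:
    """L2-normalize per-frame audio features so a dot product equals cosine similarity."""
    norms = np.linalg.norm(audio_feats, axis=1, keepdims=True) + 1e-8
    return audio_feats / norms

def pick_reference_frame(query_feat: np.ndarray, index: np.ndarray) -> int:
    """Return the index of the source frame whose audio feature is most
    similar to the query frame's audio feature (nearest-neighbor search)."""
    q = query_feat / (np.linalg.norm(query_feat) + 1e-8)
    sims = index @ q  # (N,) cosine similarities
    return int(np.argmax(sims))

# Usage sketch:
# index = build_reference_index(audio_feats)                 # offline, once
# ref_id = pick_reference_frame(current_audio_feat, index)   # per frame at inference
# reference_face = face_frames[ref_id]                       # feed as the reference face
```

Since the index is built once offline, the per-frame cost at inference is a single matrix-vector product over N frames, which should be cheap relative to the UNet forward pass.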

alexLIUMinhao commented 1 month ago

Thank you for your suggestion. Indeed, we do not currently use audio features to retrieve the most similar face. However, might this have an impact on real-time performance?