Target Speech Representation Database

YuanxunLu / LiveSpeechPortraits

Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation (SIGGRAPH Asia 2021)

MIT License


Target Speech Representation Database #57

Closed: torphix closed this issue 2 years ago

torphix commented 2 years ago

Hi, thank you for the amazing library and open-source code.

It is helping me learn a lot. One question I have concerns the target speech representation database. Is it simply the embeddings of several utterances from the target speaker, with the input speech then mapped to the closest point among those embeddings?

E.g.: extract embeddings from 50 Obama utterances -> input an arbitrary speech sample -> map the embedding of that sample to the closest Obama representation

Thank you

YuanxunLu commented 2 years ago

Actually, it is the best linear combination of the K nearest samples, found by solving a least-squares optimization. What you describe is just the special case K=1.
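For readers who want to see the idea concretely, here is a minimal NumPy sketch of the projection described in the reply above. It is an illustration under stated assumptions, not the repository's actual code: the function name `project_onto_database`, the Euclidean distance metric, the default `k`, and the constraint that the weights sum to 1 (borrowed from locally linear embedding) are all assumptions made for this example.

```python
import numpy as np


def project_onto_database(query, database, k=5):
    """Approximate `query` by a linear combination of its k nearest database rows.

    query:    (d,) feature vector of the input speech frame.
    database: (n, d) feature vectors of the target speaker.
    Returns (weights, reconstruction), with weights of shape (k,).

    Assumption: the weights are constrained to sum to 1, as in locally
    linear embedding. The thread only says "least-squares optimization".
    """
    # Find the k nearest database entries by Euclidean distance (assumed metric).
    dists = np.linalg.norm(database - query, axis=1)
    idx = np.argsort(dists)[:k]
    neighbours = database[idx]  # (k, d), nearest first

    # K=1 reduces to plain nearest-neighbour lookup.
    if k == 1:
        return np.ones(1), neighbours[0]

    # Eliminate the sum-to-one constraint by substituting
    # w_0 = 1 - (w_1 + ... + w_{k-1}), which gives the unconstrained problem
    #   query - f_0  ~=  sum_i w_i * (f_i - f_0)
    A = (neighbours[1:] - neighbours[0]).T  # (d, k-1)
    b = query - neighbours[0]               # (d,)
    w_rest, *_ = np.linalg.lstsq(A, b, rcond=None)

    weights = np.concatenate(([1.0 - w_rest.sum()], w_rest))
    return weights, weights @ neighbours
```

Calling `project_onto_database(feature, database, k=1)` returns the single closest database entry, which matches the nearest-point mapping asked about in the question. Larger `k` blends several neighbours.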

torphix commented 2 years ago

Thank you. So the APC_feat_database is several thousand examples of the target speaker talking, embedded into feature space using the APC network?

YuanxunLu commented 2 years ago

Yes, you're right.