In the paper, you write: "To ensure high-quality training data, we underwent a data cleaning process that focused on retaining single-person speaking videos exhibiting strong lip and audio consistency".
How did you select videos with strong lip and audio consistency? Could you share some ideas?