Thanks for your great work. I noticed that training uses randomly provided landmark input, but at inference the model can take audio only. Could you explain how this is achieved without degrading the results?
Thank you for your interest. During training, the pose input is randomly dropped, so the model also sees audio-only cases. That is why it works with only audio at inference.
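As a rough illustration of random pose dropping (a minimal sketch, not the repository's actual code): the function name `build_condition`, the constant `POSE_DIM`, the 0.5 drop probability, and the choice to replace a dropped pose with zeros are all assumptions made for this example. The real implementation may use a mask or a learned null embedding instead.

```python
import random

POSE_DIM = 4  # hypothetical size of the pose/landmark feature vector


def build_condition(audio_feat, pose_feat=None, drop_prob=0.5, rng=random):
    """Concatenate audio features with pose features.

    The pose part is replaced by zeros when the pose is missing
    (audio-only inference) or randomly dropped (training).
    Zero-filling is an assumption made for this sketch.
    """
    dropped = pose_feat is None or rng.random() < drop_prob
    pose_part = [0.0] * POSE_DIM if dropped else list(pose_feat)
    return list(audio_feat) + pose_part


# Training: pose is kept or dropped at random for each sample.
train_cond = build_condition([0.1, 0.2], [1.0, 2.0, 3.0, 4.0], drop_prob=0.5)

# Inference: no pose is given, so the model receives the same
# "dropped" input it already saw during training.
infer_cond = build_condition([0.1, 0.2], None)
```

Because the audio-only input at inference has the same form as the dropped-pose samples seen in training, the model is not evaluated on an input pattern it never encountered.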