Open paulovasconcellos-hotmart opened 1 week ago
I think this is a non-issue. The code you are referring to is for training the model. I also feel that taking a random reference face to predict the label is actually a harder task than taking the previous frame. In addition, the previous frame would be too similar to the current frame, which would make it unusable for prediction.
The approach of using a random reference image is inspired by similar methodologies used in projects like wav2lip and video-retalking. This technique has been found to be effective in preventing the model from taking shortcuts during the training phase.
The choice to use a random reference image instead of the immediately preceding frame is deliberate and is aimed at making the model more robust. If the reference image were too similar to the ground truth, the model might indeed "cheat" by simply copying features from the near-identical reference image, rather than learning to generate the mouth shape from the audio features. This could lead to a model that does not generalize well to new, unseen data.
By introducing variability with a random reference image, we encourage the model to focus on the audio features and learn the complex mapping from audio to visual speech synthesis. This approach can lead to a more flexible and accurate model that is better equipped to handle variations in speech and mouth movements.
We understand that this might make the training process more challenging, but it is a necessary step to ensure that the model learns to perform the task correctly and does not rely on shortcuts.
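To make the difference between the two sampling strategies concrete, here is a minimal sketch. It is not the actual MuseTalk DataLoader code: the function names `pick_reference_index` and `pick_previous_index` and the `min_gap` parameter are hypothetical and exist only for illustration.

```python
import random


def pick_reference_index(num_frames, target_idx, min_gap=1, rng=random):
    """Pick a random reference frame at least `min_gap` frames from the target.

    Hypothetical helper for illustration; not taken from the MuseTalk code.
    """
    if not 0 <= target_idx < num_frames:
        raise IndexError("target_idx is out of range")
    # Keep only frames far enough from the target that the mouth shape
    # is unlikely to be a near copy of the ground truth.
    candidates = [i for i in range(num_frames) if abs(i - target_idx) >= min_gap]
    if not candidates:
        raise ValueError("video is too short for the requested gap")
    return rng.choice(candidates)


def pick_previous_index(target_idx):
    """The alternative raised in this issue: use the immediately preceding frame."""
    return max(target_idx - 1, 0)
```

With `pick_previous_index`, the reference is almost identical to the target, so copying it is an easy shortcut. With `pick_reference_index`, the reference still carries identity and pose information, but the mouth shape has to come from the audio.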
I was checking the DataLoader code and wondered why MuseTalk uses a random frame from the video as the reference instead of the frame immediately preceding the current one.