Closed. davidmartinrius closed this issue 1 year ago.
Hi,
No, 5 videos of 10-30 s each would be too few. In this case it is better to add your videos to the dataset and retrain the model. To speed up the process, you can start from the saved checkpoints and train only the final fine stage.
Also I noticed when you are in the step "6. Extracting deepspeech features from all audios and saving features..." you still use deepspeech. Is that right?
Yes, that's right. This is mainly to avoid retraining the model on a different audio feature extractor and losing quality. During inference we use the wav2vec model and map the extracted features to the ones expected from DeepSpeech. This speeds up inference significantly.
Please let me know if you face any issues. Thanks.
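For readers wondering what "mapping wav2vec features to the DeepSpeech ones" could look like in practice, here is a minimal sketch. It is not the repository's implementation, and the thread does not say how the mapping is actually done. It assumes time-aligned pairs of frames from both extractors are available and fits a plain linear least-squares map between them. The feature sizes (`SRC_DIM`, `DST_DIM`), the helper names, and the synthetic data are all hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical feature sizes, not taken from this thread.
SRC_DIM = 768  # assumed width of a wav2vec-style frame feature
DST_DIM = 29   # assumed width of a DeepSpeech-style frame feature


def fit_linear_map(src, dst):
    """Least-squares fit of dst ~ [src, 1] @ W for time-aligned frames."""
    ones = np.ones((src.shape[0], 1))
    src_aug = np.hstack([src, ones])  # append a bias column
    W, *_ = np.linalg.lstsq(src_aug, dst, rcond=None)
    return W


def apply_linear_map(src, W):
    """Project source features into the target feature space."""
    ones = np.ones((src.shape[0], 1))
    return np.hstack([src, ones]) @ W


# Synthetic stand-in data. Real use would pair features extracted from the
# same audio by both extractors, aligned frame by frame.
rng = np.random.default_rng(0)
n_frames = 2000
src_train = rng.normal(size=(n_frames, SRC_DIM))
true_W = rng.normal(size=(SRC_DIM + 1, DST_DIM))
noise = 0.01 * rng.normal(size=(n_frames, DST_DIM))
dst_train = apply_linear_map(src_train, true_W) + noise

W = fit_linear_map(src_train, dst_train)

# At inference time only the source extractor runs; the map supplies
# features shaped like the ones the model was trained on.
src_new = rng.normal(size=(50, SRC_DIM))
mapped = apply_linear_map(src_new, W)
print(mapped.shape)
```

A linear map is the simplest possible choice here. A small learned network trained on the same kind of paired features would be the natural next step if a linear fit turned out to be too coarse.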
OK, so for now I am going to retrain the final stage of the model with the HDTF dataset plus my videos.
Thank you
@davidmartinrius how did the fine-tuning go?
Hi @9bitss, it simply didn't happen. Until there is a clear explanation of how to train Syncnet, I am not willing to do it. I have already seen quite a few people say they wasted many hours training it without satisfactory results.
Hi @davidmartinrius, same here. A lot of money was spent on an A100 GPU to train on HDTF and my custom dataset, and the result was not good. The only thing that seems promising is to train on my own dataset starting from the latest checkpoint. It does a good job of reducing the inpainting issues, but then the lip movements are not as good as with the original model.
Hello!
Could you please explain how to fine-tune a DINet checkpoint for a specific target? I know the process may be similar to training, but I don't know how to do it for a small dataset. My dataset consists of 5 videos, each only about 10-30 seconds long. I don't know whether this will be enough to fine-tune the provided checkpoint or whether I will need longer videos.
Also I noticed when you are in the step "6. Extracting deepspeech features from all audios and saving features..." you still use deepspeech. Is that right?
What would be the recipe for fine-tuning?
Thank you!