Bebaam closed this issue 1 year ago.
Hi, yes I did. However, the model performance was not always consistent. To train the syncnet, I followed the approach presented in Wav2Lip, since the authors of DINet didn't provide enough information in this regard. Still, the results weren't always good, and that's why I thought about an alternative approach: keeping the originally trained model (until I revise the syncnet again) and training a mapping model to map the wav2vec features to the expected DeepSpeech features.
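Roughly what I have in mind for the mapping model, as a minimal sketch only: the layer sizes, the 768-dim input (wav2vec2 base) and the 29-dim target (DeepSpeech logits) are my own assumptions, and the temporal alignment between the two feature rates still has to be handled separately.

```python
import torch
import torch.nn as nn


class Wav2VecToDeepSpeech(nn.Module):
    """Maps wav2vec2 frame features to DeepSpeech-shaped features.

    Assumes wav2vec2 base output of shape (B, T, 768) and a 29-dim target,
    matching the per-frame dimension of the DeepSpeech features DINet expects.
    """

    def __init__(self, in_dim=768, out_dim=29, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, wav2vec_feats):      # (B, T, 768)
        return self.net(wav2vec_feats)     # (B, T, 29)


# Training would regress against DeepSpeech features extracted from the
# same audio, e.g. with an L1 loss (mapper and the feature tensors are
# placeholders here):
# loss = torch.nn.functional.l1_loss(mapper(w2v_feats), ds_feats)
```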
Yes, I also used the Wav2Lip approach. Did you use the BCE loss as well? I wasn't able to reduce the loss below 0.69, although I used HDTF+MEAD, and in another attempt the same data with which I was able to train the original Wav2Lip syncnet (AVSpeech). Were you able to reduce the loss below 0.69 (if you used BCE)? Did you only use HDTF+MEAD? I am still searching for what I am doing wrong :D
Yes, I did use the BCE loss, but I have to say that the results were not always convincing.
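For reference, by BCE I mean the Wav2Lip-style setup: cosine similarity between the audio and face embeddings, fed to BCE against an in-sync/off-sync label (the 0.69 we both see is just ln 2, i.e. chance level). A minimal sketch; the two encoders and their output embeddings are placeholders for whatever your branches produce:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

bce = nn.BCELoss()


def sync_bce_loss(audio_emb, face_emb, labels):
    """Wav2Lip-style sync loss.

    audio_emb, face_emb: (B, D) embeddings from the audio and face branches.
    labels:              (B, 1) float tensor, 1.0 for in-sync pairs, 0.0 for off-sync pairs.
    """
    a = F.normalize(audio_emb, p=2, dim=1)
    v = F.normalize(face_emb, p=2, dim=1)
    # Cosine similarity, clamped into (0, 1] so BCELoss gets a valid probability
    # (with ReLU outputs in the encoders the similarity is non-negative anyway).
    d = torch.clamp(F.cosine_similarity(a, v), min=1e-6, max=1.0)
    return bce(d.unsqueeze(1), labels)


# Usage (hypothetical encoders):
# audio_emb = audio_encoder(mels)
# face_emb = face_encoder(frames)
# loss = sync_bce_loss(audio_emb, face_emb, labels)
```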
The convergence of the syncnet is a common problem. In my case it took a few days before I started seeing the syncnet training losses converge. You need to keep training for some more time. Another option is to increase the dataset size.
In my case I used HDTF only, and maybe this is the reason I didn't always get good results. I kept training for a long time; the convergence wasn't great, but at the same time it wasn't stuck at 0.69. An idea I am planning to test is to retrain the syncnet on AVSpeech. This is a bigger and more diverse dataset and could help improve the model's convergence.
Ok, thank you for the insight. I'll keep training for some more days then. The loss to aim for should be below 0.25, as in the Wav2Lip repo.
The AVSpeech dataset is huge and needs a lot of filtering. There are a lot of out-of-sync or low-quality files in many languages, so getting good convergence may be difficult.
Regarding the dataset preprocessing, you can have a look at this repo:
https://github.com/primepake/wav2lip_288x288
He has trained on AVSpeech, and I believe he gave some insights about the preprocessing. I will also keep you posted if I start with it and will share any preprocessing scripts.
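As a rough idea of the kind of filtering I mean: score each clip for audio-visual sync and drop the bad ones. This is only a sketch; the `scores.json` format is a hypothetical placeholder for whatever your sync-scoring step produces (per-clip AV offset and confidence), and the thresholds are guesses:

```python
import json
from pathlib import Path

# Hypothetical input: scores.json maps clip path -> [av_offset_in_frames, confidence],
# produced by whatever sync-scoring step you run (e.g. a SyncNet evaluation pass).
MAX_ABS_OFFSET = 1      # tolerated AV offset in frames (assumption)
MIN_CONFIDENCE = 5.0    # confidence threshold (assumption)

scores = json.loads(Path("scores.json").read_text())

kept = [
    clip
    for clip, (offset, conf) in scores.items()
    if abs(offset) <= MAX_ABS_OFFSET and conf >= MIN_CONFIDENCE
]

Path("filelist_filtered.txt").write_text("\n".join(kept))
print(f"kept {len(kept)} / {len(scores)} clips")
```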
In my case, about 25 hours of data. I used the BCE loss; overfitting was encountered when the test loss was about 0.48.
Feel free to reopen it if needed.
Hi,
It's a better idea to use wav2vec, as it is feature-rich compared to DeepSpeech.
Did you try training DINet with the wav2vec features mapped to the DeepSpeech shape?
I tried the syncnet training with the mel spectrogram as the audio feature, and the sync loss was less than 0.03. But I am facing issues while integrating it into DINet (either by replacing the audio encoder or by changing the existing audio encoder's input shape to match the mel spectrogram). Did you try to use the mel features (as in Wav2Lip) for audio processing at any point in your experiments?
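To make the integration question concrete, this is roughly the shape I'm working with, approximating Wav2Lip's audio settings (16 kHz, 80 mels, hop 200, so 16 mel steps per 0.2 s, i.e. 5 frames at 25 fps; the actual repo's audio.py additionally does preemphasis and dB normalization). The small encoder is only a placeholder to show the expected input shape, not DINet's actual audio encoder:

```python
import librosa
import numpy as np
import torch
import torch.nn as nn


def mel_features(wav_path, sr=16000, n_fft=800, hop=200, n_mels=80):
    """Log-mel spectrogram with Wav2Lip-like settings (values are assumptions)."""
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels
    )
    return np.log(mel + 1e-5)  # (80, T_mel), ~80 mel steps per second


class MelEncoder(nn.Module):
    """Placeholder audio encoder taking (B, 1, 80, 16) mel windows."""

    def __init__(self, out_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1),   # -> (B, 32, 40, 8)
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1),  # -> (B, 64, 20, 4)
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),                    # -> (B, 64, 1, 1)
        )
        self.fc = nn.Linear(64, out_dim)

    def forward(self, mel_window):                      # (B, 1, 80, 16)
        x = self.conv(mel_window).flatten(1)
        return self.fc(x)
```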
Hey,
Nice idea, replacing DeepSpeech with wav2vec2; it should make it much easier to train DINet on different languages, using the models provided in PyTorch (a quick sketch below).
Did you retrain the models with wav2vec2, especially the syncnet? I tried the syncnet using wav2vec (and DeepSpeech, of course), but was not successful.
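For anyone else trying this, a quick sketch of pulling wav2vec2 features from the bundled torchaudio model. Which layer to take and how to downsample the ~50 Hz features to the 25 fps video are open choices, not settled answers:

```python
import torch
import torchaudio

# Pretrained wav2vec2 base; other bundles exist in torchaudio.pipelines
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("speech.wav")
if sr != bundle.sample_rate:  # wav2vec2 expects 16 kHz
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    # List of per-transformer-layer feature tensors, each of shape (1, T, 768)
    features, _ = model.extract_features(waveform)

feats = features[-1]  # last layer; which layer works best is an open question
# T corresponds to roughly 50 feature frames per second (20 ms stride), so the
# features still need to be resampled/windowed to match the 25 fps video frames.
print(feats.shape)
```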