Elsaam2y / DINet_optimized

An optimized pipeline for DINet, reducing inference latency by up to 60% 🚀. Kudos to the authors of the original repo for this amazing work.

Retraining DINet #4

Closed Bebaam closed 8 months ago

Bebaam commented 10 months ago

Hey,

nice idea replacing deepspeech with wav2vec2; it should be much easier now to train DINet on different languages, using the models provided in PyTorch.

Did you retrain the models with wav2vec2, especially the syncnet? I tried training the syncnet with wav2vec2 (and deepspeech, of course), but was not successful.

Elsaam2y commented 10 months ago

Hi, yes I did. However, the model performance was not always consistent. To train the syncnet, I followed the approach presented in Wav2Lip, since the authors of DINet didn't provide enough information in this regard. Still, the results weren't always good, and that's why I thought about an alternative approach: keep the originally trained model (until the syncnet is revised) and train a mapping model to map the wav2vec2 features to the expected deepspeech features.
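The mapping model described above could look something like the following sketch. This is an assumption about the approach, not the repo's actual code: the layer sizes are illustrative, with 768 as the hidden-state width of the wav2vec2 base model and 29 as the per-frame dimension of the DeepSpeech character-logit features DINet consumes.

```python
import torch
import torch.nn as nn

# Hypothetical feature mapper: projects wav2vec2 hidden states (768-dim for
# the base model) to 29-dim DeepSpeech-style features. Architecture and
# hidden size are illustrative, not taken from the repo.
class Wav2Vec2ToDeepSpeech(nn.Module):
    def __init__(self, in_dim=768, out_dim=29, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        # x: (batch, frames, in_dim) -> (batch, frames, out_dim)
        return self.net(x)

mapper = Wav2Vec2ToDeepSpeech()
dummy = torch.randn(1, 9, 768)  # 9 audio frames of wav2vec2 base features
out = mapper(dummy)
print(out.shape)  # torch.Size([1, 9, 29])
```

Such a mapper can be trained with a simple regression loss (e.g. MSE) against deepspeech features extracted from the same audio, which lets the originally trained DINet stay untouched.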

Bebaam commented 10 months ago

Yes, I also used the Wav2Lip approach. Did you use the BCE loss as well? I wasn't able to reduce the loss below 0.69, although I used HDTF+MEAD, and in another attempt the same data with which I was able to train the original Wav2Lip syncnet (AVSpeech). Were you able to reduce the loss below 0.69 (if you used BCE)? Did you only use HDTF+MEAD? I am still searching for what I am doing wrong :D
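For context, the 0.69 plateau mentioned above is exactly the chance-level floor of BCE: a classifier that always outputs probability 0.5 on balanced in-sync/off-sync pairs has a loss of ln(2) ≈ 0.693, so a loss stuck there means the syncnet has learned nothing yet.

```python
import math

# BCE for a prediction of p = 0.5 is -log(0.5) = ln(2), regardless of the
# true label: this is the "stuck at 0.69" value seen during syncnet training.
p = 0.5
bce = -math.log(p)
print(round(bce, 4))  # 0.6931
```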

Elsaam2y commented 10 months ago

Yes, I used BCE loss, but I have to say that the results were not always convincing.

Convergence of the syncnet is a common problem. In my case it took a few days before the training losses started to converge. You need to keep it training for some more time. Another option is to increase the dataset size.

In my case I used HDTF only, and maybe this is the reason I didn't always get good results. I kept training for a long time; convergence wasn't great, but at the same time it wasn't stuck at 0.69. An idea I am planning to test is retraining the syncnet on AVSpeech. It is a bigger and more diverse dataset and could help improve the model's convergence.
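The Wav2Lip-style syncnet loss discussed in this thread can be sketched as follows. This is an assumption based on the public Wav2Lip training code, not this repo's exact script: cosine similarity between the audio and face embeddings is treated as the probability that the pair is in sync, then penalized with BCE.

```python
import torch
import torch.nn.functional as F

# Sketch of a Wav2Lip-style sync loss (an assumption, not this repo's code):
# the cosine similarity of audio and face embeddings serves as the "in sync"
# probability. Wav2Lip's encoders end in ReLU, so the similarity is >= 0;
# the clamp keeps it in a valid range for BCE either way.
def cosine_bce_loss(audio_emb, face_emb, labels):
    # audio_emb, face_emb: (batch, dim); labels: (batch, 1), 1.0 = in sync
    sim = F.cosine_similarity(audio_emb, face_emb).unsqueeze(1)
    prob = sim.clamp(min=1e-7, max=1 - 1e-7)
    return F.binary_cross_entropy(prob, labels)

torch.manual_seed(0)
audio = torch.relu(torch.randn(4, 512))  # stand-in for encoder outputs
face = torch.relu(torch.randn(4, 512))
labels = torch.tensor([[1.0], [0.0], [1.0], [0.0]])
loss = cosine_bce_loss(audio, face, labels)
print(loss.item())
```

Training then alternates balanced in-sync and off-sync (time-shifted) pairs; with untrained encoders the similarity hovers near a constant and the loss sits at the ln(2) ≈ 0.69 floor until the embeddings start to separate.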

Bebaam commented 10 months ago

Ok, thank you for the insight. I'll keep training for some more days then. The loss to aim for should be below 0.25, as in the Wav2Lip repo.

The AVSpeech dataset is huge and needs a lot of filtering. There are many out-of-sync or low-quality files in many languages, so achieving good convergence may be difficult.
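A filtering pass like the one described above could be sketched as follows. Everything here is illustrative: the sync-confidence score would come from some external scorer (e.g. a pretrained SyncNet run over each clip), and the thresholds are placeholders, not values from either repo.

```python
# Hypothetical filtering pass for an in-the-wild dataset like AVSpeech:
# drop clips whose precomputed sync confidence or duration falls below a
# threshold. Score source and thresholds are illustrative assumptions.
def filter_clips(clips, min_sync_conf=5.0, min_seconds=1.0):
    # clips: list of dicts like {"path": ..., "sync_conf": ..., "duration": ...}
    return [
        c for c in clips
        if c["sync_conf"] >= min_sync_conf and c["duration"] >= min_seconds
    ]

clips = [
    {"path": "a.mp4", "sync_conf": 7.2, "duration": 3.0},
    {"path": "b.mp4", "sync_conf": 1.1, "duration": 2.5},  # out of sync
    {"path": "c.mp4", "sync_conf": 6.0, "duration": 0.4},  # too short
]
kept = filter_clips(clips)
print([c["path"] for c in kept])  # ['a.mp4']
```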

Elsaam2y commented 10 months ago

Regarding the dataset preprocessing, you can have a look at this repo:

https://github.com/primepake/wav2lip_288x288

He trained on AVSpeech and I believe he gave some insights about the preprocessing. I will also keep you posted if I start with it, and will share any preprocessing scripts.

pgyilun commented 10 months ago

In my case, with about 25 hours of data and the BCE loss, I ran into overfitting when the test loss was around 0.48.

Elsaam2y commented 8 months ago

Feel free to reopen it if needed.