Elsaam2y / DINet_optimized

An optimized pipeline for DINet, reducing inference latency by up to 60% 🚀. Kudos to the authors of the original repo for this amazing work.

could you please share the syncnet training code? #7

Closed: mystijk closed this issue 9 months ago

mystijk commented 1 year ago

As far as I can see in DINet, the syncnet training is similar to wav2lip. However, it is not easy for me to reimplement the training code, so could you please share it?
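For context, a minimal sketch of the wav2lip-style syncnet objective referred to here: cosine similarity between audio and face embeddings, trained with binary cross-entropy against in-sync/off-sync labels. The embedding sizes and the `cosine_bce_loss` helper below are illustrative assumptions, not DINet's actual code:

```python
import torch
import torch.nn.functional as F

def cosine_bce_loss(audio_emb, face_emb, y):
    """wav2lip-style sync loss: cosine similarity squashed into [0, 1]
    and trained with BCE against in-sync (1) / off-sync (0) labels."""
    d = F.cosine_similarity(audio_emb, face_emb)   # (batch,)
    d = (d + 1.0) / 2.0                            # map [-1, 1] -> [0, 1]
    return F.binary_cross_entropy(d.clamp(1e-7, 1 - 1e-7), y)

# Toy usage with random embeddings; real training pairs each window of face
# frames with either its true audio chunk (label 1) or a shifted one (label 0).
audio_emb = torch.randn(8, 512)
face_emb = torch.randn(8, 512)
labels = torch.randint(0, 2, (8,)).float()
print(cosine_bce_loss(audio_emb, face_emb, labels).item())
```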

mystijk commented 1 year ago

Also, another question: I found that some other projects, like GeneFace, directly use extracted HuBERT/wav2vec features to train the syncnet. Could that idea be used here?
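For reference, a minimal sketch of extracting wav2vec 2.0 features with HuggingFace `transformers`. The checkpoint name and the 16 kHz input are assumptions, and GeneFace's actual HuBERT pipeline may differ:

```python
import torch
import soundfile as sf
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Assumed checkpoint; other wav2vec 2.0 / HuBERT checkpoints work the same way.
model_name = "facebook/wav2vec2-base-960h"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name).eval()

# Load mono 16 kHz audio (these checkpoints expect 16 kHz; resample beforehand if needed).
wav, sr = sf.read("speech.wav")
inputs = extractor(wav, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    # last_hidden_state: (1, num_frames, 768) at roughly 50 frames per second,
    # which would then be aligned to video frames before feeding the syncnet.
    features = model(**inputs).last_hidden_state

print(features.shape)
```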

Elsaam2y commented 1 year ago

Hi. In this repo I didn't retrain the DINet model with a new syncnet. Instead, I am using the originally trained model, but with a new mapping to avoid using DeepSpeech, since the author used the first version, which is too slow. The reason for this is actually the syncnet training: I couldn't get the same results as the original paper, so I decided to avoid retraining the syncnet. During my earlier experiments, I used the syncnet training from wav2lip, or more specifically wav2lip_288x288. I might give the syncnet another try later, because I want to drop the DINet and mapping models and train another model from scratch using a different audio feature extraction to support all languages. I will keep you posted if I take further steps in this direction.
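For illustration only, a rough sketch of the general mapping idea: a small network that translates frames of a cheaper audio feature into the DeepSpeech-style features a pretrained lip-sync model expects. The `AudioFeatureMapper` class, its layer sizes, and the 80-to-29 dimensions are assumptions, not the repo's actual mapping model:

```python
import torch
import torch.nn as nn

class AudioFeatureMapper(nn.Module):
    """Hypothetical per-frame MLP mapping e.g. 80-dim mel frames to
    29-dim DeepSpeech-like features; dimensions are assumptions."""
    def __init__(self, in_dim=80, out_dim=29, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):        # x: (batch, frames, in_dim)
        return self.net(x)       # (batch, frames, out_dim)

mapper = AudioFeatureMapper()
dummy_mel = torch.randn(2, 9, 80)   # e.g. 9 audio frames per video frame
print(mapper(dummy_mel).shape)      # torch.Size([2, 9, 29])
```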

Elsaam2y commented 1 year ago

Also, another question: I found that some other projects, like GeneFace, directly use extracted HuBERT/wav2vec features to train the syncnet. Could that idea be used here?

Yes, definitely. However, I have recently been trying to avoid wav2vec because it doesn't work well with non-Latin languages. But we could use the same concept, either with wav2vec or maybe the latest version of DeepSpeech, to train the syncnet.

mystijk commented 1 year ago

However, it seems that DeepSpeech can output different dimensions for different languages. For example, it outputs 768 for Chinese and 1024 for English. Of course, the same happens with wav2vec as well. As I've seen in GeneFace, they use landmarks instead of the full image to train the syncnet; I guess it could be easier to train the syncnet that way. As far as I understand, the key point is training a syncnet that helps the generator, whatever the implementation is, and the syncnet is really hard to train.

Elsaam2y commented 1 year ago

However, it seems that DeepSpeech can output different dimensions for different languages. For example, it outputs 768 for Chinese and 1024 for English.

@mystijk Which version of DeepSpeech are you referring to?

mystijk commented 1 year ago

Sorry, I made a mistake here: "it outputs 768 for Chinese and 1024 for English" should be "it outputs 256-dim features for Chinese and 29-dim features for English". I used 0.9.3 and referred to https://github.com/FengYen-Chang/DeepSpeech.OpenVINO to get the DeepSpeech features. And it is really difficult to make DeepSpeech/wav2vec work; maybe I should go back to using mel audio features. I mean, color_syncnet is not that hard in DINet.
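For comparison, a minimal sketch of computing mel audio features with `librosa`. The hop/window/80-mel parameters below roughly follow wav2lip's audio config, but the exact values are an assumption:

```python
import librosa
import numpy as np

# Load audio resampled to 16 kHz mono.
wav, sr = librosa.load("speech.wav", sr=16000)

# 80-band mel spectrogram; a hop of 200 samples at 16 kHz gives 80 frames/s,
# which is then chunked per video frame for the syncnet.
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=800, hop_length=200, win_length=800, n_mels=80
)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))

print(log_mel.shape)  # (80, num_frames)
```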

mystijk commented 1 year ago

Also, DeepSpeech uses an LSTM to create the audio features, so it might be slower than computing mel spectrograms.

Elsaam2y commented 1 year ago

@mystijk Okay, now this makes sense :). I was confused by the other dimensions.

And yes, DeepSpeech is much slower than computing mel spectrograms.