Also, another question: I found that some other projects like geneface directly use extracted hubert/wav2vec features to train syncnet. Could that idea be used here?
Hi, in this repo I didn't RETRAIN the DINet model with a new syncnet. Instead I am using the originally trained model but with a new mapping to avoid using DeepSpeech, since the author used the first version, which is too slow. The reason for doing so is actually the syncnet training: I couldn't get the same results as the original paper, and hence I decided to avoid retraining the syncnet. But during my earlier experiments I used the syncnet training from wav2lip, or more specifically wav2lip_288*288. I might give the syncnet another try later, because I want to drop the DINet and mapping models and train another model from scratch with a different audio feature extraction that covers all languages. Will keep you posted if I take further steps in this direction.
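The mapping model itself is not shown in this thread. As a rough illustration only, the general idea described above would be a small network that translates features from a faster audio front-end into the 29-dim DeepSpeech feature space the pretrained DINet expects, so the original generator weights can be reused. The input dimension (assumed here to be 768-dim wav2vec-style features) and the layer sizes are made up for illustration and are not this repo's actual mapping model.

```python
import torch
import torch.nn as nn

class AudioFeatureMapper(nn.Module):
    """Hypothetical sketch: map 768-dim features from a faster audio encoder
    into 29-dim DeepSpeech-like features that the pretrained DINet consumes.
    Input dim and layer sizes are assumptions, not this repo's actual code."""
    def __init__(self, in_dim=768, out_dim=29, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, feats):       # feats: (batch, frames, in_dim)
        return self.net(feats)      # (batch, frames, out_dim)
```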
Also, another question: I found that some other projects like geneface directly use extracted hubert/wav2vec features to train syncnet. Could that idea be used here?
Yes, definitely. However, I was recently trying to avoid using wav2vec because it doesn't work well with non-Latin languages. But we could use the same concept, either with wav2vec or maybe the latest version of DeepSpeech, to train syncnet.
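For reference, extracting frame-level wav2vec features is straightforward with the HuggingFace `transformers` implementation. The model name and the 16 kHz sampling rate below are the usual wav2vec 2.0 defaults, not something this repo prescribes:

```python
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# wav2vec 2.0 base produces 768-dim features at roughly 50 frames per second
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

wav, sr = torchaudio.load("sample.wav")
wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0)  # mono, 16 kHz

inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    feats = model(**inputs).last_hidden_state  # (1, frames, 768)
print(feats.shape)
```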
However, it seems that DeepSpeech can output different dimensions for different languages. For example, it outputs 768 for Chinese and 1024 for English. Of course, this happens with wav2vec as well. As I've seen in geneface, they use landmarks instead of full images to train syncnet; I'm just guessing it could be easier to train syncnet that way. As far as I understand, the key point is training a syncnet to help the generator, whatever the implementation is. And the syncnet is really hard to train.
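For the landmark-based idea mentioned above, a minimal sketch could look like the following. The landmark count, window length, audio dimension, and layer sizes are illustrative assumptions, not geneface's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LandmarkSyncNet(nn.Module):
    """Sketch: embed a short window of mouth landmarks and the matching audio
    feature window, then score sync with cosine similarity (wav2lip-style)."""
    def __init__(self, n_landmarks=20, win=5, audio_dim=29, embed=256):
        super().__init__()
        self.lm_encoder = nn.Sequential(
            nn.Flatten(),                               # (B, win * n_landmarks * 2)
            nn.Linear(win * n_landmarks * 2, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, embed),
        )
        self.audio_encoder = nn.Sequential(
            nn.Flatten(),                               # (B, win * audio_dim)
            nn.Linear(win * audio_dim, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, embed),
        )

    def forward(self, landmarks, audio):
        # landmarks: (B, win, n_landmarks, 2), audio: (B, win, audio_dim)
        v = F.normalize(self.lm_encoder(landmarks), dim=1)
        a = F.normalize(self.audio_encoder(audio), dim=1)
        return (v * a).sum(dim=1)                       # cosine similarity in [-1, 1]
```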
However, it seems that DeepSpeech can output different dimensions for different languages. For example, it outputs 768 for Chinese and 1024 for English.
@mystijk Which version of DeepSpeech are you referring to?
Sorry, I made a mistake here: "it outputs 768 for Chinese and 1024 for English" should be "it generates a 256-dim output for Chinese and a 29-dim output for English." I used 0.9.3. I referred to https://github.com/FengYen-Chang/DeepSpeech.OpenVINO to get the DeepSpeech features. And it is really difficult to make DeepSpeech/wav2vec work; maybe I should go back to using mel audio features. I mean, the color_syncnet is not that hard in DINet.
Also, DeepSpeech uses an LSTM to create audio features, so it might be slower than computing mels.
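For reference, the 80-channel mels used in wav2lip can be computed directly with librosa. The STFT parameters below (16 kHz, 800-sample FFT, 200-sample hop) are the common wav2lip defaults and may need adjusting for DINet:

```python
import librosa
import numpy as np

wav, sr = librosa.load("sample.wav", sr=16000)  # resample to 16 kHz

# 80-channel mel spectrogram, ~80 frames per second with hop_length=200
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=800, hop_length=200, win_length=800, n_mels=80
)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))  # log-compressed mels
print(log_mel.shape)  # (80, frames)
```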
@mystijk Okay, now this makes sense :). I was confused by the other dimensions.
And yes, DeepSpeech is much slower than mel spectrograms.
@mystijk I tried using mels to extract the audio features and modified the audio encoder in DINet to work with 80-channel mels instead of the 29-dim DeepSpeech features. The results are not good and the mouth region is not reconstructed well. In some sample videos there is no lip sync at all. Did you face similar issues while using mel spectrograms on DINet?
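The exact encoder change above is not shown in the thread; as a rough illustration only, an audio encoder swapped in to take 80-channel mels instead of 29-dim DeepSpeech features might look like the sketch below. The convolution widths and the output embedding size are assumptions, not DINet's actual layers:

```python
import torch
import torch.nn as nn

class MelAudioEncoder(nn.Module):
    """Illustrative sketch: encode a window of 80-channel mels into a single
    audio embedding, as a drop-in for a 29-dim DeepSpeech-feature encoder.
    Channel sizes and output dim are assumptions, not DINet's real values."""
    def __init__(self, n_mels=80, out_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128),
            nn.ReLU(inplace=True),
            nn.Conv1d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool1d(1),          # pool over time
        )
        self.proj = nn.Linear(128, out_dim)

    def forward(self, mels):                  # mels: (B, 80, T)
        x = self.conv(mels).squeeze(-1)       # (B, 128)
        return self.proj(x)                   # (B, out_dim)
```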
As far as I can see in DINet, the syncnet training is similar to wav2lip. However, it is not easy for me to reimplement the training code, so could you please share the training code?
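For context on what "similar to wav2lip" means in practice: wav2lip's color_syncnet is trained with a cosine-similarity BCE loss over positive (in-sync) and negative (off-sync) audio/frame pairs. The loop below is only a schematic of that recipe; `audio_encoder`, `face_encoder`, and `loader` are placeholders, not the actual DINet or wav2lip training code:

```python
import torch
import torch.nn.functional as F

def cosine_bce_loss(audio_emb, face_emb, labels):
    """wav2lip-style sync loss: cosine similarity squashed to [0, 1],
    then binary cross-entropy against in-sync / off-sync labels."""
    sim = F.cosine_similarity(audio_emb, face_emb, dim=1)
    prob = (sim + 1.0) / 2.0                  # map [-1, 1] -> [0, 1]
    return F.binary_cross_entropy(prob, labels)

def train_syncnet(audio_encoder, face_encoder, loader, epochs=10, lr=1e-4):
    """Schematic training loop; the encoders and dataloader are placeholders
    for your own syncnet branches and dataset."""
    params = list(audio_encoder.parameters()) + list(face_encoder.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for mels, frames, labels in loader:   # labels: 1.0 in-sync, 0.0 off-sync
            a = audio_encoder(mels)           # (B, D) audio embeddings
            v = face_encoder(frames)          # (B, D) mouth-crop embeddings
            loss = cosine_bce_loss(a, v, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
```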