KimGyeongsu opened 7 months ago
Hi, thanks for sharing the tips!
May I know the learning rate you eventually used? I guess many people, including myself, have limited compute and can't afford to tune it multiple times. It would be very helpful if you could share it!
Thank you
I used an initial learning rate of 1e-5!
Thank you for your quick reply! Sorry, but did you use 1e-5 only for the clip stage, or for every stage of frame training?
For all stages, I used 1e-5.
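For reference, setting a 1e-5 learning rate for every stage might look like the sketch below. This is a hypothetical illustration, not the actual DINet training code; the `nn.Linear` model is just a stand-in, and Adam is an assumption about the optimizer.

```python
import torch

# Stand-in model; in practice this would be the DINet generator
# (or SyncNet) for the current training stage.
model = torch.nn.Linear(4, 4)

# The same initial learning rate, 1e-5, reused for all stages
# as described above. Adam is an assumed choice of optimizer.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

print(optimizer.param_groups[0]["lr"])  # 1e-05
```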
Thanks for sharing that!
Did you train SyncNet from scratch or use the provided pre-trained model? Also, I thought SyncNet only had a clip mode, but your response above seems to mention a frame mode.
Similar to the DINet training code the author provided, I trained SyncNet from scratch. For the frame stage, the audio feature should be the DeepSpeech features from frames [n-2:n+3] if we use the n-th frame as the face feature.
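The frame-stage pairing above can be sketched as a simple slice: the n-th face frame is matched with a 5-frame window of DeepSpeech features. This is a minimal sketch, assuming the features are stored as one array per video with one row per frame; the feature dimension 29 (the usual DeepSpeech output size) and the function name are assumptions, not part of the original code.

```python
import numpy as np

def audio_window(ds_features: np.ndarray, n: int) -> np.ndarray:
    """Return the DeepSpeech features for frames [n-2, n+3),
    i.e. the 5-frame window centred on frame n (hypothetical helper)."""
    if n - 2 < 0 or n + 3 > len(ds_features):
        raise IndexError("window [n-2, n+3) falls outside the sequence")
    return ds_features[n - 2 : n + 3]

# 10 frames of fake DeepSpeech features, 29 dims each (assumed dim).
feats = np.arange(10 * 29, dtype=np.float32).reshape(10, 29)
win = audio_window(feats, 4)  # window for face frame n = 4
print(win.shape)  # (5, 29): frames 2, 3, 4, 5, 6
```

Frames near the clip boundaries need padding or skipping; the sketch simply raises on them.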
Hello, I also have a question. How can we determine whether the model has converged during DINet training?
@KimGyeongsu what loss function did you use for training?
I finally achieved a sync loss of ~0.2 on a private dataset with a simple modification. Please understand that I can't upload the training code because I belong to a company. I hope my advice is helpful.