voku682 closed this issue 11 months ago
Hello! Could you give me some details about your setup? Are you training your own model from scratch, or using a pretrained one?
If using the pretrained one, note that it performs very poorly on videos outside the CREMA-D dataset. You would need to either finetune the model on your data, or train a new one from scratch. Depending on your needs for generalisation and your dataset size, this can take anywhere from a few days to weeks.
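For reference, a minimal sketch of what finetuning from a checkpoint looks like in PyTorch (the `Model` class and checkpoint filename below are placeholders, not the actual names from this repo):

```python
import torch
import torch.nn as nn

# Placeholder architecture standing in for the real model.
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(128, 128)
        self.head = nn.Linear(128, 10)

    def forward(self, x):
        return self.head(self.backbone(x))

model = Model()
# Finetuning: load pretrained weights instead of random init, then keep
# training on your own data, typically with a smaller learning rate.
# model.load_state_dict(torch.load("multispeaker_checkpoint.pt"))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # smaller LR than from-scratch
```

From there the training loop is the same as training from scratch, just starting from the pretrained weights.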
Hey Dan, thanks for the quick response and for sharing such an amazing model. I am using the pre-trained multispeaker checkpoint. It would be great if you could clarify a few general questions about the model.
No worries, sure!
1: I believe it's overfitting. Because CREMA-D isn't a diverse dataset (by diverse I mean varied backgrounds; there aren't that many speakers in the grand scheme of things, and all the videos are relatively similar in style), the model performs poorly on more natural "in-the-wild" videos. To fix this, I would recommend training on a bigger dataset like VoxCeleb, but that would take quite a lot of resources and time, which is why I couldn't do it.
You are correct: the multi-speaker model can indeed be finetuned on a single speaker. We include both checkpoints simply because our first experiments consisted of getting the single-speaker model to work, so we decided to include it here as well.
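To make the single-speaker finetuning concrete, one common approach (just a sketch, assuming a PyTorch model; the layer names are illustrative, not from this repo) is to freeze most of the pretrained network and only update the final layers, so a small single-speaker dataset doesn't overwrite the multi-speaker features:

```python
import torch.nn as nn

def freeze_all_but(model, trainable_prefixes=("head",)):
    """Freeze every parameter except those whose name starts with a given prefix."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)

# Toy model standing in for the real multi-speaker architecture.
model = nn.Sequential()
model.add_module("backbone", nn.Linear(64, 64))
model.add_module("head", nn.Linear(64, 2))

freeze_all_but(model, ("head",))
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

How many layers to leave trainable depends on how much single-speaker data you have; with more data you can unfreeze more of the network.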
Hmmm, I'm not sure. When we were doing the experiments, we encountered a similar problem, where the output would degrade significantly after a few timesteps, even when testing on videos from the same dataset. Our problem, however, was the opposite: the model was underfitting, because of a bug where the attention layers were never being called during training. Once we fixed that and ensured attention was correctly implemented, the model began converging. In your case, though, I suspect the main issue is that it's overfit on CREMA-D.
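If anyone wants to sanity-check the same thing in their own setup, a quick way (a sketch assuming PyTorch; the module names are illustrative) is to attach forward hooks to the attention modules and confirm they actually fire during a forward pass:

```python
import torch
import torch.nn as nn

# Toy block standing in for the real network's attention layers.
class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(16, num_heads=2, batch_first=True)
        self.proj = nn.Linear(16, 16)

    def forward(self, x):
        x, _ = self.attn(x, x, x)  # self-attention
        return self.proj(x)

model = Block()

# Count how many times each attention module's forward() actually runs.
counts = {}
def make_hook(name):
    def hook(module, inputs, output):
        counts[name] = counts.get(name, 0) + 1
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.MultiheadAttention):
        module.register_forward_hook(make_hook(name))

model(torch.randn(2, 4, 16))
# If counts is empty after a forward/training step, the attention layers
# exist in the model but are never being called.
```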
Thanks! I will look into this and train the model on my own dataset to see if it solves the issue.
@voku682 I encountered the same problem. Was it solved after training the model on your own dataset?
I am running the inference script but getting somewhat unprocessed images like the one below. Only the first image in Generated_Frames is decent; the rest are all pixelated. No errors are thrown, so I don't know exactly what is going wrong. Any help would be appreciated!