DanBigioi / DiffusionVideoEditing

Official project repo for paper "Speech Driven Video Editing via an Audio-Conditioned Diffusion Model"

Inference generating weird output #7

Closed: voku682 closed this issue 11 months ago

voku682 commented 11 months ago

I am running the inference script but getting somewhat unprocessed images like the one below. Only the first image in Generated_Frames is decent; the rest are all pixelated. No errors are thrown, so I don't know exactly what is going wrong. Any help would be appreciated!

[attached image: res]

DanBigioi commented 11 months ago

Hello! Could you give me some details about your setup? Are you training your own model from scratch, or using a pretrained one?

If you are using the pretrained checkpoint, note that it performs very poorly on videos outside the CREMA-D dataset. You would need to either finetune the model on top of your data, or train a new one from scratch. Depending on your needs for generalisation and your dataset size, this can take anywhere from a few days to weeks.
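For reference, a minimal finetuning outline in PyTorch (this is not the repo's actual training script; the stand-in model, checkpoint name, and tensor shapes below are placeholders for the repo's real network and data pipeline):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the repo's audio-conditioned diffusion network; swap in the
# real model built from the repo's config. Everything below only outlines
# "load the pretrained weights, then keep training on your own data".
class DummyDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, noisy_frames, audio_feats):
        return self.net(noisy_frames)

model = DummyDenoiser()

# Load the released multispeaker checkpoint; strict=False tolerates key
# mismatches if your network definition differs slightly.
# state = torch.load("multispeaker_checkpoint.pth", map_location="cpu")
# model.load_state_dict(state, strict=False)

# Your own data: (frames, audio features) pairs, preprocessed the same way
# the CREMA-D training data was (crop, resolution, audio features).
frames = torch.randn(16, 3, 128, 128)
audio = torch.randn(16, 128)
loader = DataLoader(TensorDataset(frames, audio), batch_size=4, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # small LR for finetuning
model.train()
for x, a in loader:
    noise = torch.randn_like(x)
    pred = model(x + noise, a)                     # toy denoising objective
    loss = nn.functional.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```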

voku682 commented 11 months ago

Hey Dan, thanks for the quick response and for sharing such an amazing model. I am using the pretrained multispeaker checkpoint. It would be great if you could clarify some general questions about the model:

  1. Why does the multispeaker checkpoint/model perform poorly outside the CREMA-D dataset? Is it because of overfitting, or because of how the data is handled within the code (e.g. a different audio sample rate, image encoding, etc.)? (A quick sample-rate check is sketched after this list.)
  2. Why are there two different checkpoints (multispeaker and single speaker)? Can't the multispeaker model be fine-tuned with the single-speaker data?
  3. Is there any part of the code I should focus on to debug this issue? I feel like this may be a bug in the implementation (I will also raise a PR with the fix if I find the cause).
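For the sample-rate point in question 1, one quick sanity check is to confirm the test audio matches the rate used at training time. This is just a sketch; the 16 kHz target and file name are placeholders, so check the repo's preprocessing code for the real values:

```python
# Hypothetical check: compare the clip's native sample rate to the rate the
# model's audio features were computed at during training.
import librosa

TARGET_SR = 16000  # assumed training sample rate; verify against the repo

audio, sr = librosa.load("my_test_clip.wav", sr=None)  # keep the native rate
print(f"native sample rate: {sr} Hz")
if sr != TARGET_SR:
    audio = librosa.resample(audio, orig_sr=sr, target_sr=TARGET_SR)
    print(f"resampled to {TARGET_SR} Hz before feature extraction")
```
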
DanBigioi commented 11 months ago

No worries, sure!

  1. I believe it's overfitting. Because CREMA-D isn't a diverse dataset (by diverse I mean varied backgrounds; not that many speakers in the grand scheme of things; all the videos being relatively similar in style), the model performs poorly on more natural "in-the-wild" videos. To fix this, I would recommend training on a bigger dataset like VoxCeleb, but that would take quite a lot of resources and time, which is why I couldn't do it.

  2. You are correct, the multi-speaker model can indeed be finetuned on a single speaker. We include the two checkpoints simply because our first experiments consisted of getting the single-speaker model to work, so we decided to include it here as well.

  3. Hmmm, I'm not sure. When we were doing the experiments, we encountered a similar problem to yours, where the output would degrade significantly after a few timesteps, even when testing on videos from the same dataset. Our problem, however, was the opposite: the model was underfitting on the dataset. This was because of a bug where our attention layers were never being called during training. Once we fixed that and ensured attention was correctly implemented, the model began converging. In your case, though, I suspect the main issue is that it's overfit on CREMA-D.
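One quick way to check whether the attention modules actually fire during training is to register forward hooks on them. The sketch below assumes a PyTorch model whose attention submodules have "attn" in their names; adjust the match to the repo's actual module naming:

```python
import torch.nn as nn

def count_attention_calls(model: nn.Module) -> dict:
    """Attach forward hooks that count how often each attention module runs."""
    counts = {}
    for name, module in model.named_modules():
        if "attn" in name.lower():  # adjust to match the repo's module names
            counts[name] = 0

            def hook(mod, inp, out, key=name):
                counts[key] += 1

            module.register_forward_hook(hook)
    return counts

# Usage sketch: after one forward/training step, any entry still at 0 means
# that attention module was never called.
# counts = count_attention_calls(model)
# model(batch_frames, batch_audio)
# print(counts)
```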

voku682 commented 11 months ago

Thanks! I will look into this and train the model on my own dataset to see if it solves the issue.

gongmm commented 4 months ago

@voku682 I encountered the same problem. Did training the model on your own dataset solve it?