ayush714 opened this issue 2 years ago
Hi @ayush714 ,
I am currently in the process of pre-processing the AVSpeech dataset, doing all the necessary data-cleaning as well as sync-correcting all the videos. Unfortunately, my research is temporarily on pause due to a few issues that I am experiencing, but once I have completed my research (towards the end of the year), I will release the script that downloads and cleans the AVSpeech dataset.
Also, with regards to changing the way that you load the audio: it may seem tempting to only load the relevant 0.2-second window, but unfortunately, from my experience, the benefits of this are short-lived. I noticed that in doing so, the model seems to make progress in training, but soon after, it begins to overfit very quickly. This overfitting did not make sense to me, since the loss it converges to is nowhere near what it should be to achieve sensible results.
Just to give you an idea of how to sync-correct the dataset, I highly recommend experimenting with this repo (the official SyncNet repo). Using the instructions in the repo, once you have experimented with the network on a few videos, you will notice that it outputs three measures (AV offset, confidence and min distance). Of these, you need to work with the AV offset: modify each video such that all videos end up with an AV offset within [-1, 1]. Once you have the AV offset for a video, you can sync-correct it accordingly using ffmpeg.
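For illustration, a minimal sketch of that filtering step could look like the following. `get_av_offset` is only a placeholder for however you read the AV offset out of the SyncNet pipeline for a single clip; it is not part of the SyncNet repo's API.

```python
def split_by_offset(video_paths, get_av_offset, tolerance=1):
    """Separate clips whose AV offset is already within [-tolerance, tolerance]
    frames from those that still need to be shifted with ffmpeg."""
    in_sync, to_correct = [], []
    for path in video_paths:
        offset = get_av_offset(path)  # AV offset in frames, as reported by SyncNet
        if abs(offset) <= tolerance:
            in_sync.append((path, offset))
        else:
            to_correct.append((path, offset))
    return in_sync, to_correct
```

Clips that land in `to_correct` are then shifted with ffmpeg and re-checked until their offset falls inside [-1, 1].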
@Mayur28
Do you mean that if all videos have an AV offset between [-1, 1], the dataset is good to start training? And that if a video has an AV offset of 5, one should use ffmpeg to correct it until the video has a new AV offset of -1, 0 or +1? Is that correct?
Thanks!
Hi @dprojdlbr
Hi @Mayur28, can you tell me the ffmpeg command to sync-correct based on the AV offset?
And what are the units of the AV offset: seconds or milliseconds?
```
ffmpeg -y -i <video> -itsoffset <shift> -i <video> -ss <shift> -t <full duration of video - abs(shift)> -map 0:v -map 1:a <new output>
```
Here, shift = AV offset / FPS
The AV offset is in frames, so the shift above converts it to seconds. A more detailed "explanation".
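For reference, a small Python wrapper around that command might look like the sketch below. The helper names are made up for illustration, and the behaviour for negative offsets is worth verifying on a couple of clips before batch-processing.

```python
import subprocess

def video_duration(path):
    """Query the container duration in seconds with ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

def sync_correct(video, av_offset_frames, fps, output):
    shift = av_offset_frames / fps                 # frames -> seconds
    duration = video_duration(video) - abs(shift)  # full duration minus |shift|
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video,                   # input 0: take the video track from here
        "-itsoffset", str(shift),      # shift the timestamps of input 1
        "-i", video,                   # input 1: take the audio track from here
        "-ss", str(shift),             # trim the start of the output
        "-t", str(duration),           # and its length
        "-map", "0:v", "-map", "1:a",
        output,
    ], check=True)

# e.g. a clip with an AV offset of 5 frames at 25 fps:
# sync_correct("clip.mp4", 5, 25, "clip_synced.mp4")
```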
Hi @Mayur28, could you share the cleaning scripts for the AVSpeech dataset?
Hi, I am also running into the same issue as you guys. I am using a subset of the AVSpeech dataset, and the network seems to be hard-stuck at 0.69. I once tried training it for a few million iterations, but it did not show any sign of learning at all. I suspect the problem is the synchronisation of my dataset. Have you guys @vokshin @Mayur28 @i-amgeek @yonglianglan tried to sync-correct the dataset with SyncNet? If so, could you give me a hint on how to do that?
The reason why I assume the problem is in the sync of the given dataset is that, when I desperately made some changes to the dataloader, it started training better. I changed the loading of the audio file: instead of computing the melspectrogram of the whole audio file, I only load the relevant 0.2 s of audio and compute the melspectrogram of that snippet. This reduces the possible discretisation offset by <12.5 ms. Somehow this small change made the network train, but very slowly. It also started overfitting at a loss of about 0.55 :/ That is why I want to explore sync-correction with SyncNet.
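Roughly, that kind of dataloader change could look like the sketch below. The sample rate, fps and mel parameters are assumptions for illustration, not necessarily the exact values used; they should be matched to the hparams of the training code.

```python
import librosa
import numpy as np

SR = 16000       # assumed audio sample rate
FPS = 25         # assumed video frame rate
WINDOW_S = 0.2   # 5 video frames at 25 fps

def mel_for_frame(wav_path, frame_idx):
    """Load only the ~0.2 s of audio aligned with the sampled video frame
    and compute the mel spectrogram of that snippet."""
    start = frame_idx / FPS  # start of the window in seconds
    snippet, _ = librosa.load(wav_path, sr=SR, offset=start, duration=WINDOW_S)
    mel = librosa.feature.melspectrogram(
        y=snippet, sr=SR, n_fft=800, hop_length=200, n_mels=80)
    return np.log(mel + 1e-5).T  # (time, n_mels), crude log compression

# e.g. the 0.2 s window starting at video frame 120:
# mel_for_frame("clip.wav", 120)
```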
This is my loss curve after the aforementioned changes, and for reference, before the changes. I assume yours look similar?
Originally posted by @GGaryuk in https://github.com/Rudrabha/Wav2Lip/issues/296#issuecomment-883491538