yonglianglan opened this issue 3 years ago
I faced the same problem using the LRS3 dataset and haven't found a solution yet. I am surprised that your loss is also stuck at 0.69 (same as mine) even though we are using different datasets.
I ran into this problem when I was trying to train SyncNet on my own dataset. To sort out the issue, I decided to reproduce the SyncNet training on the LRS2 dataset to get the developers' result, keeping the original hyperparameters. In other issues, the authors wrote that convergence requires 150,000 to 200,000 iterations. But unfortunately, after 480,000 iterations the loss has only reached 0.68.
Hi @vokshin, I am experiencing the same issue with LRS2: the network does not seem to start converging even when using the repo code as-is.
Hi, I am also running into the same issue as you guys. I am using a subset of the AVSpeech dataset, and the network seems to be hard stuck at 0.69. I once tried training it for a few million iterations, but it did not show any sign of learning at all. I suspect the problem is the synchronisation of my dataset. Have you guys @vokshin @Mayur28 @i-amgeek @yonglianglan tried to sync-correct the dataset with SyncNet? If so, could you give me a hint on how to do that?
The reason I assume the problem is in the sync of the dataset is that when I desperately made some changes to the dataloader, it started training better. I changed the loading of the audio file: instead of computing the mel spectrogram on the whole audio file, I only load the relevant 0.2 s of audio and compute the mel spectrogram of that snippet. This reduces the possible discretisation offset to < 12.5 ms. Somehow this small change made the network train, but very slowly, and it started overfitting at a loss of about 0.55 :/ That is why I want to explore sync-correction with SyncNet.
This is my loss curve after the aforementioned changes, and, for reference, before the changes. I assume yours look similar?
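For anyone who wants to try the same change, here is a minimal sketch of what I mean, assuming the repo's audio.py helpers, 25 fps video, 16 kHz audio and syncnet_T = 5 (not my exact code; the padding issue is discussed further down the thread):

```python
from os.path import join

import audio                       # the repo's audio.py
from hparams import hparams

SAMPLES_PER_FRAME = hparams.sample_rate // 25   # 16000 / 25 = 640 samples per video frame

def mel_for_window(vidname, start_frame_id, syncnet_T=5):
    wav = audio.load_wav(join(vidname, "audio.wav"), hparams.sample_rate)
    # slice only the ~0.2 s of audio that belongs to these syncnet_T video frames
    snippet = wav[start_frame_id * SAMPLES_PER_FRAME:
                  (start_frame_id + syncnet_T) * SAMPLES_PER_FRAME]
    return audio.melspectrogram(snippet).T       # roughly 16-17 mel steps for 0.2 s
```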
Hi, @GGaryuk !
Thank you for the idea of taking just the corresponding piece of audio and then calculating the mel spectrogram. I'll try this.
But the fact is that I am trying to reproduce the developers' steps, i.e. to train SyncNet on the dataset on which they trained it. The authors recommend sync-correcting third-party datasets in order to avoid such problems. However, the problem occurs even with the LRS2 dataset.
So guys, I think I have some understanding of our problem.
First of all, thanks to @GGaryuk for the advice to take 0.2 s of audio and compute the corresponding mel spectrogram. This helped me a lot: a loss of 0.42 at epoch 30.
The authors tell us that the problem is in the synchronisation of our datasets. It's not so much about the datasets themselves as about the data produced by preprocess.py. preprocess.py uses a face detection tool, and this tool does not work perfectly: frames on which no face is detected are discarded. As a result, we get fewer frames at the output than there are in the video, while the length of the audio stays the same.
What follows from this? Loss of sync. I wrote a simple bash script that validates the data after preprocess.py has processed it. It traverses the directories and divides the number of extracted frames by the audio length in seconds. In theory, we should get a number close to the video fps. However, sometimes you get a value that deviates greatly from the fps. For example, here is some of what I got for LRS2 after preprocess.py:
23.848 24.857 24.553 24.375 24.493 24.771 24.609 24.796 24.147 23.958
This is not a bad result. The deviation from 25 is not large. On my own dataset, these deviations are much worse. I believe this is the reason.
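For anyone who prefers Python, a rough equivalent of my bash check (assuming the preprocess.py output layout of <clip_dir>/0.jpg, 1.jpg, ..., audio.wav and the repo's audio helpers; the tolerance is arbitrary):

```python
import glob
from os.path import join

import audio                       # the repo's audio.py
from hparams import hparams

def effective_fps(clip_dir):
    n_frames = len(glob.glob(join(clip_dir, "*.jpg")))
    wav = audio.load_wav(join(clip_dir, "audio.wav"), hparams.sample_rate)
    audio_seconds = len(wav) / hparams.sample_rate
    return n_frames / audio_seconds          # should be close to the source fps (25 for LRS2)

for clip_dir in sorted(glob.glob("lrs2_preprocessed/*/*")):
    fps = effective_fps(clip_dir)
    if abs(fps - 25) > 0.5:                  # arbitrary tolerance, just for illustration
        print(clip_dir, round(fps, 3))
```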
Hi @GGaryuk , @vokshin ,
Thanks for the great advice - I have yet to try out the new audio processing method suggested by @GGaryuk - it's great to hear that the model seems to be converging very quickly (even compared to the original implementation).
With regards to the video processing, I also noticed that the S3FD face detector produces too many false positives (detecting faces where there aren't any), and the tool is extremely slow. Consequently, I considered another face detection tool which seemed to work almost perfectly in my experience, and it took 3-4 hours less to preprocess the LRS2 dataset. What I did was adapt the tool in this repo to do face detection, and admittedly, it is a vast improvement over S3FD. In light of your point @vokshin, your explanation makes sense, but I am not sure how big a difference this would make, because the model still did not converge even when I preprocessed the data using the new method. I'm hoping to consider other interesting tricks as well to improve convergence, such as curriculum learning, as discussed in this really great paper.
Hi!
I have been experimenting with a few variants of the original implementation and, as advised in this issue, I have verified that the model is able to converge much quicker (in my case, the training error is already at 0.4 after 20K steps).
Hi @vokshin , @GGaryuk ,
I've tried to implement the solution suggested by @GGaryuk - unfortunately, I have been unable to see any form of convergence (with the LRS2 dataset) even after 40K steps, and I'm not sure where I am making a miscalculation. This is what I used to first extract the corresponding 0.2 seconds of audio and thereafter compute the mel spectrogram:
```python
img_name = random.choice(img_names)
img_idx = self.get_frame_id(img_name)
# .....
wavpath = join(vidname, "audio.wav")
wav = audio.load_wav(wavpath, hparams.sample_rate)

# For video: fps = 25; for audio: sr = 16000 Hz. Therefore 1 video frame
# corresponds to 16000 / 25 = 640 audio samples.
new_wav = wav[img_idx * 640: (img_idx + 5) * 640]
if new_wav.shape[0] != 3200:  # 640 * 5 = 3200 (5 because we are considering 5 video frames)
    continue

orig_mel = audio.melspectrogram(new_wav).T
if orig_mel.shape[0] != 17:  # the resulting mel spectrogram always seemed to have shape [17, 80]
    continue
```
Any advice/assistance would be greatly appreciated. Thanks!
Hi @Mayur28, I guess it would be better to leave mel_step_size as it was, at 16. To get a proper mel spectrogram you have to load a bit more of the audio file, since the last STFT window could start right at the end of your snippet: the default win_size in the hparams is 800 samples = 50 ms, so add 50 ms to the length. In my investigations I also monitored how often the if clause triggers. Since it triggered too often due to too-short mel spectrograms, I found that loading 3 extra frames of audio does not unnecessarily prolong the computation but results in very few too-short mel spectrograms. After computing the longer mel spectrogram I just use the first 16 = syncnet_mel_step_size entries:
```python
mel = audio.melspectrogram(wav).T[:syncnet_mel_step_size]
if mel.shape[0] != syncnet_mel_step_size:
    print(f'wrong mel shape {mel.shape[0]}, - continue')
    continue
```
By the way, I also noticed that I had the learning rate set to 1e-5 during the run where I had better convergence.
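Putting the two points together, the window extraction from your snippet would become something like this (a sketch of how I read the advice above, reusing your variable names and the repo's audio helpers; not tested against your exact code):

```python
import audio                       # the repo's audio.py
from hparams import hparams

def padded_mel(wav, img_idx, syncnet_T=5, syncnet_mel_step_size=16):
    samples_per_frame = hparams.sample_rate // 25            # 640 samples at 16 kHz / 25 fps
    start = img_idx * samples_per_frame
    # load 3 extra frames of audio plus one 50 ms window (win_size = 800 samples)
    end = (img_idx + syncnet_T + 3) * samples_per_frame + 800
    mel = audio.melspectrogram(wav[start:end]).T[:syncnet_mel_step_size]
    if mel.shape[0] != syncnet_mel_step_size:
        return None   # still too short (e.g. window at the very end of the clip) -> resample
    return mel
```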
> I have been experimenting with a few variants of the original implementation and, as advised in this issue, I have verified that the model is able to converge much quicker (in my case, the training error is already at 0.4 after 20K steps).
Sadly that suggestion did not help me. But I noticed that whether the model gets below 0.69 is very learning-rate dependent: 1e-5 was the only value that got it to move under 0.69 in less than 500k iterations for me. But I still struggle with overfitting...
@vokshin
> What follows from this? Loss of sync. I wrote a simple bash script that validates the data after preprocess.py has processed it. It traverses the directories and divides the number of extracted frames by the audio length in seconds. In theory, we should get a number close to the video fps. However, sometimes you get a value that deviates greatly from the fps. For example, here is some of what I got for LRS2 after preprocess.py:
At first I adopted your metric to check my dataset, but now I think your metric is too weak. As the video and audio become longer, the quotient tends towards 25 and a few dropped frames go unnoticed. But for the issue of synchronisation, a few dropped images destroy the data instance completely: if you miss 5 frames at the beginning of a 12 s video, the quotient will still be close to 25, yet the offset is far outside the syncnet_T window of Wav2Lip. I would suggest looking at the difference between audio and video length instead.
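Something like this small sketch of the stricter check (same assumptions about the preprocessed layout as the quotient script above):

```python
import glob
from os.path import join

import audio                       # the repo's audio.py
from hparams import hparams

def sync_gap_seconds(clip_dir, fps=25):
    n_frames = len(glob.glob(join(clip_dir, "*.jpg")))
    wav = audio.load_wav(join(clip_dir, "audio.wav"), hparams.sample_rate)
    video_seconds = n_frames / fps
    audio_seconds = len(wav) / hparams.sample_rate
    return abs(audio_seconds - video_seconds)    # 5 dropped frames at 25 fps -> a 0.2 s gap

# flag clips whose gap is larger than one frame duration
for clip_dir in sorted(glob.glob("lrs2_preprocessed/*/*")):
    gap = sync_gap_seconds(clip_dir)
    if gap > 1 / 25:
        print(clip_dir, round(gap, 3))
```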
@GGaryuk, yes, you are right. This metric cannot be considered reliable; it is a rather rough, first-pass estimate of the sync loss. But it still allows you to detect a loss of synchronisation. For example, I came across a ratio of 23 in a video with fps=28 - you can imagine how big the offset is there.
Recently, I was advised to apply a different solution in preprocess.py: if there is no face in a frame, cut the piece of the overall mel spectrogram corresponding to that frame. Presumably, we would then completely avoid the loss of synchronisation.
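This is roughly how I picture that idea (many assumptions: 25 fps video and the repo's default mel settings of hop_size=200 at 16 kHz, i.e. 80 mel steps per second, so 3.2 mel steps per video frame; whether it is needed at all depends on whether preprocess.py really drops frames):

```python
import numpy as np

def drop_undetected_segments(orig_mel, face_detected):
    """orig_mel: (T, 80) mel of the whole clip; face_detected: one bool per video frame."""
    mel_per_frame = 80.0 / 25.0                     # 3.2 mel steps per video frame
    kept = [orig_mel[int(i * mel_per_frame): int((i + 1) * mel_per_frame)]
            for i, ok in enumerate(face_detected) if ok]
    return np.concatenate(kept, axis=0)             # mel with the face-less slices removed
```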
@vokshin, I think your proposed solution sounds promising and I will try it out as well. I'm not sure how applicable it will be in the case of LRS2 since, to the best of my knowledge, every frame contains a face, so there would be no need to cut the audio?
I think with the handful of ideas that we have all tried to get the model to converge, we are definitely heading in the right direction. Thus far, what surprises me the most is that, even when using the original code as-is, the model does not converge. I wonder if there were any other tricks used (that have not been mentioned) to train the provided pretrained model in order for it to converge.
@vokshin Your suggestion is really about how to handle missing data. Dropping the data is one option; another is copying the previous or next frame when no face is detected, like filling NAs. But that method assumes there are never 5 or more consecutive blank frames.
I'm not sure how big a difference this makes during preprocessing, but I noticed that the length of the extracted audio is not the same as the video. This is the case even without considering face detection. It might have something to do with the way ffmpeg extracts the audio, and because the audio and video have different durations, the samples could be out of sync to a minor degree. As an example, I am working with clip 5536038039829982468/00026: the video is 0.92 seconds long, but the extracted audio is 0.96 seconds.
Perhaps someone could provide some insight into whether this is worth considering or not. Thanks!
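For anyone who wants to reproduce the check, a small sketch using ffprobe (assuming ffprobe is on PATH; the clip path is just a placeholder):

```python
import subprocess

def stream_duration(path, stream):
    """Duration in seconds of one stream ('v:0' for video, 'a:0' for audio)."""
    out = subprocess.check_output([
        "ffprobe", "-v", "error",
        "-select_streams", stream,
        "-show_entries", "stream=duration",
        "-of", "default=noprint_wrappers=1:nokey=1",
        path,
    ])
    return float(out.decode().strip())

clip = "5536038039829982468/00026.mp4"   # placeholder path
video_len = stream_duration(clip, "v:0")
audio_len = stream_duration(clip, "a:0")
print(video_len, audio_len, abs(video_len - audio_len))
```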
> So guys, I think I have some understanding of our problem.
> First of all, thanks to @GGaryuk for the advice to take 0.2 s of audio and compute the corresponding mel spectrogram. This helped me a lot: a loss of 0.42 at epoch 30.
> The authors tell us that the problem is in the synchronisation of our datasets. It's not so much about the datasets themselves as about the data produced by preprocess.py. preprocess.py uses a face detection tool, and this tool does not work perfectly: frames on which no face is detected are discarded. As a result, we get fewer frames at the output than there are in the video, while the length of the audio stays the same.
> What follows from this? Loss of sync. I wrote a simple bash script that validates the data after preprocess.py has processed it. It traverses the directories and divides the number of extracted frames by the audio length in seconds. In theory, we should get a number close to the video fps. However, sometimes you get a value that deviates greatly from the fps. For example, here is some of what I got for LRS2 after preprocess.py:
> 23.848 24.857 24.553 24.375 24.493 24.771 24.609 24.796 24.147 23.958
> This is not a bad result. The deviation from 25 is not large. On my own dataset, these deviations are much worse. I believe this is the reason.
Guys, I made a mistake in my statement. A frame in which no face is detected is still numbered and simply skipped by preprocess.py, which means synchronisation is not lost in color_syncnet_train.py. I apologize.
That is, such a metric cannot be applied at all in this case.
Hello!
Have you guys checked the pre-trained SyncNet weights that we get from the developers? I decided to run a validation. As a result, I got loss=0.53 on the LRS2 train sub-dataset and loss=0.45 on the val sub-dataset. Maybe I made a mistake somewhere? Can anyone check these weights as well? At the same time, the README says that SyncNet should be trained down to ~0.25 on our own datasets.
By the way, I managed to train only down to 0.36.
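For anyone who wants to repeat the check, a rough sketch of a validation loop (assuming the SyncNet_color model and the BCE-on-cosine-similarity loss from color_syncnet_train.py; the checkpoint layout may need the repo's load_checkpoint helper instead of the raw load shown here, and the val_data_loader should be built from the Dataset class in color_syncnet_train.py with the 'val' split):

```python
import torch
from torch import nn

from models import SyncNet_color

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SyncNet_color().to(device).eval()
ckpt = torch.load("checkpoints/lipsync_expert.pth", map_location=device)
model.load_state_dict(ckpt["state_dict"])      # may need the repo's load_checkpoint helper

logloss = nn.BCELoss()

def cosine_loss(a, v, y):                      # the loss used in color_syncnet_train.py
    d = nn.functional.cosine_similarity(a, v)
    return logloss(d.unsqueeze(1), y)

def evaluate(val_data_loader):                 # loader built from the training script's Dataset
    losses = []
    with torch.no_grad():
        for x, mel, y in val_data_loader:
            a, v = model(mel.to(device), x.to(device))
            losses.append(cosine_loss(a, v, y.to(device)).item())
    return sum(losses) / len(losses)
```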
Hi @vokshin,
When I attempted to further train the provided pre-trained SyncNet weights, the validation loss starts at 0.2581 and the training loss at 0.2785. It seems that their pre-trained model was trained for 1.75 million steps. Despite the several attempts I have made to train SyncNet from scratch, I could only get down to 0.30.
> Hi @vokshin,
> When I attempted to further train the provided pre-trained SyncNet weights, the validation loss starts at 0.2581 and the training loss at 0.2785. It seems that their pre-trained model was trained for 1.75 million steps. Despite the several attempts I have made to train SyncNet from scratch, I could only get down to 0.30.
How long does it take for 1.75 million steps, and what hardware are you using?
When I trained the model from scratch, even if I were to train for 1.75 million steps, the lowest validation loss I could achieve was 0.3. I found the original implementation very slow (it took 4-5 days), so I implemented my own version of the Wav2Lip solution that is conceptually identical but changes how certain things are implemented to improve efficiency; it now takes only around 2 days. I am training my code on a high-performance cluster with a GeForce RTX 3090 GPU.
> When I trained the model from scratch, even if I were to train for 1.75 million steps, the lowest validation loss I could achieve was 0.3. I found the original implementation very slow (it took 4-5 days), so I implemented my own version of the Wav2Lip solution that is conceptually identical but changes how certain things are implemented to improve efficiency; it now takes only around 2 days. I am training my code on a high-performance cluster with a GeForce RTX 3090 GPU.
Can you explain your idea? I'm also implementing this model
I made several changes to the code structure and used several alternate libraries that are more efficient.
> I made several changes to the code structure and used several alternate libraries that are more efficient.
Can you suggest some libraries?
> Hi, I am also running into the same issue as you guys. I am using a subset of the AVSpeech dataset, and the network seems to be hard stuck at 0.69. I once tried training it for a few million iterations, but it did not show any sign of learning at all. I suspect the problem is the synchronisation of my dataset. Have you guys @vokshin @Mayur28 @i-amgeek @yonglianglan tried to sync-correct the dataset with SyncNet? If so, could you give me a hint on how to do that?
> The reason I assume the problem is in the sync of the dataset is that when I desperately made some changes to the dataloader, it started training better. I changed the loading of the audio file: instead of computing the mel spectrogram on the whole audio file, I only load the relevant 0.2 s of audio and compute the mel spectrogram of that snippet. This reduces the possible discretisation offset to < 12.5 ms. Somehow this small change made the network train, but very slowly, and it started overfitting at a loss of about 0.55 :/ That is why I want to explore sync-correction with SyncNet.
> This is my loss curve after the aforementioned changes, and, for reference, before the changes. I assume yours look similar?
Hi @GGaryuk, I also want to train on the AVSpeech dataset, but I am unable to download and preprocess it for training. Do you have any scripts for that, and can you tell me how you trained on AVSpeech? Or could you open-source the script?
> I made several changes to the code structure and used several alternate libraries that are more efficient.
Hi @Mayur28, can you mention a few of the libraries you used that led to faster training?
> So guys, I think I have some understanding of our problem.
> First of all, thanks to @GGaryuk for the advice to take 0.2 s of audio and compute the corresponding mel spectrogram. This helped me a lot: a loss of 0.42 at epoch 30.
> The authors tell us that the problem is in the synchronisation of our datasets. It's not so much about the datasets themselves as about the data produced by preprocess.py. preprocess.py uses a face detection tool, and this tool does not work perfectly: frames on which no face is detected are discarded. As a result, we get fewer frames at the output than there are in the video, while the length of the audio stays the same.
> What follows from this? Loss of sync. I wrote a simple bash script that validates the data after preprocess.py has processed it. It traverses the directories and divides the number of extracted frames by the audio length in seconds. In theory, we should get a number close to the video fps. However, sometimes you get a value that deviates greatly from the fps. For example, here is some of what I got for LRS2 after preprocess.py: 23.848 24.857 24.553 24.375 24.493 24.771 24.609 24.796 24.147 23.958
> This is not a bad result. The deviation from 25 is not large. On my own dataset, these deviations are much worse. I believe this is the reason.
> Guys, I made a mistake in my statement. A frame in which no face is detected is still numbered and simply skipped by preprocess.py, which means synchronisation is not lost in color_syncnet_train.py. I apologize.
> That is, such a metric cannot be applied at all in this case.
Hi @vokshin, I have gone through preprocess.py but didn't find the code that ignores the audio mels related to the skipped frames. Can you quote the code link or line number where you found that?
Hi @TejaswiniiB ,
I primarily used PyTorch: torchaudio for audio management and torchvision for image handling. From my experience, I don't think the choice of libraries had a huge influence on the performance gains I achieved; I think it had more to do with the code structure and how the software was used. In general, I feel that the code in the repo somewhat under-utilizes the software.
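As an example of what the torchaudio route can look like (a hedged sketch, not my exact code: the parameters mirror the repo's hparams, but the resulting mel is not numerically identical to the librosa-based audio.melspectrogram, so you would retrain rather than mix the two):

```python
import torchaudio

# parameters mirroring the repo's hparams: 16 kHz, n_fft = win_size = 800, hop_size = 200, 80 mels
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=800, win_length=800, hop_length=200, n_mels=80)

wav, sr = torchaudio.load("audio.wav")     # (channels, samples)
mel = mel_transform(wav).squeeze(0).T      # (time, 80); scaling differs from audio.melspectrogram
```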
Just need more time .......
This is my training loss for the "lip-sync expert" (with the default training config) on LRS2 (train set). The loss stayed at 0.68 for about 30 hours on a GTX 1080 Ti, then it began to decrease...
> Hi @vokshin, When I attempted to further train the provided pre-trained SyncNet weights, the validation loss starts at 0.2581 and the training loss at 0.2785. It seems that their pre-trained model was trained for 1.75 million steps. Despite the several attempts I have made to train SyncNet from scratch, I could only get down to 0.30.
> How long does it take for 1.75 million steps, and what hardware are you using?
Hi, may I ask a few questions?
1. What do NF and MV stand for in test.txt? For example: 6330311066473698535/00011 NF, 330311066473698535/00018 MV
2. Does 'Conf' mean confidence in 00001.txt?
Text: WHEN YOU'RE COOKING CHIPS AT HOME
Conf: 4
I ran into the same problem when I was trying to train SyncNet on AVSpeech: the loss seemed stuck at 0.69. But when I used Adamax instead, the problem seems fixed.
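Concretely, the change is just the optimizer line in color_syncnet_train.py, something like the following (my reading of it: the original script uses Adam with hparams.syncnet_lr; everything else stays as-is):

```python
import torch

optimizer = torch.optim.Adamax(
    [p for p in model.parameters() if p.requires_grad],
    lr=hparams.syncnet_lr)
```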
> I ran into the same problem when I was trying to train SyncNet on AVSpeech: the loss seemed stuck at 0.69. But when I used Adamax instead, the problem seems fixed.
Hi, can you explain a bit more about what exactly you did? I am also trying to train it on AVSpeech. Also, are you sync-correcting the videos?
> So guys, I think I have some understanding of our problem.
> First of all, thanks to @GGaryuk for the advice to take 0.2 s of audio and compute the corresponding mel spectrogram. This helped me a lot: a loss of 0.42 at epoch 30.
> The authors tell us that the problem is in the synchronisation of our datasets. It's not so much about the datasets themselves as about the data produced by preprocess.py. preprocess.py uses a face detection tool, and this tool does not work perfectly: frames on which no face is detected are discarded. As a result, we get fewer frames at the output than there are in the video, while the length of the audio stays the same.
> What follows from this? Loss of sync. I wrote a simple bash script that validates the data after preprocess.py has processed it. It traverses the directories and divides the number of extracted frames by the audio length in seconds. In theory, we should get a number close to the video fps. However, sometimes you get a value that deviates greatly from the fps. For example, here is some of what I got for LRS2 after preprocess.py:
> 23.848 24.857 24.553 24.375 24.493 24.771 24.609 24.796 24.147 23.958
> This is not a bad result. The deviation from 25 is not large. On my own dataset, these deviations are much worse. I believe this is the reason.
Thanks for sharing; it solved my problem.
Same problem here. Using the HDTF dataset and 2 V100s, and the loss mysteriously gets stuck at 0.69. I am still running it, hoping that it improves within 3 days.
I am back. The problem can be solved by training more epochs!! After training for 130,000 epochs, the loss came down to 0.33 on the HDTF dataset. I recommend keeping a log to watch for overfitting.
> I am back. The problem can be solved by training more epochs!! After training for 130,000 epochs, the loss came down to 0.33 on the HDTF dataset. I recommend keeping a log to watch for overfitting.
Hello, I have the same problem. My train loss has come down to 0.3, but the val loss is at 0.8. What is your validation set loss?
> I am back. The problem can be solved by training more epochs!! After training for 130,000 epochs, the loss came down to 0.33 on the HDTF dataset. I recommend keeping a log to watch for overfitting.
> Hello, I have the same problem. My train loss has come down to 0.3, but the val loss is at 0.8. What is your validation set loss?
Hi. It is also about 0.3. The SyncNet may be overfitting, so watch it closely (logging will be beneficial).
> I am back. The problem can be solved by training more epochs!! After training for 130,000 epochs, the loss came down to 0.33 on the HDTF dataset. I recommend keeping a log to watch for overfitting.
> Hello, I have the same problem. My train loss has come down to 0.3, but the val loss is at 0.8. What is your validation set loss?
> Hi. It is also about 0.3. The SyncNet may be overfitting, so watch it closely (logging will be beneficial).
@Crestina2001 Hi, did you train from scratch or finetune from lipsync_expert.pth? When I finetune lipsync_expert.pth on my own data, the train loss decreases very slowly (to ~0.4), but the eval loss stays at ~0.52. I have tried stopping the training and retraining several times, with the same result. Can you give me some advice?
> I am back. The problem can be solved by training more epochs!! After training for 130,000 epochs, the loss came down to 0.33 on the HDTF dataset. I recommend keeping a log to watch for overfitting.
> Hello, I have the same problem. My train loss has come down to 0.3, but the val loss is at 0.8. What is your validation set loss?
> Hi. It is also about 0.3. The SyncNet may be overfitting, so watch it closely (logging will be beneficial).
> @Crestina2001 Hi, did you train from scratch or finetune from lipsync_expert.pth? When I finetune lipsync_expert.pth on my own data, the train loss decreases very slowly (to ~0.4), but the eval loss stays at ~0.52. I have tried stopping the training and retraining several times, with the same result. Can you give me some advice?
@kalyo-zjl Hi, I wonder whether lipsync_expert can be finetuned on your own dataset. The author of this repo said: "You must train the expert discriminator for your own dataset before training Wav2Lip."
@yuweimian-shy I'm not sure. I succeeded in training from scratch; the key is to sync the data strictly. But I still suffer a bit from overfitting, where the training loss is ~0.28 and the val loss stays at ~0.35.
@kalyo-zjl Looks great! The result feels ready for training Wav2Lip. Which datasets are you using - is it a Chinese dataset? And have you adjusted the hyperparams, such as lr or batch size?
> @kalyo-zjl Looks great! The result feels ready for training Wav2Lip. Which datasets are you using - is it a Chinese dataset? And have you adjusted the hyperparams, such as lr or batch size?
The datasets are collected from the web, mainly English. The hyperparams are left at their defaults.
> @kalyo-zjl Looks great! The result feels ready for training Wav2Lip. Which datasets are you using - is it a Chinese dataset? And have you adjusted the hyperparams, such as lr or batch size?
> The datasets are collected from the web, mainly English. The hyperparams are left at their defaults.
Thank you! I took your advice and did a strict sync on my dataset, and it's not stuck at 0.69 anymore. But I'm suffering badly from overfitting. Maybe my dataset is too small? Its total length is approximately 10 hours, with 124 speakers.
> @kalyo-zjl Looks great! The result feels ready for training Wav2Lip. Which datasets are you using - is it a Chinese dataset? And have you adjusted the hyperparams, such as lr or batch size?
> The datasets are collected from the web, mainly English. The hyperparams are left at their defaults.
> Thank you! I took your advice and did a strict sync on my dataset, and it's not stuck at 0.69 anymore. But I'm suffering badly from overfitting. Maybe my dataset is too small? Its total length is approximately 10 hours, with 124 speakers.
Can you tell me how to do a strict sync? I have the same problem: the training loss is around 0.3, but the val loss is 0.5-0.6, and more steps lead to a higher eval loss.
@wvinzh me too
> When I trained the model from scratch, even if I were to train for 1.75 million steps, the lowest validation loss I could achieve was 0.3. I found the original implementation very slow (it took 4-5 days), so I implemented my own version of the Wav2Lip solution that is conceptually identical but changes how certain things are implemented to improve efficiency; it now takes only around 2 days. I am training my code on a high-performance cluster with a GeForce RTX 3090 GPU.
> Can you explain your idea? I'm also implementing this model.
Hey, I would be interested in your code :) Do you still have it available?
> I made several changes to the code structure and used several alternate libraries that are more efficient.
> Hi @Mayur28, can you mention a few of the libraries you used that led to faster training?
Maybe the changes were using spectral norm and the Adamax optimizer. These two are really helpful when training SyncNet and Wav2Lip; a sketch of what I mean is below.
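A rough sketch of what those two changes could look like (an assumption about what was meant; the repo's own Conv2d wrapper in models/ could also be modified directly instead of the recursive wrapping shown here):

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

from models import SyncNet_color

def add_spectral_norm(module):
    # recursively wrap every nn.Conv2d with spectral normalization
    for name, child in module.named_children():
        if isinstance(child, nn.Conv2d):
            setattr(module, name, spectral_norm(child))
        else:
            add_spectral_norm(child)
    return module

model = add_spectral_norm(SyncNet_color())
optimizer = torch.optim.Adamax(model.parameters(), lr=1e-4)   # Adamax instead of Adam
```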
I am using my own dataset to train the expert lip-sync network, and the loss has not dropped; it stays around 0.69. Does anyone know the reason? Thanks.