Rikorose / DeepFilterNet

Noise suppression using deep filtering
https://huggingface.co/spaces/hshr/DeepFilterNet2

Problem of reproducing DeepFilterNet2 results #498

Closed tzuyun-huang-ss closed 3 days ago

tzuyun-huang-ss commented 5 months ago

Hi,

Thanks for your amazing work. I tried to re-train the DeepFilterNet2 model on the DNS-4 challenge dataset. To align with the settings mentioned in the paper, I revised the config.ini you provide on GitHub as follows.

```ini
[train]
batch_size_scheduling = 0/8,1/16,2/24,5/32,10/64,20/96

[multiresspecloss]
fft_sizes = 240,480,960,1920
```
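For context, `batch_size_scheduling` appears to encode epoch/batch-size pairs (batch size 8 from epoch 0, 16 from epoch 1, and so on), and `fft_sizes` sets the window sizes of the multi-resolution spectral loss. Below is a minimal PyTorch sketch of such a loss, assuming a simple L1 distance on STFT magnitudes; the function name is my own, and the repo's actual implementation differs (e.g. it may compress the magnitudes before comparing):

```python
import torch
import torch.nn.functional as F

def multi_res_spec_loss(clean: torch.Tensor, enhanced: torch.Tensor,
                        fft_sizes=(240, 480, 960, 1920)) -> torch.Tensor:
    """Sum of L1 distances between STFT magnitudes at several resolutions.

    240/480/960/1920 samples correspond to 5/10/20/40 ms windows at 48 kHz.
    """
    loss = clean.new_zeros(())
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=clean.device)
        stft = lambda x: torch.stft(x, n_fft, hop_length=n_fft // 4,
                                    window=window, return_complex=True).abs()
        loss = loss + F.l1_loss(stft(enhanced), stft(clean))
    return loss
```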

First try

At first, I split DNS4 into 70%/15%/15% (train/test/valid) by partitioning the audio list in order, without shuffling (so similar data ends up in the same split), but the result was always worse than the model you provide. Evaluating DNSMOS on the DNS4 blind test set, the re-trained model got an OVRL score of 2.939, while the model you provide got 3.019. To find out why, I drew the loss curves and found that the validation loss is always much lower than the training loss, as shown below: [figure: loss_wo_shuffle]
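For reference, the split was a simple ordered partition of the file list; a minimal sketch (the helper name is illustrative; the `shuffle` flag is what changes in the second try below):

```python
import random

def split_file_list(files, train=0.70, valid=0.15, shuffle=False, seed=42):
    """Partition a list of audio paths into train/valid/test splits.

    Without shuffling, neighboring (often similar) recordings all land
    in the same split, which can make the validation loss look optimistic.
    """
    files = list(files)
    if shuffle:
        random.Random(seed).shuffle(files)
    n_train = int(len(files) * train)
    n_valid = int(len(files) * valid)
    return (files[:n_train],                   # train
            files[n_train:n_train + n_valid],  # valid
            files[n_train + n_valid:])         # test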

Second try

Next, I tried shuffling the audio list and re-generating the HDF5 files. Although the training and validation losses are now relatively close, the validation loss is unstable, which seems unreasonable. Additionally, there is a big gap between my best model at the 31st epoch and your provided best model at the 96th epoch. [figure: loss_w_shuffle]
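The HDF5 files were re-generated with the repo's data preparation script; if I remember the README correctly, the invocation is along these lines (flags from memory, list files and HDF5 names are illustrative):

```sh
# One HDF5 per corpus type; each .txt lists one .wav path per line
python df/scripts/prepare_data.py --sr 48000 speech train_speech.txt TRAIN_SET_SPEECH.hdf5
python df/scripts/prepare_data.py --sr 48000 noise train_noise.txt TRAIN_SET_NOISE.hdf5
```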

Third try

So I looked through the issues on GitHub to see if there was anything I missed. I saw you said that adding the VCTK dataset can improve speech quality (#38). I therefore added VCTK to my dataset and merged audio files so that each file is at least 3 seconds long. Then I oversampled VCTK by a factor of 10, reusing each audio file 10 times. I have now trained to the 34th epoch, but the validation curve still looks very unstable. [figure: loss_w_shuffle_vctk]
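One note on the factor of 10: instead of physically duplicating the VCTK files, the dataset configuration supports a per-HDF5 sampling factor; if I read the dataset config format correctly, oversampling VCTK would look roughly like this (file names are illustrative):

```json
{
  "train": [
    ["DNS4_SPEECH.hdf5", 1.0],
    ["VCTK_SPEECH.hdf5", 10.0],
    ["DNS4_NOISE.hdf5", 1.0]
  ]
}
```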

Question

Do you have any idea about the trend of the loss curves? Or could you provide your training log to me? Regarding speech quality, the spectrograms below show the difference in results: [figures: spec1, spec2]

We can see that the re-trained model cannot achieve the same results as the model you provide: your result suppresses more low-frequency noise and has clearer speech, while my result leaves more low-frequency noise and is blurrier. Do you have any idea what causes this difference, or any other suggestions for closing this gap?
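For completeness, the spectrograms above were plotted from the enhanced audio with a standard STFT; a minimal sketch to reproduce the comparison (file names are illustrative):

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
for ax, path, title in zip(
        axes,
        ["enhanced_provided.wav", "enhanced_retrained.wav"],
        ["Provided model", "Re-trained model"]):
    y, sr = librosa.load(path, sr=48000)
    # 20 ms window at 48 kHz, magnitude in dB relative to the peak
    spec_db = librosa.amplitude_to_db(np.abs(librosa.stft(y, n_fft=960)), ref=np.max)
    librosa.display.specshow(spec_db, sr=sr, x_axis="time", y_axis="hz", ax=ax)
    ax.set_title(title)
plt.tight_layout()
plt.show()
```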

hfwanguanghui commented 4 months ago

Can you share your config.ini and train.log files? Let me help you check your training config.

github-actions[bot] commented 1 week ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.