IntelLabs / IntelNeuromorphicDNSChallenge

Data leakage in the validation set #17

Open tihbe opened 1 year ago

tihbe commented 1 year ago

Hello,

Thank you for organizing this neuromorphic competition.

When following the instructions to generate the training and validation set from the README:

- Training dataset: `python noisyspeech_synthesizer.py -root <your dataset folder>`
- Validation dataset: `python noisyspeech_synthesizer.py -root <your dataset folder> -is_validation_set true`

the generated validation set is the same as the training set. There is no difference in the selection of samples between the two sets, except for the amplitude and SNR of the resulting files.

The noisyspeech_synthesizer.py script uses glob to list the clean and noise files from the DNS challenge, which returns a deterministic listing. The list of files is then shuffled with the random library: https://github.com/IntelLabs/IntelNeuromorphicDNSChallenge/blob/35c5ef8bdbe4f71efc51427545b0880a42276bd4/noisyspeech_synthesizer.py#L205

The problem is that the script sets a random seed at the beginning, which results in the exact same shuffling between training and validation, and thus the generation of the same files: https://github.com/IntelLabs/IntelNeuromorphicDNSChallenge/blob/35c5ef8bdbe4f71efc51427545b0880a42276bd4/noisyspeech_synthesizer.py#L34
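
To illustrate the mechanism, here is a minimal standalone sketch (not the challenge code itself; the seed value and directory layout are placeholders):

```python
import glob
import random

# Both the training run and the validation run execute this same code path.
random.seed(0)  # placeholder for the fixed seed set at the top of the script

def list_and_shuffle(root):
    # glob produces the same listing on every invocation of the script...
    files = glob.glob(f"{root}/**/*.wav", recursive=True)
    # ...and shuffling from an identical seed state reorders it identically,
    # so two separate runs end up walking the same sequence of files.
    random.shuffle(files)
    return files
```

Because neither the seed nor the listing depends on the split, both runs pair up the same clean and noise files.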

bamsumit commented 11 months ago

It should not. The `is_validation_set` flag gets propagated into `is_test_set`, which changes the way audio samples are generated deep inside the `build_audio` method.

In addition, we have done a thorough verification of the synthesized dataset and there are no repeated samples. However, if you find any repeated samples, do let us know.

tihbe commented 11 months ago

Thank you for your response. I believe I understand better now. The files are the same for the training and validation sets until a >30 s validation file is generated; there is then a random chance of shifting `idx` by 1, which affects all subsequent files. Is this correct?

At minimum, `fileid_0` should be the same in both training and validation. `fileid_1` and subsequent files could also be the same, depending on the sound selection. In my case, I guess I had the bad luck of having more than a dozen files before the shift happened, which produced a fairly noticeable overlap of files between the training and validation sets (audible when listening to files sorted by id).
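
As a hypothetical toy model of that behaviour (the 10% shift probability and the file names are made up, purely to visualise the offset):

```python
import random

files = [f"sound_{i}" for i in range(30)]  # same shuffled list in both runs

def synthesize(is_validation, n_files=15, seed=1):
    rng = random.Random(seed)
    out, idx = [], 0
    for _ in range(n_files):
        out.append(files[idx])
        if is_validation and rng.random() < 0.1:
            idx += 1  # toy stand-in for a >30 s file shifting the index
        idx += 1
    return out

train, valid = synthesize(False), synthesize(True)
print([t == v for t, v in zip(train, valid)])
# -> True, True, ... up to the first shift, then False for every later file
```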

Why not shuffle the file names differently for training and validation, so that the two sets are more independent? The order of the sounds is still the same for the clean files (sound "a" always leads to sound "b", and so forth). See the sketch below.
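
One way to do that (a sketch of the suggestion, not a patch against the actual script; the function name and `base_seed` are made up) is to fold the split into the seed, so each run gets its own ordering:

```python
import random

def shuffle_for_split(files, is_validation_set, base_seed=0):
    # Derive a split-specific seed so training and validation reorder
    # the same file list independently of one another.
    split = "validation" if is_validation_set else "training"
    rng = random.Random(f"{base_seed}-{split}")  # str seeds are supported
    files = list(files)
    rng.shuffle(files)
    return files
```

With the rest of the pipeline unchanged, the validation set would then no longer start from the same clean/noise pairs as the training set.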

I've created a Colab notebook to showcase this effect: https://colab.research.google.com/github/tihbe/Intel-N-DNSChallenge-Synthesize-Bug/blob/master/reproduce_bug.ipynb. The clean files are always the same in training and validation. The noise files at id 0 and id 1 are the same, and then the synthesizer happened, by luck, to diverge between training and validation.
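
For anyone wanting to scan their own generated data for duplicates, here is a rough sketch (the folder layout is hypothetical; peak-normalising the waveform first means a pure gain difference will not hide a copy):

```python
import hashlib
from pathlib import Path

import numpy as np
import soundfile as sf

def fingerprint(path, ndigits=4):
    # Hash the peak-normalised waveform so level-scaled copies still match.
    audio, _ = sf.read(path)
    peak = float(np.max(np.abs(audio))) or 1.0
    return hashlib.md5(np.round(audio / peak, ndigits).tobytes()).hexdigest()

def fingerprints(folder):
    return {fingerprint(p): p.name for p in sorted(Path(folder).glob("*.wav"))}

train = fingerprints("training_set/clean")    # hypothetical output layout
valid = fingerprints("validation_set/clean")  # hypothetical output layout
shared = set(train) & set(valid)
print(f"{len(shared)} clean files shared between training and validation")
```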