etzinis / two_step_mask_learning

A two step optimization for sound source separation on the adaptive front-end domain

The preprocessing results in fewer than 20000 mixtures in the training set of wsj0-mix2. #1

Closed fjiang9 closed 5 years ago

fjiang9 commented 5 years ago

Thanks for sharing the code! It's really great work on audio source separation! I have a question about preprocess_wsj0mix.py: since some audio files in wsj0-mix2 are shorter than 4 sec, the code in lines 139-140 discards them during preprocessing. As a result, there are only 17075 mixtures in the training set when using the "min" folder (and 19885 when using the "max" folder). This mismatches the number (20000) mentioned in Section 3.2.1 of the paper. So I was wondering how many samples were finally used in the speech separation experiment in this paper?

etzinis commented 5 years ago

Yes, for the data augmentation experiments (where I mix the audio sources online) I needed all the files to have at least 4 secs of audio, so I chopped them. In the WSJ case this is not needed. I will try to push a newer and cleaner version of this code after I am done with some deadlines that I have.

Glad that you liked the work, and I would be happy to answer any other questions you have! :)

etzinis commented 5 years ago

Section 3.2.1 refers to the WSJ case only, with no online mixing, so you can use all mixtures after zero-padding.

etzinis commented 5 years ago

Sorry for the confusion, but I wanted to match the experiments from other works while also creating the 4 sec online mixing procedure described in Section 3.2.2 of the paper.

fjiang9 commented 5 years ago

@etzinis Thank you so much for your kind reply! I am still running the speech separation experiment code. Looking forward to your new release : )

etzinis commented 5 years ago

As I said before, I will not be able to put out the new release until I am finished with some urgent things. I would suggest creating the WSJ data with the Matlab script provided here: http://wham.whisper.ai/README.html and then using my script with wav_timelength=4s https://github.com/etzinis/two_step_mask_learning/blob/master/two_step_mask_learning/utils/preprocess_wsj0mix.py

I will leave this issue open in order to fix it on the new release.

etzinis commented 5 years ago

I have rechecked what you said, and indeed, when using the 'max' folder from the wsj2mix dataset, the following files are created: Training: 19855, Testing: 2988, Validation: 4980.

This is caused, as you said, by lines 139-140 discarding the files with a duration below 4 secs. However, this amounts to neglecting only 0.7% of the training set and 0.4% of the testing and validation sets, which I consider negligible compared to the total size of the dataset. Moreover, zeros do not contribute to the SI-SDR loss, so either way I am just making my configuration a tiny bit harder than the initial setup. If you want to use the remaining files as well, you can simply zero-pad in lines 139-140.
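The pad-or-truncate choice described above can be sketched as follows. This is a minimal illustration, not the repository's exact code; the function and parameter names (`fix_length`, `fs`, `wav_timelength`) are assumptions:

```python
import numpy as np

def fix_length(wav, fs=8000, wav_timelength=4.0):
    """Zero-pad or truncate a mono waveform to exactly wav_timelength seconds.

    Padding (instead of discarding short files) keeps every mixture in the
    dataset; the trailing zeros contribute nothing to the SI-SDR loss.
    """
    target_len = int(fs * wav_timelength)
    if len(wav) < target_len:
        # Pad short recordings with trailing zeros.
        wav = np.pad(wav, (0, target_len - len(wav)))
    # Truncate longer recordings to the target length.
    return wav[:target_len]

# A 3-second clip at 8 kHz becomes exactly 4 seconds long.
padded = fix_length(np.ones(3 * 8000))
print(len(padded))  # 32000
```

With this change, every file in the dataset survives preprocessing, which is what restores the full 20000/5000/3000 split.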

I'll close this issue for now.

etzinis commented 5 years ago

I have also added the padding code in the corresponding lines, so the output distribution of samples will now be: Training: 20000, Testing: 3000, Validation: 5000.

etzinis commented 5 years ago

Thanks for noticing that @flyjiang92 🍺 😃

fjiang9 commented 5 years ago

@etzinis Thank you so much for your response and the code update! 👍