asteroid-team / asteroid

The PyTorch-based audio source separation toolkit for researchers
MIT License

Abnormal separated wavs #250

Closed. staplesinLA closed this issue 4 years ago

staplesinLA commented 4 years ago

Hi everyone, first of all, thanks for this remarkable toolkit and for all your efforts. (1) When I listen to the generated audio, it sounds messy. I think soundfile.write writes the data directly as float32; after I changed the call to sf.write('1.wav', estimate.astype(np.int16), 16000), it sounded normal again.

(2) Another question: I found that the longer I train, the worse the listening quality gets. It's weird, because the training loss and development loss curves both look good. Listening to the audio, I find that the separated signals from the first 20 epochs are good and moving in the right direction. After that, the amplitude of the speech changes dramatically, which often produces a swath. I also see this on the training set. Again, it's strange that the loss is well optimized at the same time.

Can anybody help with this? I will also look into it more deeply myself. Thanks!!!


jonashaag commented 4 years ago

What model are you training and what are the hyper params? Can you upload a few sound samples?

mpariente commented 4 years ago

My guess is that the output amplitude is unconstrained and goes out of the -1/+1 range. Wav files are clipped above those values (and soundfile doesn't correct that, rightfully IMO). So you should rescale your audio outputs, as is done for example with non-intrusive rescaling to match the amplitude of the mixture.

What model are you training and what are the hyper params? Can you upload a few sound samples?

This information would indeed help as well.

staplesinLA commented 4 years ago

What model are you training and what are the hyper params? Can you upload a few sound samples?

Thanks for helping!! I use Conv-TasNet on 16 kHz data. I tested an audio sample from the training set to look at the quantization, and it's shown below:


The source data are all scaled to the -1~1 range, while the estimates seem to prefer int16-range values. I don't know why; maybe it's because of the SI-SNR loss?

mpariente commented 4 years ago

and the estimation seems to prefer int16 values

The amplitude is unconstrained, this is a flaw of the SI-SNR loss. The values are still float32 though. See the above comment to solve this issue.
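The scale invariance mentioned above can be checked numerically. The following is a minimal NumPy sketch, not Asteroid's own loss implementation; the `si_snr` helper, the synthetic signals, and the gain of 20000 are all assumptions made for illustration. It shows that multiplying an estimate by an arbitrary gain leaves the SI-SNR unchanged, so the loss gives the network no reason to keep its output inside -1/+1.

```python
import numpy as np


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals (illustrative helper)."""
    # Remove the mean so that a DC offset does not influence the result.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to get the "signal" part.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    # Whatever is left after the projection counts as noise.
    noise = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(noise, noise) + eps)
    return 10 * np.log10(ratio)


rng = np.random.default_rng(0)
target = rng.standard_normal(16000)                    # stand-in for a clean source
estimate = target + 0.1 * rng.standard_normal(16000)   # stand-in for a model output

quiet = si_snr(estimate, target)
loud = si_snr(20000 * estimate, target)  # same estimate with an arbitrary large gain

print(f"SI-SNR at original scale: {quiet:.4f} dB")
print(f"SI-SNR after x20000 gain: {loud:.4f} dB")
```

Both printed values should agree up to floating-point error, which is consistent with good loss curves alongside wildly varying output amplitudes.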

Also, @jonashaag probably meant to upload "sound samples" that we can listen to :wink:

mpariente commented 4 years ago

By the way, are you integrating the audio samples into tensorboard?

staplesinLA commented 4 years ago

By the way, are you integrating the audio samples into tensorboard?

No, I open it in CoolEdit. I followed your suggestion and manually divided the outputs by 32768 to bring them into the -1~1 range. When I then write them with soundfile, the audio is back to normal. So it was my mistake to convert to int16 at the beginning, right? Even though it sounded very good for the first 20 epochs.
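The effect described in this exchange can be reproduced without any audio I/O. The sketch below is only a simulation in NumPy: `simulate_pcm16_write` is a hypothetical helper that assumes the WAV writer clips floats to -1/+1 before converting them to 16-bit PCM, as described earlier in the thread, and the sine wave and the gain of 32768 are made up for illustration.

```python
import numpy as np


def simulate_pcm16_write(samples):
    """Hypothetical stand-in for a WAV writer: clip floats to [-1, 1], then scale to int16."""
    return np.round(np.clip(samples, -1.0, 1.0) * 32767).astype(np.int16)


sr = 16000
t = np.arange(sr) / sr
clean = 0.5 * np.sin(2 * np.pi * 440 * t)  # well-scaled float signal in [-1, 1]
loud = clean * 32768                       # same signal at an int16-like amplitude

# Written as float, almost every sample of the loud signal gets clipped.
clipped_fraction = np.mean(np.abs(loud) >= 1.0)
stored = simulate_pcm16_write(loud)

# Casting to int16 only "works" because the values happen to sit in the int16 range.
cast_loud = loud.astype(np.int16)
# The same cast destroys a correctly scaled signal: everything truncates to 0.
cast_clean = clean.astype(np.int16)

# Dividing by 32768 recovers a signal that can be written as float without clipping.
rescaled = loud / 32768

print(f"fraction of clipped samples: {clipped_fraction:.3f}")
print(f"non-zero samples after casting the clean signal: {np.count_nonzero(cast_clean)}")
```

Under these assumptions, the int16 cast is a workaround that depends on the scale the network happened to produce, which is why rescaling the float output is the more robust fix.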

staplesinLA commented 4 years ago

Thank you so much!!! @mpariente @jonashaag I'll close this for now and try to do a complete review.

mpariente commented 4 years ago

I don't think it learns this scale. The scale will be different for each training.

staplesinLA commented 4 years ago

I don't think it learns this scale. The scale will be different for each training.

Yes, I rephrased it to avoid being misleading. So it's better to normalize before generating the waveforms.

mpariente commented 4 years ago

Have a look at the file to see how we do it. We normalize the estimates to have the same amplitude as the mixture. It's not the best non-intrusive guess we can make, but it's better than -1/+1 normalization.
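One possible reading of "same amplitude as the mixture" is peak matching, sketched below in NumPy. This is an assumption for illustration and not necessarily what the referenced file does; `rescale_to_mixture`, the array shapes, and the synthetic signals are all hypothetical.

```python
import numpy as np


def rescale_to_mixture(estimates, mixture, eps=1e-8):
    """Rescale each estimate so its peak matches the mixture's peak.

    estimates: array of shape (n_src, time); mixture: array of shape (time,).
    Only the mixture is used as a reference, so no clean sources are needed.
    """
    mix_peak = np.max(np.abs(mixture))
    est_peaks = np.max(np.abs(estimates), axis=-1, keepdims=True)
    return estimates / (est_peaks + eps) * mix_peak


rng = np.random.default_rng(0)
sources = rng.standard_normal((2, 16000))
mixture = sources.sum(axis=0)
mixture = 0.8 * mixture / np.max(np.abs(mixture))  # mixture stored with a peak of 0.8

# Pretend the network returned the sources at an arbitrary, very large scale.
estimates = 5000.0 * sources
normalized = rescale_to_mixture(estimates, mixture)

print("peaks before:", np.max(np.abs(estimates), axis=-1))
print("peaks after: ", np.max(np.abs(normalized), axis=-1))
```

Because the mixture itself is a valid wav signal, estimates rescaled this way stay inside -1/+1 and can be written as float without clipping.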

staplesinLA commented 4 years ago

@mpariente Oh thanks, I was using the older scripts; I found it in the current version. Looks like I should pay more attention to the updates. Thanks again!!