What model are you training and what are the hyper params? Can you upload a few sound samples?
My guess is that the output amplitude is unconstrained and goes out of the -1/+1 range. Wav files are clipped above those values (and soundfile doesn't correct that, rightfully IMO). So you should rescale your audio outputs as done in eval.py,
for example: non-intrusive rescaling to match the amplitude of the mixture.
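For concreteness, here is a minimal sketch of that kind of rescaling (illustrative names and a toy signal only, not the actual eval.py code):

```python
import numpy as np
import soundfile as sf

def rescale_to_mixture(est, mix, eps=1e-8):
    """Scale the estimate so its peak matches the mixture's peak (non-intrusive)."""
    return est * np.max(np.abs(mix)) / (np.max(np.abs(est)) + eps)

# Toy stand-in for a model output whose amplitude blew up past [-1, 1].
t = np.arange(16000) / 16000
mix = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)
est = 20000.0 * mix  # unconstrained output, way outside [-1, 1]

# After rescaling, the estimate peaks at the mixture level, so no clipping on write.
sf.write("est1.wav", rescale_to_mixture(est, mix), 16000)
```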
> What model are you training and what are the hyper params? Can you upload a few sound samples?
This information would indeed help as well.
> What model are you training and what are the hyper params? Can you upload a few sound samples?
Thanks for helping!! I'm using Conv-TasNet on 16 kHz data. I tested an audio clip sampled from the training set to check the quantization, shown below:
The source data are all scaled to -1~1, and the estimation seems to prefer int16 values. I don't know why; maybe it's because of the SI-SNR loss?
> the estimation seems to prefer int16 values
The amplitude is unconstrained; this is a flaw of the SI-SNR loss, which is scale-invariant, so the absolute output level is never learned. The values are still float32, though. See the comment above to solve this issue.
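To make the scale-invariance concrete, here is a small plain-numpy illustration (not this repo's own loss implementation): multiplying the estimate by any constant leaves SI-SNR unchanged, so training never pushes the outputs toward the -1/+1 range.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Plain-numpy SI-SNR in dB (zero-mean, single channel)."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference; any global gain on `est`
    # scales both the target and the residual, so the ratio is unchanged.
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = alpha * ref
    noise = est - target
    return 10 * np.log10(np.dot(target, target) / (np.dot(noise, noise) + eps))

rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)
est = ref + 0.1 * rng.standard_normal(16000)

print(si_snr(est, ref))           # roughly 20 dB
print(si_snr(32768.0 * est, ref)) # identical value: the output level is free
```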
Also, @jonashaag probably meant for you to upload "sound samples" that we can listen to :wink:
By the way, are you integrating the audio samples into tensorboard?
> By the way, are you integrating the audio samples into tensorboard?
No, I open them in CoolEdit. I followed your suggestion and manually divided the outputs by 32768 to bring them into -1~1, then wrote them with soundfile, and they sound normal again. So it was wrong of me to convert them to int16 in the first place, right? Although they sounded fine for the first 20 epochs.
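For reference, a minimal version of that fix (the file name and stand-in signal are only illustrative); note that dividing by 32768 only works when the outputs really sit on the int16 scale, so the mixture-based rescaling mentioned above is the more robust option:

```python
import numpy as np
import soundfile as sf

# Stand-in for a model output whose float32 values span the int16 range.
rng = np.random.default_rng(0)
estimate = rng.uniform(-30000, 30000, 16000).astype(np.float32)

# Bring the values back into [-1, 1] before writing, instead of casting to int16.
sf.write("sep1.wav", estimate / 32768.0, 16000)
```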
Thank you so much!!! @mpariente @jonashaag I'll close this for now and try to do a more complete review.
I don't think it learns this scale. The scale will be different for each training run.
> I don't think it learns this scale. The scale will be different for each training run.
Yes, I've rephrased it to avoid being misleading. So it's better to normalize before writing the waveform files.
Have a look at the eval.py file to see how we do it. We normalize the estimates to have the same amplitude as the mixture. It's not the best non-intrusive guess we could make, but it's better than -1/1 normalization.
@mpariente Oh thanks, I was using an older version of the scripts; I found it in the current version. Looks like I should pay more attention to the updates. Thanks again!!
Hi everyone, first of all thanks for the remarkable program, it's great! Thanks for your efforts.
(1) When I listen to the generated audio, it sounds like a mess. I think soundfile.write writes the data directly as float32; after I change it to
sf.write('1.wav', estimate.astype(np.int16), 16000)
it gets back to normal.
(2) Another question: I found that the longer I train, the worse the listening quality gets. It's weird, because the training-loss and development-loss curves both look good. Listening to the audio, I find that the separated audio during the first 20 epochs is good and heading in the right direction. After that, the amplitude of the speech changes dramatically, often showing up as a swath in the waveform. I also see this on the training set; again, it's abnormal given that the loss is being optimized well at the same time.
Can anybody help with this? I will look into it more deeply myself though. Thanks!!!
Environment