Cortexelus opened this issue 6 years ago
The original SampleRNN also has this issue, though the amplitude normalization happens per-batch https://github.com/soroushmehr/sampleRNN_ICLR2017/issues/24
Interestingly, this became an issue for me when I introduced Batch Normalization to the network. Thanks for the hint!
Hey @Cortexelus, we applied the fix you proposed but are still seeing a DC offset. What other places could cause such behavior? Thanks in advance.
See https://github.com/wekaco/samplernn-pytorch/commit/0b8a43484c4b7aaa332748eb28718fa73488a96d
Problem
Amplitudes are min-max normalized for each audio example loaded from the dataset.
This is bad for three reasons:
First reason: DC offset. The normalization subtracts each example's minimum and divides by its new maximum. If the positive and negative peaks differ in magnitude, silence no longer maps to the middle value, so a DC offset is introduced into the audio.
Second reason: inconsistent silence. Each example has different peaks, so each example ends up with a different quantization value for silence.
Third reason: dynamics. If part of my dataset is soft, part is loud, and part is transitions between soft and loud, every example is normalized to loud, so SampleRNN will struggle to learn those transitions. If an [8-second] example is nearly silent, it gets amplified to full scale.
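The DC-offset problem above can be seen with a tiny sketch (my own illustration, not the repo's code): a zero-centered example whose positive and negative peaks differ gets silence pushed away from zero by min-max scaling.

```python
import numpy as np

# Hypothetical sketch of the per-example normalization described above:
# subtract the minimum, divide by the new maximum, rescale to [-1, 1].
def minmax_normalize(x):
    x = x - x.min()
    x = x / x.max()
    return x * 2 - 1

# An example whose positive and negative peaks differ: +0.75 vs -0.25.
samples = np.array([0.0, 0.75, 0.0, -0.25, 0.0])

norm = minmax_normalize(samples)
# Silence (0.0) no longer maps to 0 -- it becomes -0.5, a DC offset.
print(norm)  # [-0.5  1.  -0.5 -1.  -0.5]
```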
I think the only acceptable amplitude normalization is over the entire dataset, and you could do that [with ffmpeg] when creating the dataset.
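A dataset-wide normalization could look like the following sketch (my own, with file I/O elided; each array stands in for one loaded example): compute one global peak and apply the same gain everywhere, so relative dynamics between examples survive.

```python
import numpy as np

# Hedged sketch: one gain for the whole dataset instead of per-example scaling.
dataset = [
    np.array([0.1, -0.05, 0.1]),  # a quiet example
    np.array([0.8, -0.9, 0.5]),   # a loud example
]

# Single peak over every example in the dataset.
global_peak = max(np.abs(x).max() for x in dataset)

# Same gain applied to all examples: silence stays at 0 (no DC offset),
# and the quiet example stays quiet relative to the loud one.
normalized = [x / global_peak for x in dataset]
print([float(np.abs(x).max()) for x in normalized])
```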
The normalization happens in `linear_quantize`; audio is normalized upon loading. As a result, `linear_dequantize(linear_quantize(samples)) != samples`.
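A minimal reconstruction of the problem (the function names come from the issue, but the bodies below are my assumption of the behavior, not the repo's exact code): because `linear_quantize` rescales each example before quantizing, `linear_dequantize` cannot undo it.

```python
import numpy as np

Q_LEVELS = 256  # 8-bit quantization, a common SampleRNN setting

def linear_quantize(samples):
    samples = samples - samples.min()  # per-example normalization...
    samples = samples / samples.max()  # ...bakes the example's peaks in
    return (samples * (Q_LEVELS - 1)).astype(np.int64)

def linear_dequantize(quantized):
    # Maps levels back to [-1, 1] with no knowledge of the original peaks.
    return quantized.astype(np.float64) / (Q_LEVELS - 1) * 2 - 1

samples = np.array([0.0, 0.75, 0.0, -0.25, 0.0])
roundtrip = linear_dequantize(linear_quantize(samples))
print(np.allclose(roundtrip, samples))  # False: the round trip is lossy
```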
Solution
Don't normalize in `linear_quantize`.
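One way the fix could look (a sketch under the assumption that audio has already been normalized once, dataset-wide, at preprocessing time, so it arrives in [-1, 1]): quantize with a fixed mapping and no per-example rescaling, making the round trip exact up to quantization error.

```python
import numpy as np

Q_LEVELS = 256

def linear_quantize(samples):
    # Fixed map [-1, 1] -> {0, ..., Q_LEVELS - 1}; no per-example scaling.
    samples = (samples + 1) / 2
    levels = (samples * (Q_LEVELS - 1)).round()
    return np.clip(levels, 0, Q_LEVELS - 1).astype(np.int64)

def linear_dequantize(quantized):
    return quantized.astype(np.float64) / (Q_LEVELS - 1) * 2 - 1

samples = np.array([0.0, 0.75, -0.25, 1.0, -1.0])
roundtrip = linear_dequantize(linear_quantize(samples))
# Round-trip error is now bounded by half a quantization step
# (one step spans 2 / (Q_LEVELS - 1) of the [-1, 1] range).
print(np.abs(roundtrip - samples).max() <= 1.0 / (Q_LEVELS - 1) + 1e-12)
```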