Cortexelus opened this issue 6 years ago
The original SampleRNN also has this issue, though the amplitude normalization happens per-batch https://github.com/soroushmehr/sampleRNN_ICLR2017/issues/24
Interestingly, this became an issue for me when I introduced Batch Normalization to the network. Thanks for the hint!
Hey @Cortexelus, we applied the fix you proposed but are still seeing a DC offset. What other places could cause such behavior? Thanks in advance.
See https://github.com/wekaco/samplernn-pytorch/commit/0b8a43484c4b7aaa332748eb28718fa73488a96d
Problem
Amplitudes are min-max normalized for each audio example loaded from the dataset.
This is bad for three reasons:
First reason: DC offset. The normalization subtracts each example's minimum and divides by its new maximum. If the positive and negative peaks differ in magnitude, silence no longer maps to the middle value, so a DC offset is introduced into the audio.
Second reason: inconsistent silence. Each example has different peaks, so each example ends up with a different quantization value for silence.
Third reason: dynamics. If part of my dataset is soft, part is loud, and part is transitions between soft and loud, every example is normalized to loud, so SampleRNN will struggle to learn those transitions. If an [8-second] example is nearly silent, it gets amplified to full scale.
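The DC-offset problem above can be seen with a tiny sketch (my own illustration, not the repo's code): a zero-centered example whose positive and negative peaks differ gets silence pushed away from zero by min-max scaling.

```python
import numpy as np

# Hypothetical sketch of the per-example normalization described above:
# subtract the minimum, divide by the new maximum, rescale to [-1, 1].
def minmax_normalize(x):
    x = x - x.min()
    x = x / x.max()
    return x * 2 - 1

# An example whose positive and negative peaks differ: +0.75 vs -0.25.
samples = np.array([0.0, 0.75, 0.0, -0.25, 0.0])

norm = minmax_normalize(samples)
# Silence (0.0) no longer maps to 0 -- it becomes -0.5, a DC offset.
print(norm)  # [-0.5  1.  -0.5 -1.  -0.5]
```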
I think the only acceptable amplitude normalization is over the entire dataset, and you could do that [with ffmpeg] when creating the dataset.
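A dataset-wide normalization could look like the following sketch (my own, with file I/O elided; each array stands in for one loaded example): compute one global peak and apply the same gain everywhere, so relative dynamics between examples survive.

```python
import numpy as np

# Hedged sketch: one gain for the whole dataset instead of per-example scaling.
dataset = [
    np.array([0.1, -0.05, 0.1]),  # a quiet example
    np.array([0.8, -0.9, 0.5]),   # a loud example
]

# Single peak over every example in the dataset.
global_peak = max(np.abs(x).max() for x in dataset)

# Same gain applied to all examples: silence stays at 0 (no DC offset),
# and the quiet example stays quiet relative to the loud one.
normalized = [x / global_peak for x in dataset]
print([float(np.abs(x).max()) for x in normalized])
```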
The normalization happens in `linear_quantize`; audio is normalized upon loading. As a result, `linear_dequantize(linear_quantize(samples)) != samples`.
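A minimal reconstruction of the problem (the function names come from the issue, but the bodies below are my assumption of the behavior, not the repo's exact code): because `linear_quantize` rescales each example before quantizing, `linear_dequantize` cannot undo it.

```python
import numpy as np

Q_LEVELS = 256  # 8-bit quantization, a common SampleRNN setting

def linear_quantize(samples):
    samples = samples - samples.min()  # per-example normalization...
    samples = samples / samples.max()  # ...bakes the example's peaks in
    return (samples * (Q_LEVELS - 1)).astype(np.int64)

def linear_dequantize(quantized):
    # Maps levels back to [-1, 1] with no knowledge of the original peaks.
    return quantized.astype(np.float64) / (Q_LEVELS - 1) * 2 - 1

samples = np.array([0.0, 0.75, 0.0, -0.25, 0.0])
roundtrip = linear_dequantize(linear_quantize(samples))
print(np.allclose(roundtrip, samples))  # False: the round trip is lossy
```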
Solution
Don't normalize in `linear_quantize`.
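One way the fix could look (a sketch under the assumption that audio has already been normalized once, dataset-wide, at preprocessing time, so it arrives in [-1, 1]): quantize with a fixed mapping and no per-example rescaling, making the round trip exact up to quantization error.

```python
import numpy as np

Q_LEVELS = 256

def linear_quantize(samples):
    # Fixed map [-1, 1] -> {0, ..., Q_LEVELS - 1}; no per-example scaling.
    samples = (samples + 1) / 2
    levels = (samples * (Q_LEVELS - 1)).round()
    return np.clip(levels, 0, Q_LEVELS - 1).astype(np.int64)

def linear_dequantize(quantized):
    return quantized.astype(np.float64) / (Q_LEVELS - 1) * 2 - 1

samples = np.array([0.0, 0.75, -0.25, 1.0, -1.0])
roundtrip = linear_dequantize(linear_quantize(samples))
# Round-trip error is now bounded by half a quantization step
# (one step spans 2 / (Q_LEVELS - 1) of the [-1, 1] range).
print(np.abs(roundtrip - samples).max() <= 1.0 / (Q_LEVELS - 1) + 1e-12)
```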