fix dithering when exporting to float

faroit commented 3 years ago

In the last version of stempeg the method to load PCM through python-ffmpeg resulted in small differences to what I obtained with the previous version of stempeg.

This PR reverts the casting procedure for int16 to float32 + normalization [-1, 1]. This is important since dithering errors did significantly influence separation scores in regression tests.

With this PR, ffmpeg pipes int16 into the numpy buffer and is converted and normalized to float in numpy. This got the same results as I used to have before where I used temporarly wav files, converting to float32 using soundfile. Furthermore, when using int16 pipes, conversion is slightly faster.

Also ping @romi1502 and @mmoussallam since this function originally derived from spleeters code. You might want to change it there as well.

romi1502 commented 3 years ago

Hi @faroit, sorry for the long delay before reacting. I don't get why you want to convert to int16 before converting back to float32. This will possibly results in a loss of precision (for instance if you try to load a 24 bit PCM file) and moreover I think that dithering will be only applied by ffmpeg if the signal is converted to 16 bits after a computation (such as resampling or mp3 decoding) at a higher precision (most likely float32). The only case where it could make sense to convert in int16 is when there is no such computation, which means when you load a 16 bits PCM file at its original sample rate. From the test I did, in this case, both piping float32 from ffmpeg and piping int16/converting to float32 gives exactly the same result (no errors between both signals).

faroit commented 3 years ago

Salut Romain

i needed to merge this in quickly so that my regression tests continue to pass. I would love to find a better solution for this problem.

I don't get why you want to convert to int16 before converting back to float32. This will possibly results in a loss of precision (for instance if you try to load a 24 bit PCM file) and moreover I think that dithering will be only applied by ffmpeg if the signal is converted to 16 bits after a computation (such as resampling or mp3 decoding) at a higher precision (most likely float32).

i don't think the common use-case would result in loss of precision. 99% of the users audio is saved as PCM16 and will be converted to float32 or float64.
you are right about the dithering. I think the issue is that I previously used soundfile to convert int to float and they use a different standardization than ffmpeg, hence there is an amplitude offset of one sample... you know, this 32767 vs 32768 thing ;-)

From the test I did, in this case, both piping float32 from ffmpeg and piping int16/converting to float32 gives exactly the same result (no errors between both signals).

thats expected, given that its both using ffmpegs conversions. Can you test with using soundfile or scipy.io in comparison?

btw. the reason this has significant impact on separation results is that the MUSDB18 stems are compressed individually to mp4. With slight decoding differences between references and estimates, bsseval will scores this badly. There is no problem however, when both, the references and the estimates use the same way to decode. But my regression tests also saved the decoded wav files, so I have to comply ;-)

romi1502 commented 3 years ago

i don't think the common use-case would result in loss of precision. 99% of the users audio is saved as PCM16 and will be converted to float32 or float64.

Well, don't tell that to an audiophile ;). Yes, I agree that most end-consumer audio will be PCM16 (while some production material may be at a higher precision). Still I think that keeping the data pipeline the most general and the least affecting the original data is better.

you know, this 32767 vs 32768 thing ;-)

Unfortunately, I know :(. But I don't have this difference in the latest version of sndfile (with the tests I did).

thats expected, given that its both using ffmpegs conversions. Can you test with using soundfile or scipy.io in comparison?

When only loading is done (no resampling), all methods (ffmpeg float32, ffmpeg int16, scipy.io.wavefile, sndfile) give the exact same results (up to the dtype) from a 16 bits PCM wavefile. As stated in the previous post, differences only appears when some processing (decoding and/or resampling) is done in ffmpeg in float32 and the results is piped in int16 (after dithering).

With slight decoding differences between references and estimates, bsseval will scores this badly

Well, this is one of the reasons why I've never liked much bsseval :). Too bad this is still the best way to evaluate source separation.

But my regression tests also saved the decoded wav files, so I have to comply ;-)

Ok, I get it.

faroit commented 3 years ago

@romi1502 okay, I will give this another try to make it work with float32. Can you provide me with your test scripts?

romi1502 commented 3 years ago

Here is the script I used:


import os

import ffmpeg
import numpy as np
import scipy.io.wavfile
import soundfile as sf

path = "/path/to/audio.wav"

def load_audio_ffmpeg(path, dtype=np.float32, n_channels=2):
    """
        load audio file from ffmpeg
    """
    if dtype == np.float32:
        ffmpeg_format = "f32le"
        numpy_dtype = '<f4'
    elif dtype == np.int16:
        ffmpeg_format = "s16le"
        numpy_dtype = '<i2'
    else:
        raise ValueError(f"Unknown dtype: {dtype}")

    output_kwargs = {'format': ffmpeg_format}
    process = (
        ffmpeg.input(path)
              .output('pipe:', **output_kwargs)
              .run_async(pipe_stdout=True, pipe_stderr=True))

    buffer, _ = process.communicate()
    waveform = np.frombuffer(buffer, dtype=numpy_dtype).reshape(-1, n_channels)
    return waveform

def s16_to_f32(waveform_s16):
    """ convert int16 waveform to normalized float32 waveform """
    return waveform_s16.astype(np.float32)/np.float32(32768)

waveform_s16 = load_audio_ffmpeg(path, dtype=np.int16)
fs, waveform_sio = scipy.io.wavfile.read(path)

waveform_f32 = load_audio_ffmpeg(path, dtype=np.float32)
waveform_sndfile, samplerate = sf.read(path)

print("Average absolute error between ffmpeg int16 and sndfile loading:")
print(np.abs(s16_to_f32(waveform_s16) - waveform_sndfile).mean())
print("Average absolute error between ffmpeg float32 and sndfile loading:")
print(np.abs(waveform_f32 - waveform_sndfile).mean())
print("Average absolute error between scipy.io and sndfile loading:")
print(np.abs(s16_to_f32(waveform_sio) - waveform_sndfile).mean())

On a stereo 16bits PCM wavefile, all absolute errors are 0.

faroit commented 3 years ago

@romi1502 thanks for the script. I can confirm that these yield to the exact same results, however, when testing on MP3, AAC, the results are different (See examples here). Now, i don't know enough about codecs, but i guess this is expected?

For me (and MUSDB18, musdb and museval) it sadly means that users would get different scores when doing this conversion differently. I guess what I would do now is to revert this pull request so that you can utilize stempeg in spleeter but addtionally allow to pass "compatibility" flag to convert through int16 as intermediate format.

Any other idea to solve this?

faroit / stempeg

fix dithering when exporting to float #39