Downmixing to mono behaves differently depending on whether ffmpeg is used for audio loading

Expected behaviour

When loading a stereo audio file and downmixing it to mono, I expect the resulting amplitudes to not depend on the audio file format, but only on the content.

Actual behaviour

Currently, if a wave file has the same sample type as the one requested when loading, madmom uses scipy to load it; then, to downmix the signal to mono, it uses its own madmom.audio.signal.remix function, which computes the arithmetic mean of the channels.

If there is a mismatch in sample types (e.g. the file is stored as float32 but loaded as float, or stored as 16-bit integers and loaded as float), madmom uses ffmpeg to load the file and, in the same step, to downmix it to mono.

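Now, the downmixing logic of ffmpeg apparently does not take the arithmetic mean: its output is larger than the mean of the channels by a factor of 2 / sqrt(2) = sqrt(2) ≈ 1.414. As a result, the same file yields different amplitudes depending on which loading path is taken.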
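
The discrepancy can be illustrated without any audio file. The following is a minimal NumPy sketch, assuming the ffmpeg path effectively computes (L + R) / sqrt(2), which is what the measured ratio suggests; I have not verified this against the ffmpeg source. The function names downmix_mean and downmix_sqrt2 are hypothetical and not part of madmom or ffmpeg.

```python
import numpy as np


def downmix_mean(stereo):
    """Downmix by taking the arithmetic mean of the channels."""
    return stereo.mean(axis=1)


def downmix_sqrt2(stereo):
    """Downmix by summing the channels and scaling by 1 / sqrt(2).

    Assumption: this models what the ffmpeg loading path appears to do.
    """
    return stereo.sum(axis=1) / np.sqrt(2)


# Synthetic stereo signal with shape (num_samples, num_channels)
rng = np.random.default_rng(0)
stereo = rng.uniform(-0.5, 0.5, size=(1000, 2)).astype(np.float32)

mono_mean = downmix_mean(stereo)
mono_sqrt2 = downmix_sqrt2(stereo)

# Same comparison as in the reproduction steps, on synthetic data
ratio = np.nanmedian(mono_mean / mono_sqrt2)
print(ratio)
```

Under this assumption the two downmixes differ only by a constant gain, so the median ratio should match the value observed with chirp.wav.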

Steps needed to reproduce the behaviour

import madmom
import numpy as np

# chirp.wav is stored as stereo 32-bit float
read_wave = madmom.io.load_audio_file('chirp.wav', num_channels=1, dtype=np.float32)[0]
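# float (i.e. float64) differs from the stored float32, so ffmpeg is used;
# the np.float alias was removed in NumPy 1.24, the builtin float is equivalent
read_ffmpeg = madmom.io.load_audio_file('chirp.wav', num_channels=1, dtype=float)[0]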

print(np.nanmedian(read_wave / read_ffmpeg))  # 0.7071...
print(np.nanmedian((2 * read_wave / np.sqrt(2)) / read_ffmpeg))  # 1.0

Information about installed software

madmom master branch
ffmpeg version 4.4.2-0ubuntu0.22.04.1
