breakfastquay / rubberband

Official mirror of Rubber Band Library, an audio time-stretching and pitch-shifting library.
http://breakfastquay.com/rubberband/
GNU General Public License v2.0

Realtime mode results in distortions, maybe clipping #84

Open bergfried opened 1 year ago

bergfried commented 1 year ago

First of all, thank you for this awesome library. The new bugfix release (3.2.1) greatly improved the quality! However, I had hoped that it would also fix another bug I just noticed. Unfortunately, it is still present (it was already present in 3.1.2, if I inferred that correctly from my package cache).

When using realtime mode (-R), the output is not identical to the input even when the parameters are set such that it should be (ignoring rounding errors). Consider the following:

(The sampling frequency does not matter much. However, some distortions are more pronounced at a suitably low sampling frequency, which makes them audible, at least if you listen carefully.)

I analyzed the result using Tenacity, subtracting the input version from the output version by adding its inverse. The differences between the processed and the original version I noticed are:

Note that if I do not use realtime mode (by not using -R), the result is as expected, i.e. the difference between input and output is close to silence. (Actually, in this case, it is not perfect silence, which would be best, of course, but that is a minor issue compared to what this bug report is about. I can imagine, though, that even these tiny differences will disappear as soon as this bug is fixed.)
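The subtraction test described above can also be done outside an audio editor. The following is a minimal sketch, not the procedure actually used in this report: it assumes 16-bit PCM WAV files that are already sample-aligned (no latency offset between input and output), and the helper names `read_pcm16`, `residual` and `peak_dbfs` are made up for illustration.

```python
import math
import struct
import wave


def read_pcm16(path):
    """Read a 16-bit PCM WAV file; return (sample_rate, samples scaled to [-1, 1))."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())
    count = len(frames) // 2
    ints = struct.unpack("<%dh" % count, frames)
    return rate, [v / 32768.0 for v in ints]


def residual(original, processed):
    """Subtract the original from the processed signal over their common length."""
    n = min(len(original), len(processed))
    return [processed[i] - original[i] for i in range(n)]


def peak_dbfs(samples):
    """Peak level in dB relative to full scale; -inf for perfect silence."""
    peak = max((abs(s) for s in samples), default=0.0)
    return -math.inf if peak == 0.0 else 20.0 * math.log10(peak)
```

Reading both files with `read_pcm16`, passing the sample lists to `residual` and then to `peak_dbfs` gives one number per run. A transparent pass would give a strongly negative value (or `-inf`), while audible distortion shows up as a residual peak much closer to 0 dBFS.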

Since rubberband --full-help does not say anything about it, I consider this a bug. Also, unfortunately, it makes rubberband unsuitable for cases where realtime mode is necessary, for example, when using it for pitch-correct time scaling in mpv:

mpv --af-add=rubberband input.wav

As described above, the distortions during playback are present even if there is no time scaling or pitch shifting taking place.

Please keep up the good work!

cannam commented 1 year ago

This is strictly speaking a limitation of the API.

The underlying cause is the need to avoid audible discontinuities when the stretch or shift factor is changed during live processing, specifically in the case where it is changed to 1.0 or from 1.0 to something else.

In the R2 engine, pure time-stretching at a factor of 1.0 does essentially nothing and there is no issue moving from 1.0 to some other factor in time only. But pitch-shifting is an issue, which is why the OptionPitchHighConsistency flag is provided. Without it, there is a discontinuity when moving up or down from 1.0. With it, an additional resampler pass is required even at exactly 1.0.

In the R3 engine there is an additional multi-channel frequency handling layer. This can cause audible artifacts when enabled or disabled, so in real-time mode it is enabled always, even at 1.0. This is not usually audible with music, but can be observed with a test signal as you have found.

Ideally the channel handling would be totally transparent, and I would like to improve it in that direction if possible. Failing that, there probably ought to be another option, analogous to the OptionPitch set, that determines whether to keep processing active even at 1.0 or to permit artifacts when switching the time ratio away from 1.0.

Of course we do also have far too many options already...

bergfried commented 1 year ago

Thank you for your reply, but, admittedly, I am not really satisfied.

> In the R3 engine there is an additional multi-channel frequency handling layer. This can cause audible artifacts when enabled or disabled, so in real-time mode it is enabled always, even at 1.0. This is not usually audible with music, but can be observed with a test signal as you have found.

Maybe it is just me, but the issue is very noticeable not only in artificial test files but also in real-world audio, music and especially singing. Just pick some real-world music with a singing voice, convert it to a 24000 Hz sampling rate and process it with rubberband and the set of switches mentioned above, i.e.:

zresample --rate 24000 --wav --16bit --rec audio.wav audio-converted-24000.wav &&
rubberband -3 -R -f1 audio-converted-24000.wav audio-converted-24000-stretched.wav

Note the (probably audible) difference between audio-converted-24000.wav and audio-converted-24000-stretched.wav.

Also note that the distortions are very noticeable even for mono input, so I do not see how the "additional multi-channel frequency handling layer" is supposed to cause this issue in those cases.

Apparently, one can avoid at least some audible distortions by upsampling beforehand, e.g.:

zresample --rate 48000 --wav --16bit --rec audio-converted-24000.wav audio-reconverted-48000.wav &&
rubberband -3 -R -f1 audio-reconverted-48000.wav audio-reconverted-48000-stretched.wav

That is, the sampling rate presented to rubberband seems to have a quite noticeable effect on its output even if, signal-wise, the input stays more or less the same. This is the main reason why I think that there is a bug.

Anyway, I ran a few experiments.

First, I tried upsampling input.wav (the 24000 Hz sampling rate chirp file created in the first post) to 48000 Hz before processing. As expected, the audible distortions were reduced, but, surprisingly, the peak near 1.7 seconds is more pronounced. So I tried upsampling to 192000 Hz, and the peak reaches even higher.

At that point, I started suspecting something like accumulated rounding errors: the higher the sampling rate, the more errors there are to accumulate. Or maybe aliasing that depends on the sampling rate, who knows. So I tried something more interesting: upsampling by a factor other than 2.

zresample --rate 44100 --wav --16bit --rec input.wav input-converted-44100.wav &&
rubberband -3 -R -f1 input-converted-44100.wav input-converted-44100-stretched.wav

As suspected, something interesting happened: The most pronounced peak moved! The overall shape (waveform) changed as well, and the other peaks appear less pronounced. What is going on here?
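To compare where the strongest peak lands across runs at different sampling rates, it helps to express its position in seconds rather than in samples. The sketch below assumes the peaks are read off a mono difference signal held as a list of floats; the function name `peak_time` is hypothetical and not part of any tool mentioned here.

```python
def peak_time(samples, sample_rate):
    """Return (time in seconds, magnitude) of the largest-magnitude sample.

    Assumes a mono signal; on ties the earliest sample wins.
    """
    if not samples:
        raise ValueError("empty signal")
    index = max(range(len(samples)), key=lambda i: abs(samples[i]))
    return index / sample_rate, abs(samples[index])
```

Running this on the difference signals of the 24000 Hz, 44100 Hz and 48000 Hz runs would show directly whether the peak stays at the same time position or moves.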

I also had a look at the spectrograms. I visually inspected the spectrograms of all audio files I tested (both real-world audio as well as artificial test files) and compared them with their respective rubberband-processed versions. The following pattern appears to emerge:

What do you think? Is there anything new to you? If you don't see any flaw that might point to a bug, I would be glad if you could explain the observations I made. Just curious. (But if it happens to help finding hidden bugs, that would be awesome!) Feel free to ask if you have any questions.

cannam commented 1 year ago

> Thank you for your reply, but, admittedly, I am not really satisfied.

Well, I am not trying to claim the situation is ideal, just that it is a logical compromise. Time-stretching introduces artifacts. Ideally those artifacts would disappear entirely at a ratio of 1.0. Also ideally, switching from 1.0 to another ratio during playback should be "noiseless" (i.e. introduce no clicks, or other artifacts that are not there during fixed-ratio processing).

For a variable-rate timestretcher it's possible to make a case for the latter requirement as more important than the former (within reason). If you want a 1.0 ratio most of the time and don't mind a click when switching away from 1.0, you can either use offline mode, or bypass the timestretcher when the ratio is 1.0. Whereas if the timestretcher itself clicks when switching to or from 1.0, there is nothing that you as the user can do about it.
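The bypass approach mentioned above can be sketched as a thin wrapper around the stretcher. This is an illustration only: `BypassingStretcher`, its `process` signature and the `FakeStretcher` stand-in are assumptions made for this example and are not the Rubber Band API. The sketch also ignores the stretcher's latency, so switching in or out of bypass can click, which is exactly the trade-off described.

```python
class BypassingStretcher:
    """Pass audio through untouched while both ratios are exactly 1.0."""

    def __init__(self, stretcher):
        # Hypothetical stretcher object exposing
        # process(block, time_ratio, pitch_scale) -> list of samples.
        self.stretcher = stretcher
        self.bypassed_blocks = 0

    def process(self, block, time_ratio=1.0, pitch_scale=1.0):
        if time_ratio == 1.0 and pitch_scale == 1.0:
            # Bit-exact output, but the stretcher's internal state is not
            # kept warm, so the transition out of bypass may be audible.
            self.bypassed_blocks += 1
            return list(block)
        return self.stretcher.process(block, time_ratio, pitch_scale)


class FakeStretcher:
    """Stand-in that only makes the sketch runnable: it halves every sample."""

    def process(self, block, time_ratio, pitch_scale):
        return [0.5 * s for s in block]
```

With `BypassingStretcher(FakeStretcher())`, a block processed at the default ratios comes back unchanged, while any other ratio is routed through the wrapped object.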

As I said, this is an area I'd like to improve in future releases - ideally by improving the method so as to satisfy both requirements, but at least with an option. It isn't an issue with the R2 stretcher for various reasons, but the cause of this behaviour with R3 is essentially connected with why R3 usually sounds better in the first place.

> convert it to 24000 Hz sampling rate

The API docs note that 44.1 or 48 kHz are the intended rates for Rubber Band; other rates "should produce sensible output" but are not advised. Perhaps the command-line tool should also say this.

> Also note that the distortions are very noticeable even for mono input, so I do not see how the "additional multi-channel frequency handling layer" is supposed to cause this issue in those cases.

Frequency channels, not audio file channels (sorry! that was confusing). The artifacts you're observing are around channel boundaries. The precise boundary frequencies may change depending on the content of the signal.