FOSDEM / video


Audio loudness normalization #181

Open ghosty-be opened 3 years ago

ghosty-be commented 3 years ago

A frequently heard complaint was that talks were too quiet and the Q&A following them too loud ... and there is a wide variety in volume across all the talks...

indiscipline commented 3 years ago

First of all, equalisation is a specific term for manipulating the frequency content of a sound. The issue described here is called loudness normalization. The issue is due for renaming.

While this is a great goal, it's rather hard to enforce live. The speakers would have to have a reliable way of measuring their output loudness. The suitable measurement units are short-term LUFS or RMS, but not every piece of software provides meters for those.
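For what it's worth, ffmpeg ships a free meter in the form of its ebur128 filter; a sketch (input file name hypothetical) for checking a test recording:

```sh
# Logs momentary (M), short-term (S), integrated (I) loudness and LRA as it plays through.
ffmpeg -nostats -i test-recording.wav -af ebur128 -f null -

# Or rendered as an on-screen meter (per the filter's documented usage):
ffplay -f lavfi -i "amovie=test-recording.wav,ebur128=video=1:meter=18 [out0][out1]"
```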

yoe commented 3 years ago

So, SReview (the tool that we use for postprocessing and transcoding videos, and which we repurposed this year to also handle upload and preprocessing) actually has built-in support for loudness normalization, using bs1770gain. However, that software mangles the audio in more ways than just "loudness normalization", which was causing bugs in SReview (apart from it being written by a right-wing extremist nazi), so it was disabled for the upload processing for FOSDEM 2021.

I recently implemented loudness normalization using the ffmpeg loudnorm filter, which should allow me to disable the bs1770gain implementation; so if FOSDEM 2022 is still going to be online (in whole or in part), this issue should be fixed.
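For the record, the two-pass pattern with loudnorm looks roughly like this (a sketch; file names and the measured values fed to pass 2 are placeholders, and -23 LUFS matches the EBU R.128 target):

```sh
# Pass 1: analyse only. loudnorm prints its measurements as a JSON block on stderr.
ffmpeg -i talk.mp4 -af loudnorm=I=-23:TP=-1:LRA=11:print_format=json -f null -

# Pass 2: feed the measured values back in so loudnorm can apply a single linear gain.
ffmpeg -i talk.mp4 -c:v copy \
  -af loudnorm=I=-23:TP=-1:LRA=11:linear=true:measured_I=-27.2:measured_TP=-5.1:measured_LRA=9.8:measured_thresh=-37.4 \
  talk-normalized.mp4
```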

indiscipline commented 3 years ago

The issue with loudnorm is that it doesn't guarantee linear normalization even in double-pass mode. I've bumped into this with my ffmpeg-loudnorm-helper thingy. For some dynamic content, loudnorm doesn't apply compression/limiting to fit the peaks into the required range and falls back to dynamic mode, which sometimes results in sudden jumps in loudness and overall inferior sound compared to a proper chain of compression and loudness normalization.
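If I'm not mistaken, the fallback is at least detectable: the JSON that loudnorm prints includes a normalization_type field, so a pass-2 dry run can tell you whether it stayed linear (measured values here are placeholders):

```sh
# Reports "normalization_type" : "dynamic" when loudnorm gave up on linear mode.
ffmpeg -i talk.mp4 \
  -af loudnorm=I=-23:TP=-1:LRA=11:linear=true:measured_I=-27.2:measured_TP=-5.1:measured_LRA=9.8:measured_thresh=-37.4:print_format=json \
  -f null - 2>&1 | grep -i normalization_type
```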

yoe commented 3 years ago

Darn, didn't know that (I only just wrote the loudnorm-based normalization). In that case I suppose we're not there yet :-/

Do you have any better suggestions for implementing audio loudness normalization?

indiscipline commented 3 years ago

This will probably be a long comment.

The thing is, for the best results you need to control both the dynamic range and the final loudness of the content. Doing it by hand is super easy thanks to the multiple meters and visual cues available, and usually takes a few back-and-forth passes. Also, standard tools process audio in 32-bit floating point, which eliminates the issue of clipping at the intermediate stages.

The dumb approach could be something along the lines of the following algorithm, which loosely resembles the manual routine (a rough ffmpeg sketch follows the list):

  1. Determine the average loudness and apply gain to normalize it to a baseline.
  2. Apply gentle (low-ratio, 1.2:1 to 2:1) compression with a slow release, a wide knee, and the threshold set to catch excessively loud sections of audio. This works at a time resolution of seconds.
  3. Apply stronger (2:1 to 6:1) compression with a faster attack and faster release, but with the threshold set about 50% higher (closer to 0 dB) to catch loud sounds. Time resolution of 50 ms to 2 s.
  4. Apply linear normalization to a given LUFS level.
  5. Apply brickwall limiting to prevent clipping, with the threshold no higher than -1 dB. This stage deals with the smallest time resolution.
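As an ffmpeg sketch of steps 2-5 (step 1 needs a separate measurement pass; every threshold, ratio and timing here is an illustrative guess, not a tuned value):

```sh
# Step 2: gentle, slow compression; step 3: stronger, faster, higher threshold.
# acompressor thresholds are linear amplitude: 0.063 ~ -24 dBFS, 0.158 ~ -16 dBFS.
# Step 4: loudnorm (single-pass as written here is dynamic; a two-pass run keeps it linear).
# Step 5: brickwall limiter at ~ -1 dBFS (0.891 linear).
ffmpeg -i talk.mp4 -c:v copy -af "\
acompressor=threshold=0.063:ratio=1.5:attack=200:release=2000:knee=8,\
acompressor=threshold=0.158:ratio=4:attack=20:release=250:knee=4,\
loudnorm=I=-23:TP=-1.5:LRA=11,\
alimiter=limit=0.891:attack=5:release=50" talk-processed.mp4
```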

Assumptions:

  1. No clipping can happen (32-bit float).
  2. The audio wasn't dynamically processed during recording.

To deal gracefully with the second assumption, the tool could track a few moving averages (akin to short-term LUFS) on the time scales of points 2 and 3; if none of them cross some predetermined levels, the audio already fits into the required dynamic-range corridor and the appropriate processing stage can be skipped.

This routine is based on the approach described in my article, which I proposed as a go-to reference for voice processing for FOSDEM: https://indiscipline.github.io/post/voice-sound-reference/#strategies-for-applying-processing The main idea is to deal with macro- and microdynamics sequentially and then set the final loudness.

Manual processing has the benefit of not necessarily relying on compression for dealing with dynamics in stages 2 and 3, as an engineer can clearly see the portions of the audio which fall below/above the average and adjust the gain accordingly. This is what loudnorm seems to be trying to simulate, but the volume swings are often unnatural and the timing of the gain adjustments is unreliable. Tracking loudness state ("the average volume shifted", i.e. the distance to the microphone changed; "short volume outlier", i.e. a loud phrase, an exciting moment, etc.; "loud sounds happening": laughter, coughing, dropping things, etc.) and applying simple gain corrections based on it would be preferable to relying on compression alone, but this requires a much more complex solution.

UPD: Proper dynamic processing units allow reacting to loudness measured not only in peak but also in RMS, which in a way gets one closer to the state tracking I described above. You can set up multiple such units to process audio at different time resolutions.

Also, I still suggest renaming the issue.

indiscipline commented 3 years ago

Ah, another addition. My previous post is all about post-processing, which is suboptimal compared to properly adjusting the sound at the recording end. If the settings were off and the sound was recorded distorted, or too noisy, or mangled by overeager noise reduction or abysmal codecs, then there's almost nothing you can do about it in post-processing. Preparing a standard protocol for setting things up for recording can go a long way.

yoe commented 3 years ago

Yeah, okay, afraid that sounds a bit too complex...

My problem is that I'm dealing with audio which can be literally anything:

  1. Video that was pre-recorded by the speaker, who doesn't know what they're doing, but at least used a headset or something, so the sound levels are consistent;
  2. Live recordings with terrible mics and people being all over the place;
  3. Live recordings with proper mics and an audio tech who knows what they're doing, which are pretty much already OK by the time they're given to SReview

and I want something "reasonable" to roll out. For each case, that would be:

  1. Adjust the gain linearly over the entire file so that things end up at the correct level
  2. Probably do dynamic loudness normalization
  3. Probably do "nothing", or very close to it, so that the hard work of the audio tech is not wasted (at worst, some fine-tuning so that we end up as close to -23 LUFS as we can get)

Without any manual work (because that's the whole design goal of SReview: "do as much automated as possible")
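A sketch of that three-way decision, assuming loudnorm's measured input_lra is a usable proxy for which case a file falls into (the 15 LU cutoff, the file names and the sed extraction of the JSON block are guesses, not tested values):

```sh
#!/bin/sh
set -eu
in="$1"; out="$2"

# Measurement pass: loudnorm prints a JSON block on stderr.
json=$(ffmpeg -hide_banner -i "$in" \
         -af loudnorm=I=-23:TP=-1:LRA=11:print_format=json \
         -f null - 2>&1 | sed -n '/^{/,/^}/p')

i=$(printf '%s\n' "$json" | jq -r .input_i)
tp=$(printf '%s\n' "$json" | jq -r .input_tp)
lra=$(printf '%s\n' "$json" | jq -r .input_lra)
thresh=$(printf '%s\n' "$json" | jq -r .input_thresh)

if [ "$(printf '%s > 15\n' "$lra" | bc)" -eq 1 ]; then
    # Case 2: wide loudness range -> let loudnorm work dynamically.
    af="loudnorm=I=-23:TP=-1:LRA=11"
else
    # Cases 1 and 3: consistent levels -> two-pass linear gain to -23 LUFS.
    af="loudnorm=I=-23:TP=-1:LRA=11:linear=true"
    af="$af:measured_I=$i:measured_TP=$tp:measured_LRA=$lra:measured_thresh=$thresh"
fi

ffmpeg -i "$in" -c:v copy -af "$af" "$out"
```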

I know I'm asking for AI or some magic code that will DWIM without any effort, but I don't need perfection; I just want to get as close as possible to those three results. The alternative is that a (too) small team would have to manually balance 600+ videos in two weeks, and that's just not possible.

I just found that ffmpeg also has an "ebur128" filter, which I guess I can look into for more options, but for now I'll stick with loudnorm, accept that it won't be perfect, and put this on the back burner in case I have time left at some undefined point in the future (yeah, right). Alternatively, patches are definitely welcome ;-)

indiscipline commented 3 years ago

The logical simplification of the steps I proposed is to stick one gentle compressor before loudnorm to decrease the number of fallbacks to dynamic normalization. This will hurt case 3 a bit, but unfortunately that is the rarest case and you probably aren't optimizing for it.
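As a filter chain, that could look something like this (settings illustrative; for a two-pass run, the measurement pass should go through the same compressor):

```sh
# One gentle 2:1 compressor (threshold ~ -20 dBFS, 0.1 linear) to shrink the
# loudness range before loudnorm, so its linear mode succeeds more often.
ffmpeg -i talk.mp4 -c:v copy \
  -af "acompressor=threshold=0.1:ratio=2:attack=200:release=1000:knee=8,loudnorm=I=-23:TP=-1:LRA=11" \
  out.mp4
```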

yoe commented 3 years ago

It's the rarest at FOSDEM, but it's not always the rarest.

It's true that I'm not optimizing for it, though; and if there's a conference where case 3 is guaranteed to always apply, it's always possible to disable the normalization in SReview, so it shouldn't hurt there.

Thanks for your input, you've given me some food for thought. Not sure I'll find the time to implement this any time soon, but at least I know how to improve matters should it be necessary.

indiscipline commented 3 years ago

I'll be glad to help further. Feel free to contact me on Matrix with any questions. Not sure I'll be able to contribute any code, though.

markvdb commented 3 years ago

Note that the Sennheiser AVX series microphones we use at FOSDEM have relatively sophisticated automatic gain control built in.

Playing with a normalisation algorithm on top of that quickly becomes rather hairy...


yoe commented 3 years ago


Note that the Sennheiser AVX series microphones we use at FOSDEM have a relatively sophisticated automated gain control builtin.

Not as relevant as you might think, for three reasons:

At any rate, when I rewrote the audio loudness normalization functionality a while back, I made it much easier to also completely disable loudness normalization, so we can very easily switch it off if we want to.

Playing with a normalisation algorithm on top of that quickly becomes rather hairy...

Not really. If the audio loudness levels are similar over the entire video, there isn't that much that SReview can do badly.


abitrolly commented 1 year ago

Will https://github.com/complexlogic/rsgain/ be of any help here?

yoe commented 1 year ago

Will https://github.com/complexlogic/rsgain/ be of any help here?

From that page:

rsgain applies loudness metadata tags to your files, while leaving the audio stream untouched

That's not what we are trying to do; we want to create an audio stream that has the correct loudness levels, rather than leaving it at the "original" values and adding tags (so that a media player can correct for them).

There are a number of standards for audio loudness levels which we try to follow, and which most TV broadcasters adhere to; following them means you can play your video on a TV set and you won't need to adjust your volume (hopefully...)

Additionally, rsgain is meant for music libraries, and that is reflected in the container formats it supports: none of them are containers that support video; they are all audio-only containers.

So, thanks for the suggestion, but no, that won't be of any help.

abitrolly commented 1 year ago

@yoe am I right that once rsgain has annotated the demuxed audio stream, it is possible to render the stream with its annotations into properly normalized audio and mux it back into the video?

yoe commented 1 year ago

Possibly, but it probably also won't give us an EBU R.128 style loudness normalization, so it isn't very useful really.

abitrolly commented 1 year ago

Why is EBU R.128 so important? It looks like ReplayGain 2.0 is newer.

abitrolly commented 1 year ago

Some links for EBU.

yoe commented 1 year ago

ReplayGain is meant for portable media players; EBU R.128 is meant for broadcast audio.

The two do not serve similar purposes. "Newer" is irrelevant here :-)

abitrolly commented 1 year ago

I still don't get why the audio loudness of ReplayGain (or normalization for media streams in files) should be worse than EBU R.128 (or normalization for broadcast streams). If it is not worse, then why is it not useful?

indiscipline commented 1 year ago

I still don't get why the audio loudness of ReplayGain (or normalization for media streams in files) should be worse than EBU R.128 (or normalization for broadcast streams). If it is not worse, then why is it not useful?

It's almost the same thing (ReplayGain 2.0 is based on ITU BS.1770-3, while EBU R.128 is roughly ITU BS.1770-2), but ReplayGain 2.0 is just a draft and not a finished, agreed-upon standard. EBU R.128 is already in use and may very well be enforced in one way or another during delivery, so it's prudent to conform.

yoe commented 1 year ago

I still don't get why the audio loudness of ReplayGain (or normalization for media streams in files) should be worse than EBU R.128 (or normalization for broadcast streams). If it is not worse, then why is it not useful?

Sigh.

We already use the ffmpeg loudnorm filter, which does approximately the same thing as rsgain (although optimizations can certainly be added, as explained before). This comes for free with ffmpeg, which is already a dependency.

This "rsgain" thing that you point to does not show any advantages over ffmpeg, but adds extra dependencies (that we then have to install on our systems) and doesn't support video files -- which means you have to extract the audio, perform the normalization, and then join the audio back together again. We used to do this when audio normalization was implemented using bs1770gain, and it caused various problems, not least of which were A/V desync issues (on top of the other ... issues ... with bs1770gain that I won't go into here).

So, please accept that I've looked at the problem space, understand it reasonably well, and know how to deal with it. Suggesting a switch to $tool (where $tool is "not ffmpeg") is not helpful, unless it is accompanied by a thorough technical explanation that shows you know how audio normalization works and why $tool would be better than an ffmpeg-based approach.

Thanks,

yoe commented 1 year ago

So, please accept that I've looked at the problem space, understand it reasonably well, and know how to deal with it. Suggesting a switch to $tool (where $tool is "not ffmpeg") is not helpful, unless it is accompanied by a thorough technical explanation that shows you know how audio normalization works and why $tool would be better than an ffmpeg-based approach.

To expand on this a bit more:

There are a million values for $tool which claim they can do audio normalization "automatically" and "for free", and they're all lying, because audio normalization is not really something you can do automatically, because the human ear is a very complicated and weird thing. It's reasonably easy to accomplish for audio that is meticulously edited by audio professionals (such as a music album), but to get it to work correctly on a bunch of audio that comes from a source that can be literally anything from "the worn-out built-in microphone of an old laptop at too large a distance" to "a professional recording microphone used correctly" is a completely different story.

This is not a simple "ah I know this thing that I used over my music library so let's just use that" thing.

markvdb commented 10 months ago

Not relevant any more, since there are no more prerecordings + remote speakers.

yoe commented 10 months ago

It is still relevant, as we still want to do audio normalization in postprocessing for things that happened on-site.

abitrolly commented 10 months ago

unless it is accompanied by a thorough technical explanation that shows you know how audio normalization works and why $tool would be better than an ffmpeg-based approach

That's exactly what I expect from open source. ) I really appreciate all the talks and explanations that tell me things I would never be able to discover otherwise.

audio normalization is not really something you can do automatically, because the human ear is a very complicated and weird thing

The more I study acoustics in rooms and recording for movies, the more excited I get about the whole thing. When I talk to people about sound, they all listen like a child to a fairy tale. This topic is fascinating exactly because it is ubiquitous and weird.

but to get it to work correctly on a bunch of audio that comes from a source that can be literally anything from "the worn-out built-in microphone of an old laptop at too large a distance" to "a professional recording microphone used correctly" is a completely different story

AI systems could do this perfectly well, like they do for pictures, but for that they need to copy human expertise, and for that some humans need to share this expertise with the world to make it possible. That would be a humane approach to AI. Also, "AI", or rather "ML", has probably been part of digital sound processing from the very beginning.

krokodilerian commented 6 months ago

Do we still have this issue in the rooms? The microphones we use (Sennheiser AVX) seem to be doing a great job with their AGC, and the rest is up to the video team not to screw up the levels when setting up the system.

I think the only issue I've had was when someone really screwed things up, and there's a limit to how much we can protect against that :)