csteinmetz1 / auraloss

Collection of audio-focused loss functions in PyTorch
Apache License 2.0

Enhancement: New metric for source separation, separately measuring bleed and fullness in separated audio #79

Open jarredou opened 3 weeks ago

jarredou commented 3 weeks ago

Hi,

I've found a simple way to objectively measure bleed and fullness in the context of music source separation. I think it could be useful, since I haven't seen any existing objective metric that does this, while it's a common question from users.

Here is the code as a metric:

import numpy as np
import librosa

def bleed_full(ref, est, sr=44100):
    # STFT parameters
    n_fft = 4096
    hop_length = 1024
    n_mels = 512

    # Compute Mag STFTs
    D1 = np.abs(librosa.stft(ref, n_fft=n_fft, hop_length=hop_length))
    D2 = np.abs(librosa.stft(est, n_fft=n_fft, hop_length=hop_length))

    # Convert to mel spectrograms
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    S1_mel = np.dot(mel_basis, D1)
    S2_mel = np.dot(mel_basis, D2)

    # Convert to decibels
    S1_db = librosa.amplitude_to_db(S1_mel)
    S2_db = librosa.amplitude_to_db(S2_mel)

    # Calculate difference
    diff = S2_db - S1_db

    # Separate positive and negative differences
    positive_diff = diff[diff > 0]
    negative_diff = diff[diff < 0]

    # Calculate averages
    average_positive = np.mean(positive_diff) if len(positive_diff) > 0 else 0
    average_negative = np.mean(negative_diff) if len(negative_diff) > 0 else 0

    # Scale so that a perfect separation scores 100
    bleedness = 100 / (average_positive + 1)
    fullness = 100 / (-average_negative + 1)

    return bleedness, fullness
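For intuition, here's a minimal numpy-only sketch (illustrative names; the STFT/mel steps are omitted) showing how the positive/negative split behaves on precomputed dB spectrograms: a perfect estimate scores 100/100, while uniform +3 dB bleed drops the bleedness score to 25 and leaves fullness untouched.

```python
import numpy as np

def bleed_full_db(ref_db, est_db):
    """Toy sketch of the bleed/fullness split, operating directly on
    precomputed dB spectrograms."""
    diff = est_db - ref_db
    pos = diff[diff > 0]   # energy added relative to the reference -> bleed
    neg = diff[diff < 0]   # energy missing from the reference -> lost fullness
    avg_pos = pos.mean() if pos.size else 0.0
    avg_neg = neg.mean() if neg.size else 0.0
    bleedness = 100.0 / (avg_pos + 1)
    fullness = 100.0 / (-avg_neg + 1)
    return bleedness, fullness

ref = np.zeros((4, 4))
perfect = bleed_full_db(ref, ref)        # (100.0, 100.0)
noisy = bleed_full_db(ref, ref + 3.0)    # (25.0, 100.0): only bleed penalised
```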

I guess it can be adapted into losses, but I'm not a dev/scientist and I lack the knowledge to make it bulletproof; if it's worth doing, you would know better than me.
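One possible adaptation into a loss (a rough sketch, not part of auraloss; `bleed_full_loss` and its weights are illustrative names): drop the 100/(x+1) scoring and instead use the mean positive and mean negative parts of the dB difference as two penalty terms, which a training framework could minimise directly.

```python
import numpy as np

def bleed_full_loss(ref_db, est_db, w_bleed=1.0, w_full=1.0):
    # Hypothetical loss sketch: penalise added energy (bleed) and
    # missing energy (lost fullness) separately, then combine them.
    # Means here run over all bins, a design choice that differs from
    # the metric, which averages only the nonzero parts.
    diff = est_db - ref_db
    bleed_term = np.maximum(diff, 0.0).mean()   # relu(diff): content added
    full_term = np.maximum(-diff, 0.0).mean()   # relu(-diff): content missing
    return w_bleed * bleed_term + w_full * full_term

ref = np.zeros((8, 8))
loss_perfect = bleed_full_loss(ref, ref)        # 0.0 for a perfect estimate
loss_bleedy = bleed_full_loss(ref, ref + 2.0)   # 2.0 for uniform +2 dB bleed
```

In a PyTorch port, `np.maximum(x, 0.0)` would become a relu so both terms stay differentiable almost everywhere.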

The same concept can be used to draw spectrograms, for example: bleed/positive values in red, missing content/negative values in blue, perfect separation = 0 in white.

turian commented 1 week ago

@jarredou I'm curious about this. So basically:

Instead of computing an L1 mel spectral distance, you separate it into two components: 1) Bleed = anything ADDED to the target spectrogram; 2) -Fullness = anything REMOVED from the target spectrogram.

I see you do MSS work. I noted that in the BS-Roformer paper the authors wrote: "our model outputs gained more preference from musicians and educators than from music producers in the listening test of SDX23". To my ears, BS-Roformers seem to have less bleed but also less fullness. I'd be curious if you have any numbers to share. (cc @ZFTurbo )

jarredou commented 4 days ago

@turian Yeah, that's the simple idea behind the 2 metrics.

About the BS-Roformer quote: it's from the final paper of the SDX/MDX23 contest, https://arxiv.org/pdf/2308.06979

We don't have numbers comparing different neural network models. For now, the metric has only been used to evaluate different fine-tuned versions built on top of Kimberley's Melband-Roformer model. The results are accessible here: https://docs.google.com/spreadsheets/d/1pPEJpu4tZjTkjPh_F5YjtIyHq8v0SxLnBydfUBUNlbI/edit and they were computed using the mvsep.com multisong eval dataset.

ZFTurbo added a torch version of the metric to his training script a few days ago.