Rikorose / DeepFilterNet

Noise suppression using deep filtering
https://huggingface.co/spaces/hshr/DeepFilterNet2

Deep filter spectrogram normalisation #514

Closed · mattpitkin closed this 1 month ago

mattpitkin commented 4 months ago

I have noticed that the normalisation of the complex spectrogram features used for deep filtering is not doing what is expected (as described in, say, equation 12 of https://ieeexplore.ieee.org/document/9855850). In the band_unit_norm and band_unit_norm_t functions in lib.rs, the running estimate of the mean absolute value of the spectrogram (i.e., an estimate of the spectrogram's standard deviation, not its variance) is square rooted before being used for normalisation. Since it is not a variance that is being estimated, I don't think the square root should be applied.
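For reference, here is a minimal Python sketch of my reading of that Rust logic; the smoothing factor alpha and the (time, freq) layout are placeholders of my own, the real values come from the state set up in lib.rs:

import torch

def band_unit_norm(spec: torch.Tensor, alpha: float = 0.99) -> torch.Tensor:
    """Sketch of the exponential unit normalisation of a complex spectrogram.

    spec: complex tensor of shape (time, freq); alpha is an assumed
    smoothing factor standing in for the decay used in lib.rs.
    """
    # running mean of |X|: a scale (standard deviation) estimate, not a variance
    mu = spec[0].abs()
    out = torch.empty_like(spec)
    for t in range(spec.shape[0]):
        mu = alpha * mu + (1.0 - alpha) * spec[t].abs()
        # the current code divides by mu.sqrt(); the proposed fix divides by mu
        out[t] = spec[t] / mu.sqrt()
    return out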

I've tested the spectrogram features with and without the square root on the noisy_snr0.wav file. With the square root applied (i.e., the current code), I get:

from df.enhance import init_df, df_features
from df.io import load_audio

# load the default DeepFilterNet model and its state
model, df_state, _ = init_df()
# load the test file at 48 kHz on the CPU
audio, meta = load_audio("noisy_snr0.wav", 48000, "cpu")
# extract the spectrogram, ERB features and complex deep-filtering features
spec, erb_spec, feat_spec = df_features(audio, df_state, 96)

# get the standard deviation across time for each of the 96 deep-filtering
# frequency bins (columns are the real and imaginary parts)
feat_spec.squeeze().std(axis=0)
tensor([[0.1591, 0.0000],
        [0.0860, 0.0697],
        [0.0549, 0.0647],
        [0.0660, 0.0672],
        [0.0794, 0.0768],
        [0.0854, 0.0898],
        [0.0815, 0.0774],
        [0.0863, 0.0810],
        [0.0846, 0.0840],
        [0.0892, 0.0942],
        [0.0704, 0.0788],
        [0.0740, 0.0697],
        [0.0747, 0.0701],
        [0.0859, 0.0967],
        [0.0655, 0.0760],
        [0.0397, 0.0504],
        [0.0350, 0.0424],
        [0.0403, 0.0462],
        [0.0375, 0.0414],
        [0.0395, 0.0428],
        [0.0376, 0.0400],
        [0.0298, 0.0326],
        [0.0342, 0.0453],
        [0.0360, 0.0502],
        [0.0340, 0.0331],
        [0.0309, 0.0305],
        [0.0350, 0.0298],
        [0.0346, 0.0347],
        [0.0491, 0.0447],
        [0.0545, 0.0405],
        [0.0482, 0.0416],
        [0.0415, 0.0346],
        [0.0356, 0.0361],
        [0.0364, 0.0405],
        [0.0563, 0.0501],
        [0.0534, 0.0427],
        [0.0349, 0.0406],
        [0.0388, 0.0317],
        [0.0464, 0.0376],
        [0.0486, 0.0595],
        [0.0445, 0.0623],
        [0.0323, 0.0409],
        [0.0247, 0.0300],
        [0.0292, 0.0252],
        [0.0233, 0.0221],
        [0.0206, 0.0199],
        [0.0185, 0.0222],
        [0.0210, 0.0211],
        [0.0309, 0.0321],
        [0.0375, 0.0307],
        [0.0444, 0.0312],
        [0.0289, 0.0272],
        [0.0225, 0.0227],
        [0.0256, 0.0212],
        [0.0225, 0.0269],
        [0.0280, 0.0318],
        [0.0316, 0.0343],
        [0.0338, 0.0290],
        [0.0303, 0.0295],
        [0.0374, 0.0292],
        [0.0383, 0.0349],
        [0.0401, 0.0392],
        [0.0335, 0.0376],
        [0.0316, 0.0292],
        [0.0307, 0.0233],
        [0.0291, 0.0248],
        [0.0237, 0.0300],
        [0.0252, 0.0314],
        [0.0260, 0.0277],
        [0.0226, 0.0278],
        [0.0227, 0.0247],
        [0.0251, 0.0227],
        [0.0215, 0.0205],
        [0.0223, 0.0245],
        [0.0310, 0.0316],
        [0.0284, 0.0305],
        [0.0239, 0.0294],
        [0.0239, 0.0260],
        [0.0267, 0.0259],
        [0.0271, 0.0277],
        [0.0210, 0.0260],
        [0.0227, 0.0246],
        [0.0242, 0.0249],
        [0.0263, 0.0229],
        [0.0307, 0.0238],
        [0.0282, 0.0249],
        [0.0289, 0.0234],
        [0.0228, 0.0255],
        [0.0263, 0.0234],
        [0.0289, 0.0242],
        [0.0307, 0.0311],
        [0.0319, 0.0295],
        [0.0258, 0.0279],
        [0.0294, 0.0235],
        [0.0267, 0.0230],
        [0.0271, 0.0256]])

The spectrogram is not really unit normalised.
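A back-of-the-envelope check of why (my own reasoning, not from the code): if the raw per-bin scale is sigma, then mu estimates roughly sigma, and dividing by sqrt(mu) leaves a standard deviation of about sigma / sqrt(sigma) = sqrt(sigma). The values of ~0.03 above would then correspond to a raw scale of sigma ~ 1e-3, while dividing by mu itself should give values of order 1, as the second set of numbers below shows.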

Whereas, if I "fix" the normalisation by removing the square root and repeat the same check, I get:

feat_spec.squeeze().std(axis=0)
tensor([[2.3296, 0.0000],
        [1.3393, 1.1812],
        [0.9863, 1.2168],
        [1.1237, 1.1676],
        [1.2367, 1.1289],
        [1.1755, 1.2254],
        [1.1668, 1.1551],
        [1.3028, 1.2769],
        [1.4278, 1.4436],
        [1.5003, 1.5904],
        [1.3261, 1.4367],
        [1.3822, 1.3554],
        [1.4817, 1.3798],
        [1.7359, 1.8143],
        [1.5310, 1.6814],
        [1.1912, 1.4886],
        [1.1763, 1.4043],
        [1.3484, 1.4549],
        [1.2866, 1.3960],
        [1.3701, 1.5043],
        [1.3465, 1.4387],
        [1.1227, 1.2723],
        [1.1466, 1.4495],
        [1.2039, 1.6014],
        [1.3519, 1.2652],
        [1.2835, 1.2271],
        [1.3928, 1.1938],
        [1.3658, 1.3786],
        [1.7323, 1.4352],
        [1.8235, 1.3663],
        [1.6639, 1.4657],
        [1.5581, 1.3309],
        [1.3403, 1.4950],
        [1.4371, 1.6201],
        [1.9739, 1.7786],
        [1.9526, 1.6100],
        [1.4855, 1.5853],
        [1.5229, 1.3026],
        [1.6172, 1.4281],
        [1.6101, 1.8892],
        [1.5990, 2.1108],
        [1.4510, 1.7498],
        [1.2546, 1.3937],
        [1.3603, 1.2353],
        [1.1743, 1.1317],
        [1.1172, 1.1020],
        [1.0012, 1.2032],
        [1.1398, 1.1405],
        [1.3787, 1.4434],
        [1.5753, 1.2559],
        [1.7045, 1.3374],
        [1.3102, 1.2665],
        [1.1642, 1.1396],
        [1.3129, 1.0739],
        [1.1107, 1.3278],
        [1.2442, 1.4601],
        [1.4026, 1.4837],
        [1.5103, 1.3081],
        [1.3626, 1.2847],
        [1.5241, 1.1882],
        [1.5153, 1.3750],
        [1.5315, 1.4012],
        [1.3444, 1.4306],
        [1.3527, 1.2662],
        [1.4215, 1.0987],
        [1.3362, 1.2002],
        [1.0996, 1.3726],
        [1.1805, 1.3284],
        [1.2345, 1.2658],
        [1.1338, 1.4018],
        [1.2715, 1.3374],
        [1.4306, 1.2368],
        [1.2349, 1.1801],
        [1.2234, 1.3517],
        [1.4583, 1.5467],
        [1.4278, 1.4832],
        [1.2620, 1.5466],
        [1.2846, 1.4759],
        [1.4174, 1.3683],
        [1.4542, 1.5140],
        [1.2170, 1.4376],
        [1.2494, 1.3206],
        [1.3139, 1.3805],
        [1.4034, 1.3058],
        [1.5339, 1.3417],
        [1.4687, 1.3555],
        [1.5251, 1.2322],
        [1.2840, 1.3854],
        [1.4100, 1.2823],
        [1.4906, 1.2987],
        [1.4625, 1.5008],
        [1.5403, 1.4719],
        [1.3185, 1.5270],
        [1.4710, 1.2591],
        [1.4023, 1.2440],
        [1.3820, 1.3304]])

and the numbers are now close to 1.

In practice, I've trained a model with the "fix" and it doesn't seem to make a noticeable difference (although admittedly I've not done a thorough set of tests).

mattpitkin commented 4 months ago

I've done some testing on the Valentini dataset using the evaluation metrics within DeepFilterNet and find very little difference between the "fixed" normalisation and the original. Below is a plot showing, for each metric, the % relative difference between the metric values from a model trained with the "fixed" (new) normalisation and one trained with the current (original) normalisation, for each audio example.
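To be explicit about what is plotted (my own notation, not a DeepFilterNet helper):

def pct_rel_diff(metric_fixed: float, metric_orig: float) -> float:
    # percent relative difference of a metric between the model trained
    # with the "fixed" normalisation and the one trained with the original
    return 100.0 * (metric_fixed - metric_orig) / abs(metric_orig)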

[Figure: percentage relative difference in each evaluation metric between the "fixed" and original normalisation, plotted per audio example]

On average, the "fixed" normalisation does slightly better across all metrics except SSNR (which is unchanged on average), although the difference is rather marginal.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.