This repository conducts ablation studies on local attention (a.k.a. band attention) applied to the full-band spectrum, namely local spectral attention (LSA). In two full-band speech enhancement (SE) models with spectral attention, the conventional global attention is replaced with LSA, which at each frequency attends only to adjacent bands. One model is DPARN, whose source code can be found at https://github.com/Qinwen-Hu/dparn.
The other model is the Multi-Scale Temporal Frequency with Axial Attention (MTFAA) network, which ranked 1st in the DNS-4 challenge for full-band SE and is described in detail in https://ieeexplore.ieee.org/document/9746610. Here we release an unofficial PyTorch implementation of MTFAA as well as its modified version. This work has been submitted to Interspeech 2023.
soundfile: 0.10.3
librosa: 0.8.1
torch: 3.7.10
numpy: 1.20.3
scipy: 1.7.2
pandas: 1.3.4
tqdm: 4.62.3
Split your speech and noise audio into 10-second segments and generate the .csv files that index your data. Put your RIR audio files (.wav format) in one folder. Edit the .csv paths in Dataloader.py:
TRAIN_NOISE_CSV = './train_noise_data.csv'
VALID_CLEAN_CSV = './valid_clean_data.csv'
VALID_NOISE_CSV = './valid_noise_data.csv'
RIR_DIR = 'directory containing the RIR .wav files'
where the .csv files for clean speech are organized as
file_dir | snr |
---|---|
./clean_0001.wav | 4 |
./clean_0002.wav | -1 |
./clean_0003.wav | 0 |
... | ... |
and the .csv files for noise are organized as
file_dir |
---|
./noise_0001.wav |
./noise_0002.wav |
./noise_0003.wav |
... |
Here 'file_dir' is the absolute path to each audio file and 'snr' is the signal-to-noise ratio (SNR).
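The index files described above could be generated with a short script like the following. This is only a sketch under stated assumptions: the helper names `write_clean_csv` and `write_noise_csv` are hypothetical, the files are assumed to be comma-delimited with a header row, and drawing each SNR uniformly from -5 to 5 dB is an assumption based on the training SNR range mentioned in this README. Check Dataloader.py for the exact format it expects.

```python
import csv
import random
from pathlib import Path


def write_clean_csv(audio_dir, csv_path, snr_choices=range(-5, 6), seed=0):
    """Index every .wav file in audio_dir with a randomly drawn SNR in dB (assumed policy)."""
    rng = random.Random(seed)
    choices = list(snr_choices)
    wav_files = sorted(Path(audio_dir).glob("*.wav"))
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file_dir", "snr"])  # column names taken from this README
        for wav in wav_files:
            writer.writerow([str(wav.resolve()), rng.choice(choices)])
    return len(wav_files)


def write_noise_csv(audio_dir, csv_path):
    """Index every .wav file in audio_dir; noise files carry no SNR column."""
    wav_files = sorted(Path(audio_dir).glob("*.wav"))
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file_dir"])
        for wav in wav_files:
            writer.writerow([str(wav.resolve())])
    return len(wav_files)
```

For example, `write_noise_csv("./noise_segments", "./train_noise_data.csv")` would index a (hypothetical) folder of noise segments.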
After preparing the environment and data, train the model with:
python Network_Training_MTFAA_full.py -m model_to_train(including MTFAA, MTFAA_LSA or MTFAA_ASqBi) -c Dir_to_save_the_checkpoint_files -e Epochs_for_training(default is 300) -d Device_used_for_training(cuda:0)
Enhance noisy audio with:
python Infer.py -m model_to_use(including MTFAA, MTFAA_LSA or MTFAA_ASqBi) -c path_to_load_the_checkpoint_files -t path_to_folder_containing_noisy_audios -s path_to_folder_saving_the_enhanced_clips -d Device_used_for_inference(cuda:0)
We demonstrate the effectiveness of our proposed method on the full-band dataset of the 4th DNS challenge. The total training dataset contains around 1000 hours of speech and 220 hours of noise. Room impulse responses are convolved with clean speech to generate simulated reverberant speech, which is kept as the training target. In the training stage, reverberant utterances are mixed with noise recordings at SNRs ranging from -5 dB to 5 dB at 1 dB intervals. For the test set, 800 clips of reverberant utterances are mixed with unseen noise types at SNRs ranging from -5 dB to 15 dB. Each test clip is 5 seconds long. All utterances are sampled at 48 kHz in our experiments. We also conduct experiments on the well-known VCTK-DEMAND dataset for comprehensive validation.
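The data simulation described above (RIR convolution followed by mixing at a target SNR) can be sketched with NumPy as follows. The function names, the toy signals, and the choice to truncate the reverberant signal to the input length are assumptions for illustration and are not taken from Dataloader.py.

```python
import numpy as np


def add_reverb(speech, rir):
    """Convolve clean speech with a room impulse response, keeping the input length."""
    return np.convolve(speech, rir)[: len(speech)]


def mix_at_snr(target, noise, snr_db):
    """Scale the noise so the target-to-noise power ratio equals snr_db, then add it.

    Assumes the noise is at least as long as the target.
    """
    noise = noise[: len(target)]
    target_power = np.mean(target ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(target_power / (noise_power * 10 ** (snr_db / 10)))
    return target + gain * noise


rng = np.random.default_rng(0)
speech = rng.standard_normal(4800)  # stand-in for 0.1 s of 48 kHz speech
noise = rng.standard_normal(4800)   # stand-in for a noise recording
rir = np.exp(-np.arange(240) / 40.0) * rng.standard_normal(240)  # toy decaying RIR

target = add_reverb(speech, rir)    # reverberant speech is the training target
snr_db = int(rng.integers(-5, 6))   # -5 dB to 5 dB in 1 dB steps
noisy = mix_at_snr(target, noise, snr_db)
```

Because the reverberant signal is the target, the SNR here is defined against the reverberant speech and not against the dry speech.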
The visualization of LSA mechanism can be seen in the figure below:
The unofficial PyTorch implementations of MTFAA and its LSA-based variant can be found in MTFAA_Net_full.py and MTFAA_Net_full_local_atten.py respectively. For DPARN, readers may refer to https://github.com/Qinwen-Hu/dparn.
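As a rough illustration of the LSA idea (each frequency band attends only to nearby bands), the following NumPy sketch applies a band mask to single-head scaled dot-product attention over the frequency axis. It is not the code in MTFAA_Net_full_local_atten.py. The function names are hypothetical, and treating Nl as a one-sided width (keeping keys with |i - j| <= Nl) is an assumption.

```python
import numpy as np


def local_band_mask(num_bands, n_l):
    """Boolean (num_bands, num_bands) mask, True where |query - key| <= n_l (assumed convention)."""
    idx = np.arange(num_bands)
    return np.abs(idx[:, None] - idx[None, :]) <= n_l


def local_spectral_attention(q, k, v, n_l):
    """Single-head attention over the frequency axis, restricted to a local band.

    q, k, v: arrays of shape (num_bands, dim).
    Returns the attended values and the attention weights.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Keys outside the band get -inf, so they receive zero weight after the softmax.
    scores = np.where(local_band_mask(len(q), n_l), scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
num_bands, dim = 16, 8  # hypothetical F' and channel dimension
q, k, v = (rng.standard_normal((num_bands, dim)) for _ in range(3))
out, weights = local_spectral_attention(q, k, v, n_l=num_bands // 4)
```

Setting `n_l` to `num_bands - 1` or larger recovers conventional global attention, which makes it easy to compare the two on the same inputs.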
First, we experiment with different settings of Nl on the VCTK-DEMAND dataset. The results are given in the table below:
Model | Nl | PESQ | CSIG | CBAK | COVL | STOI (%) | SiSDR (dB) |
---|---|---|---|---|---|---|---|
MTFAA | F'/2 | 3.16 | 4.34 | 3.63 | 3.77 | 94.7 | 18.5 |
MTFAA | F'/4 | 3.15 | 4.32 | 3.58 | 3.76 | 94.6 | 18.1 |
MTFAA | sqrt(F') | 3.16 | 4.35 | 3.61 | 3.78 | 94.7 | 18.8 |
DPARN | F'/2 | 2.96 | 4.29 | 3.63 | 3.68 | 94.2 | 18.7 |
DPARN | F'/4 | 2.95 | 4.27 | 3.65 | 3.68 | 94.2 | 18.8 |
DPARN | sqrt(F') | 2.94 | 4.27 | 3.62 | 3.67 | 94.1 | 18.5 |

PESQ, CSIG, CBAK and COVL are wideband metrics; STOI and SiSDR are full-band metrics.
The effect of Nl differs between models, so for each model we choose the setting that performs best, i.e. sqrt(F') for MTFAA and F'/2 for DPARN. Next, we train the models on the larger DNS4 dataset. The training process is shown in the figures below, where both LSA-based models converge better than the original models.
The objective test results can be seen in table below
Full-band metrics. STOI and SiSDR (dB) are reported per SNR range (dB); LSD (dB) is reported per frequency band (kHz).

Model | STOI (-5~0) | STOI (0~15) | STOI (Ovrl.) | SiSDR (-5~0) | SiSDR (0~15) | SiSDR (Ovrl.) | LSD (0~8 kHz) | LSD (8~24 kHz) | LSD (Full) |
---|---|---|---|---|---|---|---|---|---|
Noisy | 0.687 | 0.805 | 0.771 | -2.515 | 7.971 | 5.166 | 18.37 | 12.38 | 14.38 |
MTFAA | 0.805 | 0.876 | 0.856 | 10.10 | 15.74 | 14.23 | 10.33 | 9.349 | 9.678 |
MTFAA-LSA | 0.809 | 0.881 | 0.860 | 10.34 | 16.20 | 14.63 | 9.840 | 8.636 | 9.037 |
DPARN | 0.752 | 0.858 | 0.828 | 8.461 | 13.71 | 12.31 | 10.92 | 13.11 | 12.38 |
DPARN-LSA | 0.757 | 0.861 | 0.831 | 8.617 | 13.84 | 12.47 | 10.76 | 12.99 | 12.25 |

Wideband metrics, reported per SNR range (dB).

Model | PESQ (-5~0) | PESQ (0~15) | PESQ (Ovrl.) | CSIG (-5~0) | CSIG (0~15) | CSIG (Ovrl.) | CBAK (-5~0) | CBAK (0~15) | CBAK (Ovrl.) | COVL (-5~0) | COVL (0~15) | COVL (Ovrl.) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Noisy | 1.160 | 1.446 | 1.364 | 2.023 | 2.719 | 2.517 | 1.833 | 2.481 | 2.293 | 1.571 | 2.095 | 1.943 |
MTFAA | 1.981 | 2.669 | 2.470 | 3.465 | 4.113 | 3.925 | 2.951 | 3.523 | 3.357 | 2.754 | 3.436 | 3.238 |
MTFAA-LSA | 2.084 | 2.795 | 2.589 | 3.517 | 4.203 | 4.004 | 3.006 | 3.593 | 3.423 | 2.829 | 3.547 | 3.339 |
DPARN | 1.702 | 2.309 | 2.134 | 3.136 | 3.759 | 3.580 | 2.505 | 2.859 | 2.757 | 2.447 | 3.069 | 2.890 |
DPARN-LSA | 1.776 | 2.423 | 2.237 | 3.179 | 3.829 | 3.642 | 2.619 | 3.030 | 2.912 | 2.507 | 3.166 | 2.977 |
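For readers unfamiliar with SiSDR, the scale-invariant signal-to-distortion ratio is commonly computed as in the sketch below. This follows the usual definition (project the estimate onto the reference, then compare the projection energy with the residual energy). Removing the mean first is a common convention and an assumption here, and the evaluation scripts behind the tables above may differ in such details.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals of equal length."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference that best explains the estimate.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    projection = scale * reference
    residual = estimate - projection
    return 10 * np.log10((np.sum(projection ** 2) + eps) / (np.sum(residual ** 2) + eps))


rng = np.random.default_rng(0)
reference = rng.standard_normal(10000)                     # stand-in for target speech
estimate = reference + 0.1 * rng.standard_normal(10000)    # stand-in for enhanced speech
score = si_sdr(estimate, reference)
```

Rescaling the estimate by any non-zero gain leaves the score unchanged, which is the property that distinguishes SiSDR from plain SNR.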
The proposed LSA improves the enhancement performance of both the causal DPARN and MTFAA models in terms of all objective metrics. To reveal the benefit of the LSA mechanism, we visualize the normalized average spectral attention plots of the attention blocks in both the original MTFAA and the LSA-based MTFAA, generated from audio in the test set, as shown in the figures below.
It can be seen from the fifth attention layer that the LSA-based model more effectively emphasizes the structural features of harmonics in low bands (marked with red boxes) and the almost randomly distributed components in high bands (marked with black boxes). Furthermore, the blue boxes show that LSA also suppresses the modeling of invalid correlations between the low bands and the high bands. Hence, the speech pattern in the spectrum can be better modeled by LSA. Further investigation of the enhanced signals reveals that global attention in the frequency domain is more likely to distort speech components or leave excessive residual noise in non-speech segments, while the proposed LSA effectively alleviates this problem. Two typical examples are shown in Figure 3, where the benefit of LSA can be clearly seen. A possible explanation is that better exploitation of the speech pattern helps the LSA-based model discriminate speech from noise components more effectively, especially in low-SNR environments.
To further demonstrate the importance of modeling local correlation in the spectrum for full-band SE tasks, we also compare local attention with a recently proposed biased attention method, namely Attention with Linear Biases (ALiBi), which negatively biases attention scores with a linearly decreasing penalty proportional to the distance between the relevant key and query for efficient extrapolation. Its application to spectral attention can be seen in the figure below.
We modify the penalty bias to decrease quadratically with distance for better performance and name the method ASqBi, as indicated in the figures below.
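The difference between a linear (ALiBi-style) and a quadratic distance penalty can be sketched as follows. This is an illustration under assumptions: the slope value is hypothetical, the quadratic form `-slope * |i - j|**2` is one reading of "decrease in a square manner", and the exact bias used in MTFAA_Net_full_F_ASqbi.py may differ.

```python
import numpy as np


def distance_bias(num_bands, slope, power=1):
    """Negative bias -slope * |i - j|**power (power=1: ALiBi-style, power=2: quadratic)."""
    idx = np.arange(num_bands)
    dist = np.abs(idx[:, None] - idx[None, :]).astype(float)
    return -slope * dist ** power


def biased_attention_weights(scores, bias):
    """Softmax over keys of the biased scores."""
    z = scores + bias
    z = z - z.max(axis=-1, keepdims=True)
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)


num_bands = 6                              # hypothetical number of bands
scores = np.zeros((num_bands, num_bands))  # flat scores isolate the effect of the bias
w_linear = biased_attention_weights(scores, distance_bias(num_bands, slope=0.5, power=1))
w_square = biased_attention_weights(scores, distance_bias(num_bands, slope=0.5, power=2))
```

With flat scores, the quadratic bias concentrates more weight near the diagonal than the linear one, yet unlike a hard band mask it also penalizes keys inside the local region whenever their distance is non-zero.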
The modified method is combined with MTFAA (see MTFAA_Net_full_F_ASqbi.py), and the ablation test results can be found in the table below. The overall performance degrades compared with LSA. A possible explanation is that the negative bias added within the local attention region weakens the model's ability to extract local intercorrelation.
Full-band metrics. STOI and SiSDR (dB) are reported per SNR range (dB); LSD (dB) is reported per frequency band (kHz).

Model | STOI (-5~0) | STOI (0~15) | STOI (Ovrl.) | SiSDR (-5~0) | SiSDR (0~15) | SiSDR (Ovrl.) | LSD (0~8 kHz) | LSD (8~24 kHz) | LSD (Full) |
---|---|---|---|---|---|---|---|---|---|
MTFAA-ASqBi | 0.811 | 0.881 | 0.860 | 10.425 | 15.944 | 14.468 | 10.307 | 9.495 | 9.766 |
MTFAA-LSA | 0.809 | 0.881 | 0.860 | 10.347 | 16.201 | 14.635 | 9.840 | 8.636 | 9.037 |

Wideband metrics, reported per SNR range (dB).

Model | PESQ (-5~0) | PESQ (0~15) | PESQ (Ovrl.) | CSIG (-5~0) | CSIG (0~15) | CSIG (Ovrl.) | CBAK (-5~0) | CBAK (0~15) | CBAK (Ovrl.) | COVL (-5~0) | COVL (0~15) | COVL (Ovrl.) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
MTFAA-ASqBi | 2.064 | 2.769 | 2.564 | 3.487 | 4.165 | 3.968 | 2.987 | 3.558 | 3.392 | 2.804 | 3.513 | 3.308 |
MTFAA-LSA | 2.084 | 2.795 | 2.589 | 3.517 | 4.203 | 4.004 | 3.006 | 3.593 | 3.423 | 2.829 | 3.547 | 3.339 |
We also conduct a subjective listening preference test on the MTFAA model to validate the benefit of the LSA mechanism. 50 enhanced samples and their reference target speech are randomly selected from the test set. 15 listeners with normal hearing compare the enhanced results against the reference speech and choose the preferred one. Each sample is evaluated by at least 3 listeners. The results are given in the table below. Over 60% of the samples enhanced by the LSA-based MTFAA are judged to have better perceptual quality and lower noise levels, which demonstrates the effectiveness of LSA in full-band SE tasks.
Model | LSA | Preference (%) |
---|---|---|
MTFAA | Ⅹ | 38.0 |
MTFAA | √ | 62.0 |
The proposed method also reduces the computational complexity of spectral attention. The statistics are given in the table below:
Model | Complexity reduction in spectral attention (%) |
---|---|
MTFAA | 63.2 |
DPARN | 25.4 |
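To give a feel for where such savings come from, the sketch below counts how many query-key pairs a band restriction skips compared with full attention. It is a back-of-the-envelope illustration only: the function name is hypothetical, the band convention |i - j| <= Nl is an assumption, and it is not meant to reproduce the percentages in the table, which depend on the actual F' of each attention layer and on how the authors measured complexity.

```python
def band_attention_reduction(num_bands, n_l):
    """Percentage of query-key pairs skipped when keeping only keys with |i - j| <= n_l."""
    total = num_bands * num_bands
    # For query i, the kept keys span max(0, i - n_l) .. min(num_bands - 1, i + n_l).
    kept = sum(
        min(num_bands - 1, i + n_l) - max(0, i - n_l) + 1
        for i in range(num_bands)
    )
    return 100.0 * (total - kept) / total


reduction = band_attention_reduction(num_bands=64, n_l=8)  # hypothetical F' and Nl
```

A narrower band skips more pairs, so under this simple count a setting such as sqrt(F') saves more than F'/2.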
We compare the modified MTFAA model with previous full-band SOTA methods on the VCTK-DEMAND dataset. The results are listed in the table below:
Models | Year | Param. (M) | PESQ | STOI (%) | CSIG | CBAK | COVL |
---|---|---|---|---|---|---|---|
Noisy | - | - | 1.97 | 92.1 | 3.34 | 2.44 | 2.63 |
RNNoise | 2020 | 0.06 | 2.33 | 92.2 | 3.40 | 2.51 | 2.84 |
PercepNet | 2020 | 8.00 | 2.73 | - | - | - | - |
CTS-Net(full) | 2020 | 7.09 | 2.92 | 94.3 | 4.22 | 3.43 | 3.62 |
DCCRN | 2020 | 3.70 | 2.54 | 93.8 | 3.74 | 3.13 | 2.75 |
NSNet2 | 2021 | 6.17 | 2.47 | 90.3 | 3.23 | 2.99 | 2.90 |
S-DCCRN | 2022 | 2.34 | 2.84 | 94.0 | 4.03 | 3.43 | 2.97 |
FullSubNet+ | 2022 | 8.67 | 2.88 | 94.0 | 3.86 | 3.42 | 3.57 |
GaGNet | 2022 | 5.95 | 2.94 | - | 4.26 | 3.45 | 3.59 |
DMF-Net | 2022 | 7.84 | 2.97 | 94.4 | 4.26 | 3.52 | 3.62 |
DS-Net | 2022 | 3.30 | 2.78 | 94.3 | 4.20 | 3.34 | 3.48 |
SF-Net | 2022 | 6.98 | 3.02 | 94.5 | 4.36 | 3.54 | 3.67 |
DeepFilterNet2 | 2022 | 2.31 | 3.08 | 94.3 | 4.30 | 3.40 | 3.70 |
MTFAA (Cau., LSA) | 2023 | 1.5 | 3.16 | 94.7 | 4.35 | 3.61 | 3.78 |
MTFAA (Non-cau., LSA) | 2023 | 1.5 | 3.30 | 95.3 | 4.45 | 3.73 | 3.90 |
To further investigate the patterns of spectral attention, we first plot the attention figures generated from clean speech of male and female speakers.
The patterns are related to speech characteristics: harmonics are mostly distributed in low bands (top-left corner of the attention plot), while consonants (almost randomly distributed) are in high bands (bottom-right corner of the attention plot). It can also be seen that the female speaker's pitch is higher than the male speaker's, with larger intervals between harmonics. To be clear, the harmonic-related lines are not parallel to each other because the frequencies are on the ERB scale.
Then we plot the attention figures generated from noisy speech. The attention plots in the decoder are given below, where the harmonic-related features are highlighted.
Inspired by ALiBi, we also conduct experiments on multi-scale local spectral attention (MSLSA), as shown in the figure below.
The performance of MSLSA can be seen in the tables below.
Full-band metrics. STOI and SiSDR (dB) are reported per SNR range (dB); LSD (dB) is reported per frequency band (kHz).

Model | STOI (-5~0) | STOI (0~15) | STOI (Ovrl.) | SiSDR (-5~0) | SiSDR (0~15) | SiSDR (Ovrl.) | LSD (0~8 kHz) | LSD (8~24 kHz) | LSD (Full) |
---|---|---|---|---|---|---|---|---|---|
Noisy | 0.687 | 0.805 | 0.771 | -2.515 | 7.971 | 5.166 | 18.37 | 12.38 | 14.38 |
MTFAA | 0.805 | 0.876 | 0.856 | 10.10 | 15.74 | 14.23 | 10.33 | 9.349 | 9.678 |
MTFAA-LSA | 0.809 | 0.881 | 0.860 | 10.34 | 16.20 | 14.63 | 9.840 | 8.636 | 9.037 |
MTFAA-MSLSA | 0.809 | 0.880 | 0.859 | 10.43 | 15.98 | 14.50 | 10.03 | 8.623 | 9.094 |

Wideband metrics, reported per SNR range (dB).

Model | PESQ (-5~0) | PESQ (0~15) | PESQ (Ovrl.) | CSIG (-5~0) | CSIG (0~15) | CSIG (Ovrl.) | CBAK (-5~0) | CBAK (0~15) | CBAK (Ovrl.) | COVL (-5~0) | COVL (0~15) | COVL (Ovrl.) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Noisy | 1.160 | 1.446 | 1.364 | 2.023 | 2.719 | 2.517 | 1.833 | 2.481 | 2.293 | 1.571 | 2.095 | 1.943 |
MTFAA | 1.981 | 2.669 | 2.470 | 3.465 | 4.113 | 3.925 | 2.951 | 3.523 | 3.357 | 2.754 | 3.436 | 3.238 |
MTFAA-LSA | 2.084 | 2.795 | 2.589 | 3.517 | 4.203 | 4.004 | 3.006 | 3.593 | 3.423 | 2.829 | 3.547 | 3.339 |
MTFAA-MSLSA | 2.077 | 2.772 | 2.571 | 3.500 | 4.167 | 3.974 | 3.013 | 3.589 | 3.422 | 2.820 | 3.517 | 3.314 |

Wideband SSNR (dB), reported per SNR range (dB).

Model | SSNR (-5~0) | SSNR (0~15) | SSNR (Ovrl.) |
---|---|---|---|
Noisy | -2.291 | 4.19 | 2.307 |
MTFAA | 6.550 | 10.13 | 9.094 |
MTFAA-LSA | 6.609 | 10.26 | 9.200 |
MTFAA-MSLSA | 6.779 | 10.38 | 9.338 |
The best performance of MSLSA is slightly worse than that of conventional LSA. However, the training process suggests that MSLSA may give a more stable result, with a higher mean and lower variance of PESQ on the validation set, as shown in the figure below (statistics based on the last 100 epochs).
Average attention plots of different heads in MSLSA are also given. The global attention head (Nl=F') fails to capture clear spectral features, while the local one (Nl=F'/4) does.
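The multi-scale idea, in which different attention heads use different values of Nl, can be sketched by stacking one band mask per head. This is a minimal illustration with assumptions: the function name is hypothetical, the band convention |i - j| <= Nl is assumed, and only the F' and F'/4 scales are mentioned in this README, so the intermediate F'/2 head is added purely for illustration.

```python
import numpy as np


def multi_scale_masks(num_bands, n_l_per_head):
    """Stack one band mask per head, giving shape (num_heads, num_bands, num_bands)."""
    idx = np.arange(num_bands)
    dist = np.abs(idx[:, None] - idx[None, :])
    return np.stack([dist <= n_l for n_l in n_l_per_head])


num_bands = 16  # hypothetical F'
# One global head (Nl = F') and progressively more local heads.
masks = multi_scale_masks(num_bands, [num_bands, num_bands // 2, num_bands // 4])
```

Each head's mask can then be applied to that head's attention scores before the softmax, in the same way as a single-scale band mask.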