ZhongshuHou / LSA

Ablation study of local spectral attention (LSA) for full-band speech enhancement (SE)
MIT License

Local Spectral Attention for Full-band Speech Enhancement

Visualization of local spectral attention


Repository description

This repository conducts ablation studies on local attention (a.k.a. band attention) applied to the full-band spectrum, namely local spectral attention (LSA). In two full-band speech enhancement (SE) models with spectral attention, the conventional global attention is replaced with LSA, which at each frequency attends only to adjacent bands. One model is DPARN, whose source code can be found at https://github.com/Qinwen-Hu/dparn.
The other model is the Multi-Scale Temporal Frequency with Axial Attention (MTFAA) network, which ranked 1st in the DNS-4 challenge for full-band SE; a detailed description can be found in the paper https://ieeexplore.ieee.org/document/9746610. Here we release an unofficial PyTorch implementation of MTFAA as well as its modification. This work has been submitted to Interspeech 2023.
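The difference between global and local spectral attention can be sketched with a band mask applied to the attention scores along the frequency axis. This is a minimal NumPy illustration, not the code in this repository: the function names, the toy sizes, and the reading of `n_l` as the number of neighbouring bands visible on each side of a query band are all assumptions made for the example.

```python
import numpy as np

def local_band_mask(num_bands: int, n_l: int) -> np.ndarray:
    """Boolean (num_bands, num_bands) mask: True where query band i may attend key band j.

    Assumption: n_l is the number of neighbouring bands visible on each side.
    """
    idx = np.arange(num_bands)
    return np.abs(idx[:, None] - idx[None, :]) <= n_l

def spectral_attention(q, k, v, mask=None):
    """Scaled dot-product attention over the frequency axis.

    q, k, v have shape (num_bands, dim). mask=None gives global attention.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        # Bands outside the local window get zero weight after the softmax.
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
num_bands, dim = 16, 8  # toy sizes, chosen only for the example
q, k, v = (rng.standard_normal((num_bands, dim)) for _ in range(3))
_, w_global = spectral_attention(q, k, v)
_, w_local = spectral_attention(q, k, v, local_band_mask(num_bands, n_l=2))
```

In this sketch `w_global` is dense, while each row of `w_local` is non-zero only within two bands of the diagonal.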

Requirements

soundfile: 0.10.3
librosa: 0.8.1
torch: 3.7.10
numpy: 1.20.3
scipy: 1.7.2
pandas: 1.3.4
tqdm: 4.62.3

Network training

Data preparation

Split your speech and noise audio into 10-second segments and generate .csv files to manage the data. Put your RIR audio files (.wav format) in one folder. Edit the .csv paths in Dataloader.py:

   TRAIN_NOISE_CSV = './train_noise_data.csv'  
   VALID_CLEAN_CSV = './valid_clean_data.csv'  
   VALID_NOISE_CSV = './valid_noise_data.csv'  
   RIR_DIR = 'directory containing the RIR .wav files'

where the .csv files for clean speech are organized as

file_dir snr
./clean_0001.wav 4
./clean_0002.wav -1
./clean_0003.wav 0
... ...

and the .csv files for noise are organized as

file_dir
./noise_0001.wav
./noise_0002.wav
./noise_0003.wav
...

Here 'file_dir' is the absolute path to an audio file and 'snr' is the signal-to-noise ratio (SNR) assigned to it.
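The .csv files described above could be generated with a short script such as the one below. It is only a sketch: the helper names are hypothetical, the files are assumed to be comma-delimited with a header row, and drawing one random integer SNR per clean file is an assumption about how the 'snr' column is meant to be filled. Check Dataloader.py for the exact format it expects.

```python
import csv
import random
from pathlib import Path

def write_clean_csv(wav_dir, csv_path, snr_range=(-5, 5), seed=0):
    """Write a 'file_dir,snr' csv listing every .wav file in wav_dir.

    Assumption: each clean file gets one integer SNR drawn uniformly from snr_range.
    """
    rng = random.Random(seed)
    rows = [(str(p.resolve()), rng.randint(*snr_range))
            for p in sorted(Path(wav_dir).glob("*.wav"))]
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file_dir", "snr"])
        writer.writerows(rows)
    return len(rows)

def write_noise_csv(wav_dir, csv_path):
    """Write a single-column 'file_dir' csv listing every .wav file in wav_dir."""
    paths = sorted(Path(wav_dir).glob("*.wav"))
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file_dir"])
        writer.writerows([str(p.resolve())] for p in paths)
    return len(paths)
```

For example, `write_noise_csv("./noise_segments", "./train_noise_data.csv")` would list the noise segments stored in a (hypothetical) `./noise_segments` folder.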

Start training

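After preparing the environment and the data, start training with the command:

python Network_Training_MTFAA_full.py -m <model> -c <checkpoint_dir> -e <epochs> -d <device>

where <model> is the model to train (MTFAA, MTFAA_LSA or MTFAA_ASqBi), <checkpoint_dir> is the directory in which the checkpoint files are saved, <epochs> is the number of training epochs (default 300), and <device> is the device used for training (e.g. cuda:0).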

Inference

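Enhance noisy audio with the command:

python Infer.py -m <model> -c <checkpoint_path> -t <noisy_dir> -s <enhanced_dir> -d <device>

where <model> is the model to use (MTFAA, MTFAA_LSA or MTFAA_ASqBi), <checkpoint_path> is the path from which the checkpoint files are loaded, <noisy_dir> is the folder containing the noisy audio, <enhanced_dir> is the folder in which the enhanced clips are saved, and <device> is the device to run on (e.g. cuda:0).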

Ablation study and experiment results

We demonstrate the effectiveness of our proposed method on the full-band dataset of the 4th DNS challenge. The total training dataset contains around 1000 hours of speech and 220 hours of noise. Room impulse responses are convolved with clean speech to generate simulated reverberant speech, which is kept as the training target. In the training stage, reverberant utterances are mixed with noise recordings at SNRs ranging from -5 dB to 5 dB at 1 dB intervals. For the test set, 800 clips of reverberant utterances are mixed with unseen noise types at SNRs ranging from -5 dB to 15 dB. Each test clip is 5 seconds long. All utterances are sampled at 48 kHz in our experiments. We also conduct experiments on the well-known VCTK-DEMAND dataset for comprehensive validation.
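The data simulation described above (convolve with an RIR, then mix with noise at a chosen SNR) can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline in Dataloader.py: the signals are random placeholders, the RIR is a synthetic decaying noise burst, and the SNR is defined on the average power of the whole clip.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, eps=1e-12):
    """Scale noise so that 10*log10(P_speech / P_noise) equals snr_db, then add it."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + eps
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # placeholder for a clean speech segment
noise = rng.standard_normal(16000)   # placeholder for a noise segment
# Toy RIR: exponentially decaying noise (hypothetical, stands in for a measured RIR).
rir = rng.standard_normal(800) * np.exp(-np.arange(800) / 100.0)

# The reverberant speech is the training target.
reverberant = np.convolve(clean, rir)[: len(clean)]
# Training SNRs: integers from -5 dB to 5 dB in 1 dB steps.
snr_db = int(rng.integers(-5, 6))
noisy = mix_at_snr(reverberant, noise, snr_db)
```

The network input in this sketch is `noisy` and the target is `reverberant`, so the model is asked to remove noise but not reverberation.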

LSA on MTFAA and DPARN

The visualization of LSA mechanism can be seen in the figure below:

The unofficial PyTorch implementations of MTFAA and its LSA-based variant can be found in MTFAA_Net_full.py and MTFAA_Net_full_local_atten.py respectively. For DPARN, readers may refer to https://github.com/Qinwen-Hu/dparn.
First, we conduct experiments with different settings of Nl on the VCTK-DEMAND dataset. The results are listed in the table below:

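PESQ, CSIG, CBAK and COVL are wideband metrics; STOI and SiSDR are full-band metrics.

| Model | Nl | PESQ | CSIG | CBAK | COVL | STOI (%) | SiSDR (dB) |
|---|---|---|---|---|---|---|---|
| MTFAA | F'/2 | 3.16 | 4.34 | 3.63 | 3.77 | 94.7 | 18.5 |
| MTFAA | F'/4 | 3.15 | 4.32 | 3.58 | 3.76 | 94.6 | 18.1 |
| MTFAA | Sqrt(F) | 3.16 | 4.35 | 3.61 | 3.78 | 94.7 | 18.8 |
| DPARN | F'/2 | 2.96 | 4.29 | 3.63 | 3.68 | 94.2 | 18.7 |
| DPARN | F'/4 | 2.95 | 4.27 | 3.65 | 3.68 | 94.2 | 18.8 |
| DPARN | Sqrt(F) | 2.94 | 4.27 | 3.62 | 3.67 | 94.1 | 18.5 |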

It can be seen that the setting of Nl affects the two models differently, so we choose the setting that achieves the best performance for each model, i.e. sqrt(F') for MTFAA and F'/2 for DPARN. Next, we train the models on the larger DNS4 dataset. The training process can be seen in the figures below, where both LSA-based models converge better than the original models.

The objective test results are listed in the tables below.

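Full-band metrics. STOI and SiSDR (dB) are reported per input SNR range (dB); LSD (dB) is reported per frequency band (kHz).

| Model | STOI, -5~0 | STOI, 0~15 | STOI, Ovrl. | SiSDR, -5~0 | SiSDR, 0~15 | SiSDR, Ovrl. | LSD, 0~8 | LSD, 8~24 | LSD, Full. |
|---|---|---|---|---|---|---|---|---|---|
| Noisy | 0.687 | 0.805 | 0.771 | -2.515 | 7.971 | 5.166 | 18.37 | 12.38 | 14.38 |
| MTFAA | 0.805 | 0.876 | 0.856 | 10.10 | 15.74 | 14.23 | 10.33 | 9.349 | 9.678 |
| MTFAA-LSA | 0.809 | 0.881 | 0.860 | 10.34 | 16.20 | 14.63 | 9.840 | 8.636 | 9.037 |
| DPARN | 0.752 | 0.858 | 0.828 | 8.461 | 13.71 | 12.31 | 10.92 | 13.11 | 12.38 |
| DPARN-LSA | 0.757 | 0.861 | 0.831 | 8.617 | 13.84 | 12.47 | 10.76 | 12.99 | 12.25 |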

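Wideband metrics, reported per input SNR range (dB).

| Model | PESQ, -5~0 | PESQ, 0~15 | PESQ, Ovrl. | CSIG, -5~0 | CSIG, 0~15 | CSIG, Ovrl. | CBAK, -5~0 | CBAK, 0~15 | CBAK, Ovrl. | COVL, -5~0 | COVL, 0~15 | COVL, Ovrl. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Noisy | 1.160 | 1.446 | 1.364 | 2.023 | 2.719 | 2.517 | 1.833 | 2.481 | 2.293 | 1.571 | 2.095 | 1.943 |
| MTFAA | 1.981 | 2.669 | 2.470 | 3.465 | 4.113 | 3.925 | 2.951 | 3.523 | 3.357 | 2.754 | 3.436 | 3.238 |
| MTFAA-LSA | 2.084 | 2.795 | 2.589 | 3.517 | 4.203 | 4.004 | 3.006 | 3.593 | 3.423 | 2.829 | 3.547 | 3.339 |
| DPARN | 1.702 | 2.309 | 2.134 | 3.136 | 3.759 | 3.580 | 2.505 | 2.859 | 2.757 | 2.447 | 3.069 | 2.890 |
| DPARN-LSA | 1.776 | 2.423 | 2.237 | 3.179 | 3.829 | 3.642 | 2.619 | 3.030 | 2.912 | 2.507 | 3.166 | 2.977 |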
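For readers unfamiliar with the SiSDR column, the scale-invariant signal-to-distortion ratio can be computed as sketched below. The evaluation script used for the tables is not shown here, so this follows the standard definition; removing the mean before projection is a common convention and is an assumption about how the reported numbers were obtained.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between an enhanced signal and its reference."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    residual = estimate - target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(residual, residual) + eps))

rng = np.random.default_rng(0)
reference = rng.standard_normal(48000)                      # placeholder target signal
estimate = reference + 0.1 * rng.standard_normal(48000)     # placeholder enhanced signal
score_db = si_sdr(estimate, reference)
```

Because of the projection step, rescaling `estimate` by any non-zero constant leaves `score_db` unchanged, which is why the metric does not reward or penalize output gain.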

 

The proposed LSA improves the enhancement performance of both the causal DPARN and MTFAA models in terms of all objective metrics. To reveal the benefit of the LSA mechanism, we visualize the normalized average spectral attention plots of the attention blocks in both the original MTFAA and the LSA-based MTFAA, generated from audio in the test set, as shown in the figures below.

The fifth attention layer shows that the LSA-based model more effectively emphasizes the harmonic structure in the low bands (marked with red boxes) and the almost randomly distributed components in the high bands (marked with black boxes). The blue boxes further show that LSA suppresses the modeling of invalid correlations between the low bands and the high bands. Hence, LSA models the speech pattern in the spectrum better. Further investigation of the enhanced signals reveals that global attention in the frequency domain is more likely to distort speech components or leave excessive residual noise in non-speech segments, and that the proposed LSA alleviates this problem. Two typical examples are shown in Figure 3, where the benefit of LSA can be clearly seen. A possible explanation is that better exploitation of the speech pattern helps the LSA-based model discriminate speech from noise more effectively, especially in low-SNR environments.

To further demonstrate the importance of modeling local correlation in the spectrum for full-band SE tasks, we also compare local attention with a recently proposed biased attention method, namely Attention with Linear Biases (ALiBi), which adds to the attention scores a negative bias whose magnitude grows linearly with the distance between the key and the query, for efficient extrapolation. Its application to spectral attention can be seen in the figure below.

We modify the penalty so that it decreases quadratically with distance for better performance and call the method ASqBi, as indicated in the figures below.
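The two bias schemes can be compared with a small sketch. It is not taken from MTFAA_Net_full_F_ASqbi.py: the function name, the slope value, and the exact quadratic form (slope times squared band distance) are assumptions chosen to illustrate the idea.

```python
import numpy as np

def distance_bias(num_bands, slope, power):
    """Bias matrix -slope * |i - j|**power added to attention scores.

    power=1 gives an ALiBi-style linear penalty; power=2 gives a squared penalty
    in the spirit of ASqBi (the exact form used in the repository may differ).
    """
    idx = np.arange(num_bands)
    dist = np.abs(idx[:, None] - idx[None, :]).astype(float)
    return -slope * dist ** power

def softmax_rows(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

alibi = distance_bias(6, slope=0.5, power=1)   # slope 0.5 is a hypothetical value
asqbi = distance_bias(6, slope=0.5, power=2)

# With all-zero raw scores, the attention weights are determined by the bias alone.
w_alibi = softmax_rows(np.zeros((6, 6)) + alibi)
w_asqbi = softmax_rows(np.zeros((6, 6)) + asqbi)
```

Unlike a hard local mask, both biases also penalize keys inside the local neighbourhood, only by a smaller amount than distant keys.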

The modified method is combined with MTFAA (see MTFAA_Net_full_F_ASqbi.py), and the ablation test results can be found in the tables below. It can be seen that the overall performance degrades compared with LSA. A possible explanation is that the negative bias added within the local attention region weakens the model's ability to extract local intercorrelation.

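Full-band metrics. STOI and SiSDR (dB) are reported per input SNR range (dB); LSD (dB) is reported per frequency band (kHz).

| Model | STOI, -5~0 | STOI, 0~15 | STOI, Ovrl. | SiSDR, -5~0 | SiSDR, 0~15 | SiSDR, Ovrl. | LSD, 0~8 | LSD, 8~24 | LSD, Full. |
|---|---|---|---|---|---|---|---|---|---|
| MTFAA-ASqBi | 0.811 | 0.881 | 0.860 | 10.425 | 15.944 | 14.468 | 10.307 | 9.495 | 9.766 |
| MTFAA-LSA | 0.809 | 0.881 | 0.860 | 10.347 | 16.201 | 14.635 | 9.840 | 8.636 | 9.037 |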

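Wideband metrics, reported per input SNR range (dB).

| Model | PESQ, -5~0 | PESQ, 0~15 | PESQ, Ovrl. | CSIG, -5~0 | CSIG, 0~15 | CSIG, Ovrl. | CBAK, -5~0 | CBAK, 0~15 | CBAK, Ovrl. | COVL, -5~0 | COVL, 0~15 | COVL, Ovrl. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MTFAA-ASqBi | 2.064 | 2.769 | 2.564 | 3.487 | 4.165 | 3.968 | 2.987 | 3.558 | 3.392 | 2.804 | 3.513 | 3.308 |
| MTFAA-LSA | 2.084 | 2.795 | 2.589 | 3.517 | 4.203 | 4.004 | 3.006 | 3.593 | 3.423 | 2.829 | 3.547 | 3.339 |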

We also conduct a subjective listening preference test on the MTFAA model to validate the benefit of the LSA mechanism. 50 enhanced samples and their reference target speech are randomly selected from the test set. 15 listeners with normal hearing compare the enhanced results against the reference speech and choose the preferred result. Each sample is evaluated by at least 3 listeners. The results of the subjective listening preference test are given in the table below. Over 60% of the samples enhanced by the LSA-based MTFAA are considered to have better perceptual quality and lower noise levels, which demonstrates the effectiveness of LSA in full-band SE tasks.

| Model | LSA | Preference (%) |
|---|---|---|
| MTFAA | No | 38.0 |
| MTFAA | Yes | 62.0 |

 

The proposed method also reduces the computational complexity of spectral attention. The statistics are given in the table below.

| Model | Percentage of complexity reduction in spectral attention (%) |
|---|---|
| MTFAA | 63.2 |
| DPARN | 25.4 |
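One way to see where such savings come from is to count how many query-key band pairs a local window keeps. The sketch below is illustrative only and is not meant to reproduce the figures in the table: the band count and window size are hypothetical, `n_l` is read as a one-sided window, and the cost is assumed to be proportional to the number of pairs actually computed.

```python
def attended_pair_fraction(num_bands, n_l):
    """Fraction of (query, key) band pairs kept when each query sees n_l bands per side."""
    kept = sum(min(num_bands - 1, i + n_l) - max(0, i - n_l) + 1
               for i in range(num_bands))
    return kept / (num_bands * num_bands)

def reduction_percent(num_bands, n_l):
    """Percentage of score computations skipped relative to global attention."""
    return 100.0 * (1.0 - attended_pair_fraction(num_bands, n_l))

# Hypothetical example: 16 bands, 2 neighbours on each side.
example = reduction_percent(16, 2)
```

The saving is only realized if the masked pairs are skipped rather than computed and then zeroed out.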

 

We compare the modified MTFAA model with previous full-band SOTA methods on the VCTK-DEMAND dataset. The results are listed in the table below.

| Models | Year | Param. (M) | PESQ | STOI (%) | CSIG | CBAK | COVL |
|---|---|---|---|---|---|---|---|
| Noisy | - | - | 1.97 | 92.1 | 3.34 | 2.44 | 2.63 |
| RNNoise | 2020 | 0.06 | 2.33 | 92.2 | 3.40 | 2.51 | 2.84 |
| PercepNet | 2020 | 8.00 | 2.73 | - | - | - | - |
| CTS-Net(full) | 2020 | 7.09 | 2.92 | 94.3 | 4.22 | 3.43 | 3.62 |
| DCCRN | 2020 | 3.70 | 2.54 | 93.8 | 3.74 | 3.13 | 2.75 |
| NSNet2 | 2021 | 6.17 | 2.47 | 90.3 | 3.23 | 2.99 | 2.90 |
| S-DCCRN | 2022 | 2.34 | 2.84 | 94.0 | 4.03 | 3.43 | 2.97 |
| FullSubNet+ | 2022 | 8.67 | 2.88 | 94.0 | 3.86 | 3.42 | 3.57 |
| GaGNet | 2022 | 5.95 | 2.94 | - | 4.26 | 3.45 | 3.59 |
| DMF-Net | 2022 | 7.84 | 2.97 | 94.4 | 4.26 | 3.52 | 3.62 |
| DS-Net | 2022 | 3.30 | 2.78 | 94.3 | 4.20 | 3.34 | 3.48 |
| SF-Net | 2022 | 6.98 | 3.02 | 94.5 | 4.36 | 3.54 | 3.67 |
| DeepFilterNet2 | 2022 | 2.31 | 3.08 | 94.3 | 4.30 | 3.40 | 3.70 |
| MTFAA (Cau., LSA) | 2023 | 1.5 | 3.16 | 94.7 | 4.35 | 3.61 | 3.78 |
| MTFAA (Non-cau., LSA) | 2023 | 1.5 | 3.30 | 95.3 | 4.45 | 3.73 | 3.90 |

To further investigate the patterns of spectral attention, we first plot the attention maps generated from clean male and female speech:

Attention plots for clean male and female speech

It can be seen that the patterns are related to speech characteristics: harmonics are mainly distributed in the low bands (top-left corner of the attention plot), while consonants (almost randomly distributed) are in the high bands (bottom-right corner of the attention plot). It can also be seen that the female pitch is higher than the male pitch, with larger intervals between harmonics. To be clear, the harmonic-related lines are not parallel to each other because the frequencies are on the ERB scale.
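The effect of the ERB scale on harmonic spacing can be checked numerically. The sketch below uses the Glasberg and Moore ERB-rate formula, which is an assumption here since the exact band layout of the models is not given in this README; the 200 Hz fundamental is also a hypothetical value.

```python
import math

def hz_to_erb_rate(freq_hz):
    """ERB-rate (in ERB units) for a frequency in Hz, Glasberg and Moore formula."""
    return 21.4 * math.log10(1.0 + 0.00437 * freq_hz)

f0 = 200.0                                     # hypothetical fundamental frequency
harmonics = [f0 * n for n in range(1, 9)]      # equally spaced in Hz
erb_positions = [hz_to_erb_rate(f) for f in harmonics]
# Distance between neighbouring harmonics on the ERB axis.
erb_spacing = [b - a for a, b in zip(erb_positions, erb_positions[1:])]
```

The harmonics are 200 Hz apart in linear frequency, but `erb_spacing` shrinks as the harmonic number grows, so harmonic-related lines drawn on an ERB axis converge instead of staying parallel.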

We then plot the attention maps generated from noisy speech. The attention plots in the decoder are given below, with the harmonic-related features highlighted:

Attention plots in the decoder for noisy speech

Inspired by ALiBi, we also conduct experiments on multi-scale local spectral attention (MSLSA), as shown in the figure below.

The performance of MSLSA is given in the tables below.

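Full-band metrics. STOI and SiSDR (dB) are reported per input SNR range (dB); LSD (dB) is reported per frequency band (kHz).

| Model | STOI, -5~0 | STOI, 0~15 | STOI, Ovrl. | SiSDR, -5~0 | SiSDR, 0~15 | SiSDR, Ovrl. | LSD, 0~8 | LSD, 8~24 | LSD, Full. |
|---|---|---|---|---|---|---|---|---|---|
| Noisy | 0.687 | 0.805 | 0.771 | -2.515 | 7.971 | 5.166 | 18.37 | 12.38 | 14.38 |
| MTFAA | 0.805 | 0.876 | 0.856 | 10.10 | 15.74 | 14.23 | 10.33 | 9.349 | 9.678 |
| MTFAA-LSA | 0.809 | 0.881 | 0.860 | 10.34 | 16.20 | 14.63 | 9.840 | 8.636 | 9.037 |
| MTFAA-MSLSA | 0.809 | 0.880 | 0.859 | 10.43 | 15.98 | 14.50 | 10.03 | 8.623 | 9.094 |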

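Wideband metrics, reported per input SNR range (dB).

| Model | PESQ, -5~0 | PESQ, 0~15 | PESQ, Ovrl. | CSIG, -5~0 | CSIG, 0~15 | CSIG, Ovrl. | CBAK, -5~0 | CBAK, 0~15 | CBAK, Ovrl. | COVL, -5~0 | COVL, 0~15 | COVL, Ovrl. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Noisy | 1.160 | 1.446 | 1.364 | 2.023 | 2.719 | 2.517 | 1.833 | 2.481 | 2.293 | 1.571 | 2.095 | 1.943 |
| MTFAA | 1.981 | 2.669 | 2.470 | 3.465 | 4.113 | 3.925 | 2.951 | 3.523 | 3.357 | 2.754 | 3.436 | 3.238 |
| MTFAA-LSA | 2.084 | 2.795 | 2.589 | 3.517 | 4.203 | 4.004 | 3.006 | 3.593 | 3.423 | 2.829 | 3.547 | 3.339 |
| MTFAA-MSLSA | 2.077 | 2.772 | 2.571 | 3.500 | 4.167 | 3.974 | 3.013 | 3.589 | 3.422 | 2.820 | 3.517 | 3.314 |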

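Wideband metric SSNR (dB), reported per input SNR range (dB).

| Model | SSNR, -5~0 | SSNR, 0~15 | SSNR, Ovrl. |
|---|---|---|---|
| Noisy | -2.291 | 4.19 | 2.307 |
| MTFAA | 6.550 | 10.13 | 9.094 |
| MTFAA-LSA | 6.609 | 10.26 | 9.200 |
| MTFAA-MSLSA | 6.779 | 10.38 | 9.338 |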

The best performance of MSLSA is slightly worse than that of conventional LSA. However, the training process suggests that MSLSA may give a more stable result, with a higher mean and a lower variance of PESQ on the validation set, as shown in the figure below (statistics based on the last 100 epochs).

Average attention plots of the different heads in MSLSA are also given. It can be seen that the global attention head (Nl=F') cannot exploit clear spectral features, while the local one (Nl=F'/4) can.

Average attention plots of different heads in MSLSA
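The multi-scale idea, in which different attention heads see neighbourhoods of different sizes, can be sketched by giving each head its own band mask. This is a hypothetical illustration rather than the repository's MSLSA code: the number of heads, the per-head window sizes, and the reading of Nl as a one-sided window are assumptions.

```python
import numpy as np

def multi_scale_masks(num_bands, widths):
    """Stack one boolean band mask per head, shape (num_heads, num_bands, num_bands)."""
    idx = np.arange(num_bands)
    dist = np.abs(idx[:, None] - idx[None, :])
    return np.stack([dist <= w for w in widths])

num_bands = 16  # toy value standing in for F'
# One global head (Nl = F'), then progressively more local heads.
masks = multi_scale_masks(num_bands, widths=[num_bands, num_bands // 2, num_bands // 4])

rng = np.random.default_rng(0)
scores = rng.standard_normal((3, num_bands, num_bands))   # placeholder per-head scores
masked = np.where(masks, scores, -np.inf)
masked = masked - masked.max(axis=-1, keepdims=True)
weights = np.exp(masked)
weights = weights / weights.sum(axis=-1, keepdims=True)
```

In this sketch the first head is effectively global, while the last head only attends within four bands of each query.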