Xiaobin-Rong / deepvqe

An unofficial implementation of DeepVQE proposed by Microsoft Corp.

DeepVQE

A PyTorch implementation of DeepVQE, described in "DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo Cancellation, Noise Suppression and Dereverberation".

About DeepVQE

DeepVQE is a speech enhancement (SE) model proposed by Microsoft for joint echo cancellation, noise suppression and dereverberation. It outperforms the top-ranked models in both the 2023 DNS Challenge and the 2023 AEC Challenge.

DeepVQE uses the U-Net architecture as its backbone and makes several improvements to it.

Our purpose

We implement DeepVQE in order to compare its SE performance with that of two other SOTA SE models, DPCRN and TF-GridNet. To this end, we modify parts of the experimental setup used in the original paper.

We are also interested in the inference speed reported in the paper, i.e., a relatively fast 3.66 ms per frame despite the model's high complexity. We therefore also provide a streaming version of DeepVQE, which is used to evaluate its inference speed.
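As a rough illustration of what streaming inference involves, the sketch below processes a signal one sample at a time while caching past inputs, and checks that the result matches offline processing. It is a minimal, pure-Python sketch of the general idea for a single causal convolution, not the streaming code in this repository; the names `causal_conv_offline` and `StreamingCausalConv` are hypothetical.

```python
def causal_conv_offline(signal, kernel):
    """Offline reference: y[t] = sum_k kernel[k] * x[t - k], zeros before t = 0."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:
                acc += w * signal[t - k]
        out.append(acc)
    return out


class StreamingCausalConv:
    """Hypothetical streaming layer: one input per call, past inputs kept in a cache."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        # The cache holds the previous len(kernel) - 1 inputs, newest first.
        self.cache = [0.0] * (len(self.kernel) - 1)

    def step(self, x):
        window = [x] + self.cache  # window[k] corresponds to x[t - k]
        y = sum(w * v for w, v in zip(self.kernel, window))
        self.cache = window[:-1]  # drop the oldest input, keep the rest
        return y


signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.25]

layer = StreamingCausalConv(kernel)
streamed = [layer.step(x) for x in signal]
offline = causal_conv_offline(signal, kernel)
print(streamed)
print(offline)
```

Because the layer only looks at the current and past inputs, the streamed output matches the offline output, which is the property a streaming model needs in order to be timed frame by frame.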

Requirements

einops
numpy
onnx
onnxruntime
onnxsim
ptflops
torch==1.11.0

Results

1. SE performance

Unfortunately, we find that DeepVQE outperforms DPCRN by only a very limited margin, while requiring much more computational resources (see below). In addition, DeepVQE lags behind TF-GridNet by a relatively large margin in terms of SE performance.

Model Param. (M) FLOPs (G)
DPCRN 0.81 3.73
TF-GridNet 1.60 22.23
DeepVQE 7.51 8.04
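
The cost gap in the table can be made concrete with a little arithmetic. The sketch below simply divides each model's figures by DPCRN's; the numbers are copied from the table, while the names `MODELS` and `relative_cost` are illustrative.

```python
# (parameters in millions, FLOPs in G), copied from the table above.
MODELS = {
    "DPCRN": (0.81, 3.73),
    "TF-GridNet": (1.60, 22.23),
    "DeepVQE": (7.51, 8.04),
}


def relative_cost(models, baseline="DPCRN"):
    """Return (parameter ratio, FLOPs ratio) of every model against the baseline."""
    base_params, base_flops = models[baseline]
    return {
        name: (params / base_params, flops / base_flops)
        for name, (params, flops) in models.items()
    }


for name, (p_ratio, f_ratio) in relative_cost(MODELS).items():
    print(f"{name}: {p_ratio:.2f}x params, {f_ratio:.2f}x FLOPs")
# DPCRN: 1.00x params, 1.00x FLOPs
# TF-GridNet: 1.98x params, 5.96x FLOPs
# DeepVQE: 9.27x params, 2.16x FLOPs
```

In this comparison DeepVQE has roughly nine times the parameters and about twice the FLOPs of DPCRN, while TF-GridNet has the highest FLOPs of the three.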

2. Inference speed

Somewhat surprisingly, although DeepVQE requires substantial computational resources, it achieves a relatively good real-time factor of 0.2, which is consistent with the figure reported in the paper.
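
For reference, the real-time factor (RTF) is the processing time divided by the duration of the audio processed, so values below 1 mean faster than real time. The following is a minimal sketch of how such a figure could be measured for frame-by-frame inference; it is not the benchmark script from this repository. The function names, the dummy `process_frame` callable, and the 10 ms hop are all assumptions chosen for illustration.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent processing / duration of the audio that was processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_streaming_rtf(process_frame, frames, hop_seconds):
    """Time a per-frame callable over all frames and return the resulting RTF.

    Each frame is assumed to advance the audio by hop_seconds.
    """
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(frames) * hop_seconds)


# Illustrative numbers only (not measurements): 2 ms of compute per 10 ms hop.
example_rtf = real_time_factor(0.002, 0.010)
print(f"{example_rtf:.2f}")

# Timing a dummy per-frame function; a real test would call the streaming model.
dummy_rtf = measure_streaming_rtf(lambda frame: None, [0] * 100, hop_seconds=0.010)
print(dummy_rtf < 1.0)
```

When timing a real model, it is usually worth running a few warm-up frames first and averaging over many frames, since the first calls often include one-off initialization costs.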