A PyTorch implementation of DeepVQE, as described in *DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo Cancellation, Noise Suppression and Dereverberation*.
DeepVQE is a speech enhancement (SE) model proposed by Microsoft for joint echo cancellation, noise suppression and dereverberation, which outperforms the top-1 models in both the 2023 DNS Challenge and the 2023 AEC Challenge.
DeepVQE uses a U-Net architecture as its backbone while introducing several improvements over the plain encoder-decoder design.
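As a rough illustration of what such a U-Net-style backbone looks like for spectral speech enhancement, here is a minimal sketch in PyTorch. It is not the DeepVQE model in this repo: the DeepVQE-specific improvements are omitted, and the channel counts, the GRU bottleneck and the 257-bin (512-point FFT) input are placeholder assumptions.

```python
# Illustrative U-Net-style skeleton only; NOT the full DeepVQE architecture.
# All layer sizes and the 257-bin STFT input are placeholders.
import torch
import torch.nn as nn


class EncoderBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Downsample along the frequency axis only (stride 2 on the last dim).
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=(1, 3),
                              stride=(1, 2), padding=(0, 1))
        self.norm = nn.BatchNorm2d(out_ch)
        self.act = nn.ELU()

    def forward(self, x):
        return self.act(self.norm(self.conv(x)))


class DecoderBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Mirror the encoder: upsample along the frequency axis.
        self.conv = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=(1, 3),
                                       stride=(1, 2), padding=(0, 1))
        self.norm = nn.BatchNorm2d(out_ch)
        self.act = nn.ELU()

    def forward(self, x, skip):
        # U-Net skip connection: concatenate the matching encoder output.
        return self.act(self.norm(self.conv(torch.cat([x, skip], dim=1))))


class UNetSkeleton(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = EncoderBlock(2, 16)          # 257 -> 129 freq bins
        self.enc2 = EncoderBlock(16, 32)         # 129 -> 65 freq bins
        # Recurrent bottleneck over time on flattened (channel x freq) features.
        self.gru = nn.GRU(32 * 65, 256, batch_first=True)
        self.proj = nn.Linear(256, 32 * 65)
        self.dec2 = DecoderBlock(32 + 32, 16)    # 65 -> 129 freq bins
        self.dec1 = DecoderBlock(16 + 16, 2)     # 129 -> 257 freq bins

    def forward(self, x):                        # x: (batch, 2, frames, 257)
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        b, c, t, f = e2.shape
        h, _ = self.gru(e2.permute(0, 2, 1, 3).reshape(b, t, c * f))
        h = self.proj(h).reshape(b, t, c, f).permute(0, 2, 1, 3)
        d2 = self.dec2(h, e2)
        return self.dec1(d2, e1)                 # e.g. a complex mask/spectrum


if __name__ == "__main__":
    net = UNetSkeleton()
    mix = torch.randn(1, 2, 100, 257)            # (batch, re/im, frames, bins)
    print(net(mix).shape)                        # torch.Size([1, 2, 100, 257])
```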
We implement DeepVQE to compare its SE performance with two other SOTA SE models, DPCRN and TF-GridNet. To this end, we modify some of the experimental setup used in the original paper.
We are also interested in the inference speed reported in the paper, i.e., a relatively fast 3.66 ms per frame despite the model's large complexity. We therefore also provide a streaming version of DeepVQE, which we use to evaluate its inference speed.
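As a loose illustration of how such a streaming model could be exported for latency tests with the ONNX tooling listed in the requirements below, here is a hedged sketch. `StreamModel`, the file names and all tensor shapes are placeholder assumptions, not this repository's actual streaming interface.

```python
# Hypothetical export of a frame-wise ("stream") model to ONNX for latency tests.
# `StreamModel` is a placeholder, NOT the streaming interface of this repository;
# adapt the I/O names and shapes accordingly.
import torch
import onnx
from onnxsim import simplify


class StreamModel(torch.nn.Module):
    """Placeholder: processes one spectral frame and carries a recurrent state."""

    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(2, 16, kernel_size=(1, 3), padding=(0, 1))
        self.gru = torch.nn.GRU(16 * 257, 256, batch_first=True)
        self.out = torch.nn.Linear(256, 2 * 257)

    def forward(self, frame, state):
        # frame: (1, 2, 1, 257), state: (1, 1, 256)
        x = self.conv(frame).flatten(1).unsqueeze(1)   # (1, 1, 16 * 257)
        x, state = self.gru(x, state)
        return self.out(x).reshape(1, 2, 1, 257), state


model = StreamModel().eval()
frame = torch.randn(1, 2, 1, 257)
state = torch.zeros(1, 1, 256)

torch.onnx.export(
    model, (frame, state), "deepvqe_stream.onnx",
    input_names=["frame", "state_in"],
    output_names=["enhanced", "state_out"],
    opset_version=12,
)

# Optionally shrink/clean the exported graph with onnx-simplifier.
simplified, ok = simplify(onnx.load("deepvqe_stream.onnx"))
assert ok
onnx.save(simplified, "deepvqe_stream_sim.onnx")
```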
einops
numpy
onnx
onnxruntime
onnxsim
ptflops
torch==1.11.0
Unfortunately, we find that DeepVQE outperforms DPCRN by only a very limited margin while requiring much more computational resources (see the table below). Moreover, DeepVQE lags behind TF-GridNet by a relatively large margin in terms of SE performance.

| Model | Param. (M) | FLOPs (G) |
| --- | --- | --- |
| DPCRN | 0.81 | 3.73 |
| TF-GridNet | 1.60 | 22.23 |
| DeepVQE | 7.51 | 8.04 |
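For reference, complexity figures of this kind can be obtained with ptflops (listed in the requirements above). Below is a minimal sketch with a tiny stand-in network; the stand-in model and input shape are assumptions, so it does not reproduce the numbers in the table. Substitute the actual models and input shapes to do so.

```python
# Sketch of how Param./complexity numbers can be counted with ptflops.
# The tiny stand-in model below is NOT DeepVQE; replace it with the actual
# model and the real (channels, frames, freq_bins) input shape.
import torch
from ptflops import get_model_complexity_info

stand_in = torch.nn.Sequential(
    torch.nn.Conv2d(2, 16, kernel_size=(1, 3), padding=(0, 1)),
    torch.nn.ELU(),
    torch.nn.Conv2d(16, 2, kernel_size=(1, 3), padding=(0, 1)),
)

# The input resolution excludes the batch dimension.
macs, params = get_model_complexity_info(
    stand_in, (2, 100, 257), as_strings=True, print_per_layer_stat=False
)
# ptflops counts multiply-accumulate operations (MACs); whether a "FLOPs" figure
# equals MACs or 2x MACs depends on the convention being used.
print(f"MACs: {macs} | Params: {params}")
```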
We are surprised to find that, although DeepVQE requires large computational resources, it achieves a relatively good real-time factor of 0.2, which is consistent with the data presented in the paper.
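For completeness, here is a sketch of how a per-frame latency and real-time factor of this kind can be measured with onnxruntime, assuming a frame-wise ONNX model such as the hypothetical one exported in the sketch above. The file name, I/O names, 257-bin frames and the 16 kHz / 256-sample hop are assumptions, not the repo's actual configuration.

```python
# Hypothetical latency / RTF measurement over an exported frame-wise ONNX model.
# File name, I/O names and STFT hop are assumptions carried over from the export
# sketch above; adapt them to the actual streaming interface of this repository.
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("deepvqe_stream_sim.onnx",
                            providers=["CPUExecutionProvider"])

frame = np.random.randn(1, 2, 1, 257).astype(np.float32)  # one spectral frame
state = np.zeros((1, 1, 256), dtype=np.float32)           # recurrent state

# Warm up, carrying the state across calls as a streaming loop would.
for _ in range(10):
    _, state = sess.run(None, {"frame": frame, "state_in": state})

n_frames = 1000
start = time.perf_counter()
for _ in range(n_frames):
    enhanced, state = sess.run(None, {"frame": frame, "state_in": state})
elapsed = time.perf_counter() - start

ms_per_frame = 1000.0 * elapsed / n_frames
hop_ms = 256 / 16000 * 1000.0   # assumed 256-sample hop at 16 kHz = 16 ms
print(f"{ms_per_frame:.2f} ms/frame, RTF = {ms_per_frame / hop_ms:.2f}")
```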