Herein, we propose a novel, end-to-end, and non-intrusive speech quality evaluation model, termed Quality-Net, based on bidirectional long short-term memory (BLSTM). In addition, to prevent Quality-Net from becoming an incomprehensible black box, its structure is designed to automatically learn (infer) a reasonable frame-level quality. This gives Quality-Net the ability to locate the degraded regions in an utterance. Although our ultimate goal is to learn the mapping function of the human listening perception, an off-the-shelf data set with labels that meets our requirements does not exist (here, we focus on predicting the quality of noisy speech and enhanced speech given by a deep-learning-based speech enhancement model). Therefore, we apply Quality-Net to predict the PESQ scores without a clean reference.
Quality-Net is the first "end-to-end", and "non-intrusive" quality assessment model (as shown in Fig. 1) to yield "frame-level" quality.
For more details and evaluation results, please check out our paper.
If you find the code useful in your research, please cite:
@inproceedings{fu2018quality,
title={Quality-Net: An end-to-end non-intrusive speech quality assessment model based on blstm},
author={Fu, Szu-Wei and Tsao, Yu and Hwang, Hsin-Te and Wang, Hsin-Min},
booktitle={Interspeech},
year={2018}}
e-mail: jasonfu@iis.sinica.edu.tw or d04922007@ntu.edu.tw