kehuantiantang closed 2 years ago
I have the same question. I did not find any range or description for these quality-based parameters in the repo. It would be great to add them to the README.
Hi,
For the overall quality MOS and the four quality dimensions, the range is [1, 5], where 1 is poor quality and 5 is excellent quality. BTW, the quality dimensions (Noisiness, Coloration, Discontinuity, Loudness) cannot be used for synthetic speech. To predict the Naturalness of synthetic speech, use the nisqa_tts.tar weights.
Let me know if anything is still unclear. I'll try to add some more info to the readme or in the wiki.
@gabrielmittag Hi, sorry to bring this up again, but I just wanted to further clarify the score when it comes to the loudness dimension. For example, if I get a 1 in loudness, does that mean the speech is too quiet, or does it mean the speech is so loud that it is peaking which is bad in a different way?
For noisiness, coloration and discontinuity I take it that we have less of each the closer to 5 we are. I.e. an audio clip that is not very noisy is closer to 5.
Hi,
That's correct for noisiness, coloration and discontinuity. For Loudness, the score represents how optimal the loudness is. That is, a sample with non-optimal loudness (either too loud or too quiet) will be rated with a lower score.
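To make the reading of the scores concrete, here is a minimal Python sketch that picks the weakest dimension from one set of predictions and phrases it according to the explanation above. The dictionary keys and both function names are hypothetical and chosen only for illustration; they are not the column names of the model output, so map them to whatever your prediction results actually contain.

```python
# Hypothetical keys for the four dimension scores (not the model's own column names).
DIMENSIONS = ("noisiness", "coloration", "discontinuity", "loudness")


def weakest_dimension(scores):
    """Return the (name, score) pair with the lowest score.

    All four scores share the same direction: 5 is best, 1 is worst.
    """
    name = min(DIMENSIONS, key=lambda n: scores[n])
    return name, scores[name]


def describe(scores):
    """Phrase the weakest dimension in words."""
    name, value = weakest_dimension(scores)
    if name == "loudness":
        # A low loudness score does not say in which direction the level is off.
        return f"loudness ({value:.1f}): level is non-optimal, either too loud or too quiet"
    return f"{name} ({value:.1f}): this degradation is the most pronounced"


example = {"noisiness": 4.2, "coloration": 3.9, "discontinuity": 4.5, "loudness": 2.1}
print(describe(example))
```

Note that the loudness score alone cannot tell you whether the clip is too loud or too quiet; for that you would have to measure the level of the signal itself.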
Here is a brief explanation of the different dimensions:
The following graph shows the average loudness predictions of the model vs. the active speech level in dBov. The optimal level is around -26 dBov because most samples in the dataset were normalized to that level (apart from those that were deliberately given a non-optimal loudness).
Figure source: https://link.springer.com/book/10.1007/978-3-030-91479-0
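As a rough way to check where a clip sits relative to the -26 dBov level mentioned above, the following Python sketch estimates the level of a signal and the gain needed to reach the target. It rests on two assumptions: samples are floats scaled so that full scale is 1.0, and the level is a plain RMS over the whole clip. Active speech level is measured only over the speech-active parts of the signal, so a plain RMS that includes pauses will read lower; treat this as an approximation only. The function names are made up for this example.

```python
import math

TARGET_DBOV = -26.0  # level most dataset samples were normalized to, per the comment above


def rms_level_dbov(samples):
    """Plain RMS level in dB relative to full scale (1.0); an approximation of dBov."""
    if not samples:
        raise ValueError("samples must not be empty")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 10.0 * math.log10(mean_square)


def gain_to_target_db(samples, target=TARGET_DBOV):
    """Gain in dB to apply so the RMS level reaches the target level."""
    return target - rms_level_dbov(samples)


# A constant signal at amplitude 0.1 has an RMS level of about -20 dB,
# so it would need roughly 6 dB of attenuation to reach -26.
clip = [0.1] * 1000
print(round(rms_level_dbov(clip), 2), round(gain_to_target_db(clip), 2))
```

A positive result from `gain_to_target_db` means the clip is quieter than the target, a negative one means it is louder.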
Thank you so much for your contribution. The network outputs five speech quality dimensions at inference. I would like to make sure that my understanding of some of the metrics is correct.
Thank you so much for your help. Have a nice day.