gabrielmittag / NISQA

NISQA - Non-Intrusive Speech Quality and TTS Naturalness Assessment
MIT License

What is the best performance for overall quality? Higher? Lower? And the range? #21

Closed: kehuantiantang closed this issue 2 years ago

kehuantiantang commented 2 years ago

Thank you so much for your contribution. The network outputs five speech quality dimensions at inference. I want to make sure my understanding of these metrics is correct.

For better synthetic speech:
MOS: higher, [0, 5]
Noisiness: lower, range?
Coloration: higher, range?
Discontinuity: lower, range?
Loudness: higher, range?

Thank you so much for your help. Have a nice day.

nattaran commented 2 years ago

I have the same question. I did not find any range or description for these quality parameters in the repo. It would be great to add them to the README.

gabrielmittag commented 2 years ago

Hi,

For the overall quality MOS and the four quality dimensions, the range is [1, 5], where 1 is poor quality and 5 is excellent quality. By the way, the quality dimensions (Noisiness, Coloration, Discontinuity, Loudness) cannot be used for synthetic speech. To predict the Naturalness of synthetic speech, use the nisqa_tts.tar weights.

Let me know if anything is still unclear. I'll try to add some more info to the readme or in the wiki.

StianHanssen commented 1 year ago

@gabrielmittag Hi, sorry to bring this up again, but I just wanted to further clarify the score when it comes to the loudness dimension. For example, if I get a 1 in loudness, does that mean the speech is too quiet, or does it mean the speech is so loud that it is peaking which is bad in a different way?

For noisiness, coloration and discontinuity I take it that we have less of each the closer to 5 we are. I.e. an audio clip that is not very noisy is closer to 5.

gabrielmittag commented 1 year ago

Hi,

That's correct for noisiness, coloration and discontinuity. For Loudness, the score represents how optimal the loudness is: a sample with non-optimal loudness (either too loud or too quiet) will be rated with a lower score.
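To make the reading of the scores concrete, here is a minimal Python sketch of the interpretation described above. It is not part of NISQA: the dimension keys, the `explain` helper and the 3.0 threshold are all hypothetical and chosen only for illustration. The one thing taken from this thread is the direction of the scale (1 is poor, 5 is excellent, and a low Loudness score means non-optimal loudness in either direction).

```python
# Illustrative only: names and threshold are assumptions, not NISQA's API.
SCORE_MIN, SCORE_MAX = 1.0, 5.0

# What a LOW score means on each dimension (higher is always better).
LOW_SCORE_MEANING = {
    "mos": "poor overall quality",
    "noisiness": "noisy",
    "coloration": "strongly colored",
    "discontinuity": "many discontinuities",
    "loudness": "non-optimal loudness (too loud or too quiet, "
                "the score does not say which)",
}


def explain(name, score, threshold=3.0):
    """Return a one-line reading of a predicted score.

    `threshold` is an arbitrary cut-off for this sketch, not a value
    recommended by the NISQA author.
    """
    if name not in LOW_SCORE_MEANING:
        raise KeyError(f"unknown dimension: {name}")
    if not SCORE_MIN <= score <= SCORE_MAX:
        return f"{name}: {score:.2f} is outside the nominal [1, 5] range"
    if score < threshold:
        return f"{name}: {score:.2f} -> {LOW_SCORE_MEANING[name]}"
    return f"{name}: {score:.2f} -> little impairment on this dimension"


if __name__ == "__main__":
    for dim, value in [("noisiness", 4.5), ("loudness", 1.2)]:
        print(explain(dim, value))
```

Note that the sketch cannot tell "too loud" from "too quiet" for a low Loudness score; that would need a separate level measurement on the audio itself.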

Here is a brief explanation of the different dimensions:

[image: brief explanation of the quality dimensions]

The following graph shows the average loudness predictions of the model vs. the active speech level in dBov. The optimal level is around -26 dBov because most samples in the dataset were normalized to that level (apart from the ones that were given non-optimal loudness on purpose).

[image: average predicted loudness score vs. active speech level in dBov]

Figure source: https://link.springer.com/book/10.1007/978-3-030-91479-0
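As a rough way to check how far a clip sits from the -26 dBov level mentioned above, the sketch below computes a plain RMS level in dBov for samples scaled to [-1.0, 1.0] and the gain needed to reach the target. This is an assumption-laden approximation: it uses RMS over the whole signal, not the active speech level (which ignores pauses), and the function names are hypothetical rather than anything from NISQA.

```python
import math

TARGET_DBOV = -26.0  # level reported in this thread as roughly optimal


def rms_level_dbov(samples):
    """RMS level in dB relative to overload, for floats in [-1.0, 1.0].

    Simplification: averages over ALL samples, so pauses lower the result.
    A true active speech level measurement would exclude them.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")  # digital silence
    return 10.0 * math.log10(mean_square)


def gain_to_target_db(samples, target_dbov=TARGET_DBOV):
    """Gain in dB to apply so that the RMS level reaches the target."""
    return target_dbov - rms_level_dbov(samples)


if __name__ == "__main__":
    # One second of a full-scale 440 Hz sine at 16 kHz as a stand-in signal.
    sine = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
    print(f"level: {rms_level_dbov(sine):.2f} dBov")
    print(f"gain to target: {gain_to_target_db(sine):.2f} dB")
```

A full-scale sine sits about 3 dB below overload on this scale, so it would need roughly 23 dB of attenuation to reach the target.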