Question regarding the speech quality dimensions

gabrielmittag / NISQA

NISQA - Non-Intrusive Speech Quality and TTS Naturalness Assessment

MIT License


Hi,

Thank you for your hard work. I have a few questions, and it would be great if you could find some time to answer them.

nisqa_tts.tar : Does this model take into account discontinuity in synthesized speech (long silence regions within a sentence)?
nisqa_tts.tar : Does this model take into account gibberish (partially failed synthesis, some random gibberish synthesized)?
nisqa.tar : Is it okay to use this model for evaluating the above-mentioned discontinuity in synthesized speech? What level of discontinuity can the model effectively evaluate (for example, a random 2-5 second silence within a sentence)?
nisqa.tar : For the predicted values of the various dimensions, is the scale 0-5? And do all the predicted values follow a 'higher is better' order?

Thanks, Nabarun

Hi,

It will take short discontinuities into account, but I don't think it could recognize long silence regions as a degradation. I have not looked into it though.
The model rates the naturalness of synthesized speech. So if the gibberish sounds unnatural, it should be recognized by the model as a degradation.
The model is focused on transmitted speech, so the discontinuities the model is trained on are rather short, from 20 to maybe 200 ms. Longer silent segments would only be detected if they occur in the middle of a word and introduce a perceivable interruption. This model is not trained on synthesized speech, so I am not sure how it would behave in that case.
The outputs are MOS values from 1-5, where 1 is poor and 5 is excellent.

Best, Gabriel
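
Since the reply above suggests that multi-second silences inside a sentence may not register as a degradation, one possible workaround is to flag them separately before or alongside the MOS prediction. The following is a minimal sketch of a frame-energy silence detector, not part of NISQA. The function name `find_long_silences`, the 20 ms frame size, the -50 dB threshold, and the 1 second minimum duration are all illustrative assumptions that would need tuning on real TTS output. It expects mono samples scaled to the range -1.0 to 1.0.

```python
import math


def find_long_silences(samples, sample_rate, frame_ms=20.0,
                       threshold_db=-50.0, min_silence_s=1.0,
                       interior_only=True):
    """Return (start_s, end_s) spans of sustained low energy.

    Hypothetical helper, not a NISQA API. A frame counts as silent when its
    RMS level (dB relative to full scale 1.0) is below threshold_db.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000.0))
    n_frames = len(samples) // frame_len
    runs = []          # (first_frame, end_frame_exclusive) of silent runs
    run_start = None   # first frame of the silent run currently open

    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        level_db = 20.0 * math.log10(max(rms, 1e-10))  # floor avoids log(0)
        silent = level_db < threshold_db
        if silent and run_start is None:
            run_start = i
        elif not silent and run_start is not None:
            runs.append((run_start, i))
            run_start = None
    if run_start is not None:
        runs.append((run_start, n_frames))

    frame_s = frame_len / sample_rate
    spans = []
    for start, end in runs:
        # Optionally ignore leading/trailing silence, which is not a gap
        # "within a sentence".
        if interior_only and (start == 0 or end == n_frames):
            continue
        if (end - start) * frame_s >= min_silence_s:
            spans.append((start * frame_s, end * frame_s))
    return spans


if __name__ == "__main__":
    # Synthetic check: 1 s tone, 3 s of digital silence, 1 s tone at 16 kHz.
    sr = 16000
    tone = [0.5 * math.sin(2 * math.pi * 220 * t / sr) for t in range(sr)]
    signal = tone + [0.0] * (3 * sr) + tone
    print(find_long_silences(signal, sr))
```

On the synthetic signal in the usage example, the 3 second gap should be reported as a single span running from about 1 s to about 4 s. Real synthesized speech rarely contains exact digital silence, so the threshold would likely need to be raised to match the noise floor of the vocoder in use.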
