gabrielmittag / NISQA

NISQA - Non-Intrusive Speech Quality and TTS Naturalness Assessment
MIT License

MOS results change depending on the audio sequence #12

Closed annahung31 closed 3 years ago

annahung31 commented 3 years ago

Hi,

I noticed something interesting when using the default MOS predictor. When I tested a sample with packet loss, I got a MOS score of 1.66996. The waveform is shown below:

image

Then I cut the first half of the waveform that encountered packet loss and pasted it at the end of the waveform, as shown below:

image

This gave a MOS score of 2.3836, an improvement of 0.7136 relative to the original audio.

But to a human listener, these two audio clips are basically the same, so the MOS predictions should not differ this much...

In the paper, the authors mention that there are several ways of pooling to deal with the time-related information. I'm thinking that maybe that's the reason? Do you have any suggestions for avoiding this kind of situation?

Thanks!
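For anyone who wants to reproduce the comparison described above, here is a minimal sketch. It assumes the audio is already loaded as a plain Python list of samples, and it reads "first half" as the first half of the whole sample. `predict_mos` is a hypothetical callable that maps samples to a score; it is not part of the NISQA API and stands in for whatever predictor you use.

```python
def swap_halves(samples):
    """Return a new list with the first half of `samples` moved to the end.

    Expects a plain Python list; for an odd length, the extra sample
    stays with the second half.
    """
    mid = len(samples) // 2
    return samples[mid:] + samples[:mid]


def mos_gap(samples, predict_mos):
    """Score the original and the reordered signal.

    `predict_mos` is a hypothetical scoring function supplied by the caller.
    Returns (original_score, swapped_score, swapped_score - original_score).
    """
    original = predict_mos(samples)
    swapped = predict_mos(swap_halves(samples))
    return original, swapped, swapped - original
```

Running `mos_gap` over several files shows whether the gap is consistent or specific to one sample.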

gabrielmittag commented 3 years ago

Hi,

Thank you, that is actually interesting. You are right, this is probably related to the pooling and self-attention layers. However, I don't think the difference is that large: in the end, both samples are rated with a MOS of around 2. The model cannot really differentiate between small differences in perceived speech quality, and the output is always a bit random. As soon as you change a small thing in the signal, the output of the model will also change slightly. That said, I am also surprised that it changes that much just because of the sequence order.

Actually, changing the order of sequences could be an interesting approach for data augmentation while training the model. But unfortunately, I do not have a suggestion on how to avoid it. You could just take the average of both outputs? Generally, if you want to evaluate the quality of a certain condition, I would recommend processing several clean speech samples with the same distortion and then taking the average across all files as the per-condition MOS. The results will be much more accurate than those from a single file.

Best, Gabriel
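The per-condition averaging suggested above could be sketched as follows. This is only an illustration, assuming you have already collected one predicted MOS per file together with a label for the distortion condition. The function name and the condition labels are hypothetical and not part of NISQA.

```python
from collections import defaultdict
from statistics import mean


def per_condition_mos(scores):
    """Average per-file MOS predictions into one value per condition.

    `scores` is an iterable of (condition_label, mos) pairs, one per file.
    Returns a dict mapping each condition label to its mean MOS.
    """
    grouped = defaultdict(list)
    for condition, mos in scores:
        grouped[condition].append(mos)
    return {condition: mean(values) for condition, values in grouped.items()}


# The same helper covers the "average of both outputs" idea: treat the two
# orderings of one file as two predictions for the same condition.
both_orders = [("packet_loss", 1.66996), ("packet_loss", 2.3836)]
averaged = per_condition_mos(both_orders)
```

For the two scores reported in this issue, the average comes out at roughly 2.03, in line with the "MOS of around 2" mentioned above.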