gabrielmittag / NISQA

NISQA - Non-Intrusive Speech Quality and TTS Naturalness Assessment
MIT License
The predict result seems not reliable #41

Closed: JohnHerry closed this issue 9 months ago

JohnHerry commented 9 months ago

Hi, thanks for your work. I am looking for a tool to filter bad audio out of an ASR corpus in order to build a TTS dataset. I tried this one; what I care about are noise_pred and discontinuity_pred. I am running it on 16 kHz audio, so I ignored col_pred. The results are frustrating: I checked some samples against their scores, and the scores seem no better than random. Was the model trained on 48 kHz samples? Should we train a 16 kHz version?

gabrielmittag commented 9 months ago

The results should be reasonable unless there is something different in your data. The model is trained to predict the quality of speech transmitted through voice calls, so it might not work on an ASR dataset. The model can predict the quality of 16 kHz audio; it will only be rated slightly lower than 48 kHz audio. In general, the overall MOS is more accurate than the dimension predictions, so you could try relying on the overall MOS only.
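As a rough illustration of relying on the overall MOS only, the sketch below keeps the files whose predicted MOS clears a threshold. The column names `deg` and `mos_pred`, the threshold of 3.5, and the sample rows are all assumptions made for the example; adjust them to whatever the prediction results file actually contains.

```python
import csv
import io

def filter_by_mos(rows, threshold=3.5, mos_column="mos_pred", file_column="deg"):
    """Return the file names whose predicted overall MOS is at least `threshold`.

    The column names and the default threshold are assumptions for this sketch.
    """
    kept = []
    for row in rows:
        try:
            mos = float(row[mos_column])
        except (KeyError, ValueError):
            continue  # skip rows with a missing or non-numeric score
        if mos >= threshold:
            kept.append(row[file_column])
    return kept

# Made-up rows standing in for a prediction results CSV.
example_csv = """deg,mos_pred,noise_pred,discontinuity_pred
a.wav,4.2,4.0,4.1
b.wav,2.1,3.9,1.8
c.wav,3.6,2.5,4.3
"""
rows = list(csv.DictReader(io.StringIO(example_csv)))
kept_files = filter_by_mos(rows, threshold=3.5)
print(kept_files)
```

The dimension scores are deliberately ignored here; a suitable threshold would have to be tuned by listening to samples near the cut-off.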

JohnHerry commented 9 months ago

Thanks for the help. But my results are still not good: a sample with a lower noise score may sound cleaner than one with a higher score. Is it because I am testing on a Mandarin dataset?
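One way to put a number on this kind of mismatch is to rate a small set of samples by ear and compute the rank correlation against the predicted scores. The sketch below is a plain-Python Spearman correlation; the listener ratings and predicted scores are made-up values, not results from this dataset.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5

# Hypothetical data: listener ratings (1 to 5) and predicted noise scores.
listener = [1, 2, 3, 4, 5]
predicted = [2.0, 2.5, 3.1, 3.9, 4.4]
rho = spearman(listener, predicted)
print(round(rho, 3))
```

A value near 0 on a reasonably sized sample would support the "no better than random" impression, while a value near 1 would suggest the scores do track perceived quality.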

gabrielmittag commented 9 months ago

It's hard to say without having the data. Is it a public set? There were Mandarin samples in the training set - not that many, but that should not be the issue.