manjunath7472 closed this issue 3 months ago
Yep, I got the same error. Have you found the issue?
No solution yet?
I have the same issue and investigated it. It appears that the "speaker 0" label on every line is the direct output of the underlying diarization model from the NeMo toolkit: nemo.collections.asr.models.msdd_models.NeuralDiarizer. So the bug is in the NeMo toolkit, not in this library. We might all be better off using pyannote for the diarization.
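If anyone wants to confirm this on their own file, here is a rough sketch that counts speaker labels in the diarizer's RTTM output before this library touches it. It assumes the standard RTTM layout (speaker label in the 8th field, duration in the 5th); the function name and the sample text are made up for illustration, and you would need to read in whatever RTTM file your run actually produced.

```python
from collections import defaultdict


def speaker_durations(rttm_text: str) -> dict[str, float]:
    """Sum speech duration (seconds) per speaker label in RTTM-formatted text."""
    totals: dict[str, float] = defaultdict(float)
    for line in rttm_text.splitlines():
        fields = line.split()
        # Standard RTTM speaker line:
        # SPEAKER <file> <chan> <onset> <dur> <NA> <NA> <speaker> <NA> <NA>
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue  # skip blank or non-speaker lines
        totals[fields[7]] += float(fields[4])
    return dict(totals)


# Hypothetical sample; replace with the contents of your own RTTM file.
sample = """\
SPEAKER demo 1 0.00 2.50 <NA> <NA> speaker_0 <NA> <NA>
SPEAKER demo 1 2.50 1.50 <NA> <NA> speaker_0 <NA> <NA>
"""

durations = speaker_durations(sample)
print(durations)
if len(durations) <= 1:
    print("Diarizer produced at most one speaker label.")
```

If the RTTM already contains a single label, the collapse happens in the diarization step and not in the later alignment or formatting.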
I found that the problem comes from the quality of the NeMo model: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/diar_msdd_telephonic
Since this model is only trained on telephonic speech, diarization performance on other acoustic conditions might show a degraded performance compared to telephonic speech.
I had the same problem with Demucs. Disabling it (--no-stem) helped.
Below is an audio file to reproduce the issue. Audio.
Actual output.