NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Speaker Diarization goes haywire due to small segments of audio #9523

Open AatikaNazneen opened 5 days ago

AatikaNazneen commented 5 days ago

Describe the bug

I have a long audio recording of around 3 hours that spans multiple speakers. When this audio is passed in, speaker diarization labels the whole recording as a single speaker. When I break the audio into parts and pass each part separately, some parts get their speakers assigned correctly, but the remaining parts show the same bug. I identified some 1-minute chunks that cause the model to behave this way whenever they are included in the audio. I'm looking for possible explanations of, or solutions to, this behavior, since I believe the model should be resilient to short segments like these.

Steps/Code to reproduce bug

Test Speaker Diarization on the audio
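
The chunk-isolation procedure described above can be sketched roughly as follows. This is only a minimal illustration, not the code used for this report: `chunk_bounds`, `find_suspect_chunks`, and the `diarizes_ok` callback are hypothetical names, and the callback stands in for whatever step runs diarization on the kept chunks and checks that more than one speaker is found. It also assumes a single chunk is enough to trigger the failure, so removing that one chunk restores correct output.

```python
import math
from typing import Callable, List, Tuple

Chunk = Tuple[float, float]  # (start_sec, end_sec)


def chunk_bounds(total_sec: float, chunk_sec: float = 60.0) -> List[Chunk]:
    """Split [0, total_sec) into consecutive chunks of at most chunk_sec seconds."""
    if total_sec <= 0 or chunk_sec <= 0:
        raise ValueError("total_sec and chunk_sec must be positive")
    total = float(total_sec)
    n_chunks = math.ceil(total / chunk_sec)
    return [
        (i * chunk_sec, min((i + 1) * chunk_sec, total))
        for i in range(n_chunks)
    ]


def find_suspect_chunks(
    bounds: List[Chunk],
    diarizes_ok: Callable[[List[Chunk]], bool],
) -> List[Chunk]:
    """Leave-one-out search: a chunk is suspect if dropping it fixes the output.

    diarizes_ok is a hypothetical, user-supplied callback that rebuilds the
    audio from the kept chunks, runs diarization, and returns True when the
    result looks correct (for example, more than one speaker is detected).
    """
    suspects = []
    for i, chunk in enumerate(bounds):
        kept = bounds[:i] + bounds[i + 1:]
        if diarizes_ok(kept):
            suspects.append(chunk)
    return suspects
```

For a 3-hour recording, `chunk_bounds(3 * 3600)` yields 180 one-minute chunks, so the leave-one-out search runs diarization 180 times; testing halves of the recording first would cut that down if the runtime is a concern.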

Expected behavior

A clear and concise description of what you expected to happen.

Environment overview (please complete the following information)

Environment details

Additional context

GPU model