A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
Describe the bug
I have a long audio recording of around 3 hours with multiple speakers. When I pass the whole recording to speaker diarization, it labels everything as a single speaker. When I split the recording into parts and pass each part separately, some parts get correct speaker labels, but the rest show the same bug. I have identified some 1-minute chunks that, when included in the audio, cause the model to behave this way. I'm looking for possible explanations or solutions, since I believe the model should be robust to this.
Steps/Code to reproduce bug
Test Speaker Diarization on the audio
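The splitting step described above can be sketched as follows. This is a minimal illustration, not the script used for this report: it assumes the recording is an uncompressed PCM WAV file, and `split_wav` and its parameters are hypothetical names. It uses only the Python standard library and makes no NeMo calls.

```python
import wave
from pathlib import Path


def split_wav(src, out_dir, chunk_seconds=60.0):
    """Write consecutive fixed-length chunks of a PCM WAV file and return their paths."""
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    with wave.open(str(src), "rb") as reader:
        params = reader.getparams()
        frames_per_chunk = int(chunk_seconds * reader.getframerate())
        if frames_per_chunk <= 0:
            raise ValueError("chunk_seconds is too small for this sample rate")
        index = 0
        while True:
            frames = reader.readframes(frames_per_chunk)
            if not frames:
                break  # reached the end of the recording
            path = out_dir / f"{src.stem}_{index:04d}.wav"
            with wave.open(str(path), "wb") as writer:
                # Copy the audio format; the frame count in the header
                # is corrected automatically when the file is closed.
                writer.setparams(params)
                writer.writeframes(frames)
            paths.append(path)
            index += 1
    return paths
```

Each resulting chunk can then be diarized on its own, and chunks can be concatenated back one at a time to narrow down which 1-minute segments trigger the single-speaker output.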
Expected behavior
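The diarizer should assign distinct labels to the multiple speakers in the recording, both when the full 3-hour audio is passed and when any part of it is passed separately.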
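A quick way to spot the collapse to a single speaker is to count the distinct speaker labels in each diarization output. The sketch below assumes the output is available as text in the standard RTTM format, where the eighth whitespace-separated field of a `SPEAKER` line is the speaker name; `speakers_in_rttm` is a hypothetical helper, not part of any library.

```python
def speakers_in_rttm(rttm_text):
    """Return the set of speaker labels found in RTTM-formatted text."""
    speakers = set()
    for line in rttm_text.splitlines():
        fields = line.split()
        # RTTM fields: type, file id, channel, onset, duration,
        # orthography, speaker type, speaker name, confidence, lookahead.
        if len(fields) >= 8 and fields[0] == "SPEAKER":
            speakers.add(fields[7])
    return speakers


example = (
    "SPEAKER chunk_0000 1 0.00 3.20 <NA> <NA> speaker_0 <NA> <NA>\n"
    "SPEAKER chunk_0000 1 3.20 2.10 <NA> <NA> speaker_1 <NA> <NA>\n"
)
print(len(speakers_in_rttm(example)))
```

A chunk known to contain several speakers but yielding only one label here is a candidate for the problematic segments described in this report.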
Environment overview (please complete the following information)
Environment details
Additional context
GPU model