[Bug]: No guarantee that assigned word-level speaker is same as assigned utterance-level speaker

Contact Details

No response

What happened?

The speaker assigned to a word is not guaranteed to match the speaker assigned to the utterance that contains it. The test data shows this for the utterance "I only need a few months.": the utterance-level speaker is "SPEAKER_00", but the word-level speaker for "I" is "SPEAKER_01" (see the log output below).

The consequences are unclear, but I am documenting the behavior in case it becomes an issue later. A future release could add a sanity check on the output to ensure that word-level speaker assignments are consistent with the assignment of their parent utterance.
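
A rough sketch of what such a check might look like, assuming each segment is a dict shaped like the one in the log output below (a top-level "speaker" key plus a "words" list whose entries may carry their own "speaker"). The helper name `find_speaker_mismatches` is hypothetical and not part of fave-asr.

```python
def find_speaker_mismatches(segment):
    """Return (index, word, word_speaker) for every word whose speaker
    label differs from the speaker label of the enclosing segment.

    Hypothetical helper: assumes the segment dict layout from the log output.
    """
    parent_speaker = segment.get("speaker")
    mismatches = []
    for index, word in enumerate(segment.get("words", [])):
        word_speaker = word.get("speaker")
        # Words with no speaker label are skipped rather than flagged.
        if word_speaker is not None and word_speaker != parent_speaker:
            mismatches.append((index, word.get("word"), word_speaker))
    return mismatches


# Trimmed copy of the segment from the log output (first two words only).
segment = {
    "start": 8.923,
    "end": 9.96,
    "text": " I only need a few months.",
    "words": [
        {"word": "I", "start": 8.923, "end": 9.147, "score": 0.865, "speaker": "SPEAKER_01"},
        {"word": "only", "start": 9.167, "end": 9.33, "score": 0.513, "speaker": "SPEAKER_00"},
    ],
    "speaker": "SPEAKER_00",
}

print(find_speaker_mismatches(segment))  # [(0, 'I', 'SPEAKER_01')]
```

Whether a mismatch should raise an error, log a warning, or be silently corrected is an open design question; the sketch only reports the inconsistencies.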

What operating system are you using?

Ubuntu

Relevant log output

{"start": 8.923, "end": 9.96, "text": " I only need a few months.", "words": [{"word": "I", "start": 8.923, "end": 9.147, "score": 0.865, "speaker": "SPEAKER_01"}, {"word": "only", "start": 9.167, "end": 9.33, "score": 0.513, "speaker": "SPEAKER_00"}, {"word": "need", "start": 9.35, "end": 9.472, "score": 0.895, "speaker": "SPEAKER_00"}, {"word": "a", "start": 9.492, "end": 9.512, "score": 0.989, "speaker": "SPEAKER_00"}, {"word": "few", "start": 9.553, "end": 9.695, "score": 0.929, "speaker": "SPEAKER_00"}, {"word": "months.", "start": 9.736, "end": 9.96, "score": 0.575, "speaker": "SPEAKER_00"}], "speaker": "SPEAKER_00"},

Code of Conduct

[X] I agree to follow this project's Code of Conduct

Forced-Alignment-and-Vowel-Extraction / fave-asr