From manual inspection, we have interviews, university lectures, plays, spoken histories, city hall proceedings, and other forms of labeled audio where there are multiple speakers.
This is interesting because it is well-known that ASR systems struggle when there are multiple speakers.
It would be good to get a handle on how many speakers are in each of our audio tracks, as well as when each of them is speaking. This subfield is known as speaker diarization, and as a start I'm not sure how strong its state of the art is. Historically, people did this via k-means clustering, where each speaker is a cluster, but k (the number of speakers) is a hyperparameter you have to set up front. I don't know whether there are better mechanisms today for automatically proposing the number of speakers; one possible approach is sketched below.
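For the "how many speakers" question, here's a minimal sketch of one common alternative to fixing k by hand: cluster per-segment speaker embeddings and pick the k that maximizes the silhouette score. This assumes you already have one embedding vector per short speech segment (e.g. from a pretrained speaker-embedding model); the function name, embedding dimension, and k range here are all illustrative, not from any particular toolkit.

```python
# Sketch: estimate the number of speakers by clustering speaker embeddings
# and keeping the k with the best silhouette score. Assumes `embeddings`
# is an (n_segments, dim) array, one row per short speech segment.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def estimate_num_speakers(embeddings: np.ndarray, max_speakers: int = 10) -> int:
    """Pick k in [2, max_speakers] that maximizes the silhouette score."""
    best_k, best_score = 2, -1.0
    for k in range(2, min(max_speakers, len(embeddings) - 1) + 1):
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Toy usage: three synthetic "speakers" in embedding space.
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 192)) * 5  # 192-dim, a common embedding size
fake = np.vstack([c + rng.normal(size=(40, 192)) for c in centers])
print(estimate_num_speakers(fake))       # should print 3
```

For what it's worth, off-the-shelf diarization pipelines such as pyannote.audio bundle the whole chain (segmentation, embedding, clustering) and estimate the speaker count automatically, so we may not need to build this ourselves; spectral clustering with an eigengap criterion is another standard way to propose the number of speakers.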
This is very open-ended, and I'm out of my depth here.