galv / lingvo-copy

Apache License 2.0

Experiment with Speaker Diarization #16

Open galv opened 3 years ago

galv commented 3 years ago

From manual inspection, we have interviews, university lectures, plays, spoken histories, city hall proceedings, and other forms of labeled audio where there are multiple speakers.

This is interesting because it is well known that ASR systems struggle when there are multiple speakers.

It would be good to get a handle on how many speakers are in each of our audio tracks, as well as when they are speaking. As a start, I'm not quite sure how good the state of the art is for this subfield (known as speaker diarization). In the past, people did this via k-means clustering, where each speaker is a cluster, but k (the number of speakers) is a hyperparameter. I'm not sure whether there are better mechanisms today for automatically proposing the number of speakers.
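To make the k-means idea concrete, here is a minimal sketch in plain Python. It is only an illustration under several assumptions: each audio segment is assumed to have already been turned into a fixed-length embedding (the 2-D points below are made-up toy values, not real speaker embeddings), and k is picked by trying a few values and keeping the one with the highest mean silhouette score. That is one common heuristic for choosing k, not necessarily what current diarization systems do. The function names (`kmeans`, `silhouette`, `estimate_speakers`) are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iters=50):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette score; higher means tighter, better-separated clusters."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster or only one cluster
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def estimate_speakers(points, max_k=5):
    """Try k = 2..max_k and keep the clustering with the best silhouette."""
    best_k, best_labels, best_score = None, None, -2.0
    for k in range(2, min(max_k, len(points) - 1) + 1):
        labels = kmeans(points, k)
        score = silhouette(points, labels)
        if score > best_score:
            best_k, best_labels, best_score = k, labels, score
    return best_k, best_labels


# Toy stand-ins for per-segment speaker embeddings (assumed, not real data):
# three well-separated groups of three segments each.
embeddings = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
    (0.1, 5.0), (0.0, 5.2), (0.2, 4.9),
]

num_speakers, segment_labels = estimate_speakers(embeddings)
print(num_speakers, segment_labels)
```

Since each label belongs to a segment with known start and end times, the same labels would also answer the "when are they speaking" part. Note that this sketch cannot return k = 1, so single-speaker tracks would need a separate check.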

This is very open-ended, and I'm out of my depth here.