bugbakery / audapolis

an editor for spoken-word audio with automatic transcription
GNU Affero General Public License v3.0

Speaker diarization with pyannote.audio? #366

Open hbredin opened 2 years ago

hbredin commented 2 years ago

I am the creator of pyannote.audio speaker diarization toolkit.

I understand that you went with @josepatino's PyBK because of its speed but I'd love to see pyannote.audio pretrained pipeline integrated into audapolis.

Would that be of interest to you? I'd love to help in some way!

pajowu commented 2 years ago

Hey, thanks for reaching out. We would be interested in integrating this 100%. When I started looking at speaker diarization I also noticed pyannote.audio, but as there was no pretrained pipeline at the time we decided against it.

pajowu commented 2 years ago

Do you think it would be possible to extend the SpeakerDiarization pipeline to report not only the individual steps of the pipeline via the hook, but also the progress within those steps? This would be a huge benefit for us.

hbredin commented 2 years ago

I have been meaning to add this kind of progress hook for the online demo but it never really reached the top of my priority list.

These are the two steps that account for most of the processing time:

hbredin commented 2 years ago

FYI, I just released a much faster/more accurate version of pyannote.audio speaker diarization pipeline. It still does not expose the progress of the individual steps but this is now on my TODO list (though with no ETA).

pajowu commented 1 year ago

Wow, I just tried it (and opened https://github.com/pyannote/pyannote-audio/pull/1185/files for the progress). The results are really impressive 😍

pajowu commented 1 year ago

I started integrating it and stumbled upon a problem which I'm currently not sure how to solve, so if you have any idea @hbredin, I would be very interested: audapolis currently works on the assumption that there is only one "speaker" at any time. pyannote-audio, on the other hand, supports multiple speakers at the same time. It therefore produces overlaps between the speakers.

Since changing audapolis to support multiple speakers is too much for now, I'm trying to "flatten" the output of pyannote-audio to one speaker at a time. Do you have a suggestion on how to do that properly?

hbredin commented 1 year ago

Nothing built into pyannote comes to mind.

You'd have to postprocess the pyannote.core.Annotation instance returned by the pipeline:

  1. remove any segment fully contained by a larger segment

     [------A-------]   ==> [------A-------]
         [--B--]

  2. split partially overlapping segments in two halves

     [----A----]          ==> [----A--]
          [----B----]                 [--B----]
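A minimal sketch of that post-processing, written against plain `(start, end, speaker)` tuples rather than the actual `pyannote.core.Annotation` API (the helper name and splitting-at-the-midpoint choice are assumptions, not pyannote built-ins):

```python
def flatten(segments):
    """Resolve overlaps so at most one speaker is active at a time.

    segments: list of (start, end, speaker) tuples, sorted by start time.
    A segment fully contained in the previous one is dropped; a partial
    overlap is split at the midpoint of the overlapping region.
    """
    result = []
    for start, end, speaker in segments:
        if result:
            p_start, p_end, p_speaker = result[-1]
            if end <= p_end:
                # fully contained in the previous segment: drop it
                continue
            if start < p_end:
                # partial overlap: split at the midpoint of the overlap
                mid = (start + p_end) / 2
                result[-1] = (p_start, mid, p_speaker)
                start = mid
        result.append((start, end, speaker))
    return result
```

For example, `flatten([(0, 10, "A"), (4, 16, "B")])` splits the 4–10 overlap at 7.0, yielding `[(0, 7.0, "A"), (7.0, 16, "B")]`.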

Or you could clip the output of the speaker counting step to be at most 1.

count.data = np.clip(count.data, 0, 1)

should do the trick...
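For illustration, here is what that clipping does on a standalone NumPy array (a made-up per-frame speaker-count vector, not the pipeline's actual `count` object):

```python
import numpy as np

# hypothetical per-frame speaker counts from the counting step
count = np.array([0, 1, 2, 3, 1, 0])

# cap the count at 1 so the downstream assignment step never
# places more than one active speaker in any frame
clipped = np.clip(count, 0, 1)
print(clipped)  # [0 1 1 1 1 0]
```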