rohitpaturi closed this issue 1 year ago
Thanks!
If what you want is essentially an RTTM file, you want to obtain a pyannote.core Annotation object, which you can get with the Binarize class:
from pyannote.audio.utils.signal import Binarize
from pyannote.core import SlidingWindowFeature, SlidingWindow, Segment, Annotation
# .... your code
seg = Segment(0,20)
model_sw: SlidingWindow = model.example_output.frames
# we need to create the sliding window ourselves because window="whole" returns a np.ndarray; this shouldn't be necessary with window="sliding"
result_2_sw = SlidingWindow(model_sw.duration, start=seg.start, step=model_sw.step)
swf = SlidingWindowFeature(result_2, result_2_sw)
ann: Annotation = Binarize()(swf) # Binarize can take onset/offset and other params
print(list(ann.itertracks())) # [(<Segment(0.109, 5.609)>, 2), ...]
And that should be it. I'm not a hundred percent sure it's correct, or that it's exactly what you want to do, but it seems to do the trick.
Disclaimer (just to be sure): know that with window="whole", you aren't using the model for what it was trained to do (i.e. you're using it as a fully end-to-end (EEND-style) model when it was trained to work on 5-second windows). So you might get poor performance (and you'll need to have at most 3 speakers in your window).
The intended way to proceed is to use the pyannote pipeline to stitch the sliding windows back together with clustered speaker identities (the pipeline does everything automatically).
But as far as I know there is currently no easy way to obtain (averaged & aligned) posteriors from the pipeline, so using Inference might indeed be the most convenient option if that's what you want.
Thanks for the detailed guidance! One follow-up question: if the swf sent to the Binarize() class has shape [chunk_num, frame_num, speaker_num], should we process swf chunk by chunk, since the input shape to Binarize() is (num_frames, num_speakers)?
In that case you're dealing with the output of a sliding window: you applied the model num_chunk times to the audio file and got num_chunk 5-second outputs. But since we're dealing with the speaker diarization task, these local outputs might have different permutations of speaker identities.
This explanation is not specific to the powerset paper, this is simply how pyannote.audio approaches speaker diarization. For example, reusing the image from your previous issue: these are three chunks/windows with overlap. We can clearly see that the highlighted outputs should belong to the same speaker, but they don't. Since the goal of the local segmentation model is to perform local speaker diarization, it is not concerned with consistent identities, as the SD task is invariant to speaker identity.
That's why we use a "pipeline" to process files longer than 5 seconds: we need the identities of all local chunks to be consistent so that we can stitch/average them back together to get the full output. This is done with an embedding extraction + clustering step.
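To make the permutation problem concrete, here is a toy sketch (not pyannote's actual method, which uses embeddings and clustering) of aligning two overlapping chunks by brute-forcing the speaker permutation that best agrees on the shared frames; all names and shapes here are illustrative:

```python
import itertools
import numpy as np

def best_permutation(ref_overlap, hyp_overlap):
    """Toy alignment: find the speaker permutation of hyp_overlap that
    best matches ref_overlap on the shared frames.
    Both arrays have shape (num_overlap_frames, num_speakers)."""
    num_speakers = ref_overlap.shape[1]
    best_perm, best_err = None, np.inf
    for perm in itertools.permutations(range(num_speakers)):
        err = np.abs(ref_overlap - hyp_overlap[:, perm]).sum()
        if err < best_err:
            best_perm, best_err = perm, err
    return best_perm

# two overlapping 3-speaker chunks with the same activity but shuffled identities
chunk_a = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0]], dtype=float)
chunk_b = chunk_a[:, [2, 0, 1]]  # columns permuted, as a second chunk might be
perm = best_permutation(chunk_a, chunk_b)
print(perm)  # (1, 2, 0): reordering chunk_b's speakers recovers chunk_a
```

The real pipeline avoids this brute force (which grows factorially with speaker count) by extracting a speaker embedding per local speaker and clustering embeddings across chunks.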
You can find the full details of the pipeline in: PYANNOTE.AUDIO 2.1 SPEAKER DIARIZATION PIPELINE: PRINCIPLE, BENCHMARK, AND RECIPE
That's a very clear explanation and makes sense. One approach I can come up with is to map the local posteriors to global ones.
Thanks for the response, this helps
Closing the issue, reopen if you have more related questions :)
@FrenchKrab Regarding this question, if I would like to map local posteriors (the segmentation output before Binarize) to global speaker IDs (clusters), would this step help? https://github.com/pyannote/pyannote-audio/blob/0b45103cb228a81a9d9d776cca92694cb30ddb41/pyannote/audio/pipelines/speaker_diarization.py#L416
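For intuition, the linked reconstruction step conceptually scatters each chunk's local speaker activations into global cluster slots. A minimal numpy sketch of that idea (the shapes and the hard_clusters array are hypothetical, not pyannote's exact internals):

```python
import numpy as np

# hypothetical shapes: local posteriors and a hard assignment of each
# chunk-local speaker to a global cluster
num_chunks, num_frames, num_local = 2, 4, 3
num_clusters = 2

local = np.zeros((num_chunks, num_frames, num_local))
local[0, :, 0] = 1.0  # chunk 0: local speaker 0 is active
local[1, :, 1] = 1.0  # chunk 1: local speaker 1 is active

# hard_clusters[c, s] = global cluster of local speaker s in chunk c
hard_clusters = np.array([[0, 1, 1],
                          [1, 0, 0]])

# scatter local posteriors into global cluster slots, keeping the max
# when several local speakers map to the same cluster
global_post = np.zeros((num_chunks, num_frames, num_clusters))
for c in range(num_chunks):
    for s in range(num_local):
        k = hard_clusters[c, s]
        global_post[c, :, k] = np.maximum(global_post[c, :, k], local[c, :, s])

print(global_post[0, 0], global_post[1, 0])  # both chunks now activate cluster 0
```

After this step, overlapping chunks can be averaged frame by frame because their speaker axes now refer to the same global identities.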
Btw, is there any way we can convert powerset multi-class posteriors to vanilla multi-label posteriors, since the output from the powerset segmentation model already seems to be binarized?
I appreciate your help.
For the second question, I think I can set the soft parameter to True in https://github.com/pyannote/pyannote-audio/blob/0b45103cb228a81a9d9d776cca92694cb30ddb41/pyannote/audio/utils/powerset.py#L87
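For reference, the general idea behind a powerset-to-multilabel conversion can be sketched in a few lines (this is an illustrative numpy reimplementation, not pyannote's Powerset class, which works on torch tensors): each powerset class corresponds to a subset of speakers, so decoding is a lookup into a binary class-to-speakers matrix.

```python
import itertools
import numpy as np

def powerset_mapping(num_speakers, max_set_size):
    """Build the (num_classes, num_speakers) binary matrix mapping each
    powerset class to its set of active speakers (incl. the empty set)."""
    classes = []
    for size in range(max_set_size + 1):
        for combo in itertools.combinations(range(num_speakers), size):
            row = np.zeros(num_speakers)
            row[list(combo)] = 1.0
            classes.append(row)
    return np.stack(classes)

# 3 speakers, at most 2 simultaneously active -> 7 classes:
# {}, {0}, {1}, {2}, {0,1}, {0,2}, {1,2}
mapping = powerset_mapping(num_speakers=3, max_set_size=2)

# hard decoding: argmax over powerset classes, then look up the speaker set
powerset_logits = np.array([[0.1, 2.0, 0.2, 0.1, 0.3, 0.1, 0.1]])  # (frames, classes)
multilabel = mapping[powerset_logits.argmax(axis=-1)]
print(multilabel)  # frame assigned to the class {speaker 0} -> [[1. 0. 0.]]
```

A "soft" variant would instead multiply the per-class probabilities (softmax of the logits) by this mapping matrix, yielding continuous per-speaker posteriors rather than a hard lookup.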
First of all, kudos on the great work and thanks for open-sourcing it! I have a question related to the segmentation model's frame resolution. I see that the Inference pipeline provides decisions/posteriors at the frame level. How can I convert the frame-level decisions into temporal decisions? This is mainly needed when operating in the 'whole' mode as below
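For what it's worth, the frame-to-time conversion is simple arithmetic once you know the model's frame duration and step (in pyannote these come from model.example_output.frames, a SlidingWindow). A minimal self-contained sketch, where the numeric values are assumptions for illustration, not the model's real resolution:

```python
# Minimal sketch of the frame -> time mapping that pyannote.core's
# SlidingWindow encodes. The values below are illustrative assumptions;
# the real ones come from model.example_output.frames.
frame_duration = 0.016875   # seconds covered by one output frame (assumed)
frame_step = 0.016875       # hop between consecutive frames (assumed)
window_start = 0.0          # start time of the processed window

def frame_to_segment(i):
    """Time span (start, end) of output frame i, in seconds."""
    start = window_start + i * frame_step
    return start, start + frame_duration

def frame_midpoint(i):
    """Center time of frame i, often used as its timestamp."""
    s, e = frame_to_segment(i)
    return (s + e) / 2

print(frame_to_segment(0))   # span of the very first frame
print(frame_midpoint(100))   # timestamp of frame 100
```

This is exactly what wrapping the posteriors in a SlidingWindowFeature (as in the snippet earlier in the thread) gives you for free, along with crop/align utilities.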