rohitpaturi closed this issue 1 year ago
Thanks!
If what you want is essentially an RTTM file, you want to obtain a pyannote.core Annotation object, which you can get with the Binarize class:
from pyannote.audio.utils.signal import Binarize
from pyannote.core import SlidingWindowFeature, SlidingWindow, Segment, Annotation
# .... your code
seg = Segment(0,20)
model_sw: SlidingWindow = model.example_output.frames
# we need to create the sliding window ourselves because window="whole" returns a np.ndarray; this shouldn't be necessary with window="sliding"
result_2_sw = SlidingWindow(model_sw.duration, start=seg.start, step=model_sw.step)
swf = SlidingWindowFeature(result_2, result_2_sw)
ann: Annotation = Binarize()(swf) # Binarize can take onset/offset and other params
print(list(ann.itertracks())) # [(<Segment(0.109, 5.609)>, 2), ...]
And that should be it. I'm not a hundred percent sure it's correct, or that it's exactly what you want to do, but it seems to do the trick.
Disclaimer (just to be sure): know that with window="whole", you aren't using the model for what it was trained to do (i.e. you're using it as a fully end-to-end (EEND-style) model when it was trained to work on 5-second windows). So you might get poor performance (and you'll need to have at most 3 speakers in your window).
The intended way to proceed is to use the pyannote pipeline to stitch the sliding windows back together with clustered speaker identities (the pipeline does everything automatically).
But as far as I know there is currently no easy way to obtain (averaged & aligned) posteriors from the pipeline, so using Inference might indeed be the most convenient option if that's what you want.
Thanks for the detailed guidance! One follow-up question: if the swf sent to the Binarize() class has shape [chunk_num, frame_num, speaker_num], should we process swf chunk by chunk, since the input shape to Binarize() is (num_frames, num_speakers)?
In that case you're dealing with the output of a sliding window: you applied the model num_chunk times to the audio file and got num_chunk 5-second outputs. But since we're dealing with the speaker diarization task, these local outputs might have different permutations of speaker identities.
This explanation is not specific to the powerset paper, this is simply how pyannote.audio approaches speaker diarization. For example, reusing the image from your previous issue: these are three chunks/windows with overlap. We can clearly see that the highlighted outputs should belong to the same speaker, but they don't. Since the goal of the local segmentation model is to perform local speaker diarization, it is not concerned with consistent identities, as the SD task is invariant to speaker identity.
That's why we use a "pipeline" to process files longer than 5 seconds: we need the identities of all local chunks to be consistent so that we can stitch/average them back together to get the full output. This is done with an embedding extraction + clustering step.
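To make the permutation problem concrete, here is a toy sketch (not pyannote's actual method, which uses embeddings and clustering) of aligning two overlapping chunks by brute-forcing the speaker permutation that best agrees on the shared frames; all names and shapes here are illustrative:

```python
import itertools
import numpy as np

def best_permutation(ref_overlap, hyp_overlap):
    """Toy alignment: find the speaker permutation of hyp_overlap that
    best matches ref_overlap on the shared frames.
    Both arrays have shape (num_overlap_frames, num_speakers)."""
    num_speakers = ref_overlap.shape[1]
    best_perm, best_err = None, np.inf
    for perm in itertools.permutations(range(num_speakers)):
        err = np.abs(ref_overlap - hyp_overlap[:, perm]).sum()
        if err < best_err:
            best_perm, best_err = perm, err
    return best_perm

# two overlapping 3-speaker chunks with the same activity but shuffled identities
chunk_a = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0]], dtype=float)
chunk_b = chunk_a[:, [2, 0, 1]]  # columns permuted, as a second chunk might be
perm = best_permutation(chunk_a, chunk_b)
print(perm)  # (1, 2, 0): reordering chunk_b's speakers recovers chunk_a
```

The real pipeline avoids this brute force (which grows factorially with speaker count) by extracting a speaker embedding per local speaker and clustering embeddings across chunks.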
You can find the full details of the pipeline in: PYANNOTE.AUDIO 2.1 SPEAKER DIARIZATION PIPELINE: PRINCIPLE, BENCHMARK, AND RECIPE
That's a very clear explanation and makes sense. One approach I can come up with is to map the local posteriors to global ones.
Thanks for the response, this helps
Closing the issue, reopen if you have more related questions :)
@FrenchKrab Regarding this question, if I would like to map local posteriors (the segmentation output before Binarize) to global speaker IDs (clusters), would this step help? https://github.com/pyannote/pyannote-audio/blob/0b45103cb228a81a9d9d776cca92694cb30ddb41/pyannote/audio/pipelines/speaker_diarization.py#L416
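For intuition, the linked reconstruction step conceptually scatters each chunk's local speaker activations into global cluster slots. A minimal numpy sketch of that idea (the shapes and the hard_clusters array are hypothetical, not pyannote's exact internals):

```python
import numpy as np

# hypothetical shapes: local posteriors and a hard assignment of each
# chunk-local speaker to a global cluster
num_chunks, num_frames, num_local = 2, 4, 3
num_clusters = 2

local = np.zeros((num_chunks, num_frames, num_local))
local[0, :, 0] = 1.0  # chunk 0: local speaker 0 is active
local[1, :, 1] = 1.0  # chunk 1: local speaker 1 is active

# hard_clusters[c, s] = global cluster of local speaker s in chunk c
hard_clusters = np.array([[0, 1, 1],
                          [1, 0, 0]])

# scatter local posteriors into global cluster slots, keeping the max
# when several local speakers map to the same cluster
global_post = np.zeros((num_chunks, num_frames, num_clusters))
for c in range(num_chunks):
    for s in range(num_local):
        k = hard_clusters[c, s]
        global_post[c, :, k] = np.maximum(global_post[c, :, k], local[c, :, s])

print(global_post[0, 0], global_post[1, 0])  # both chunks now activate cluster 0
```

After this step, overlapping chunks can be averaged frame by frame because their speaker axes now refer to the same global identities.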
Btw, is there any way we can convert powerset multi-class posteriors to vanilla multi-label posteriors, since the output from the powerset segmentation model already seems to be binarized?
I appreciate your help.
For the second question, I think I can set the soft parameter to True in https://github.com/pyannote/pyannote-audio/blob/0b45103cb228a81a9d9d776cca92694cb30ddb41/pyannote/audio/utils/powerset.py#L87
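For reference, the general idea behind a powerset-to-multilabel conversion can be sketched in a few lines (this is an illustrative numpy reimplementation, not pyannote's Powerset class, which works on torch tensors): each powerset class corresponds to a subset of speakers, so decoding is a lookup into a binary class-to-speakers matrix.

```python
import itertools
import numpy as np

def powerset_mapping(num_speakers, max_set_size):
    """Build the (num_classes, num_speakers) binary matrix mapping each
    powerset class to its set of active speakers (incl. the empty set)."""
    classes = []
    for size in range(max_set_size + 1):
        for combo in itertools.combinations(range(num_speakers), size):
            row = np.zeros(num_speakers)
            row[list(combo)] = 1.0
            classes.append(row)
    return np.stack(classes)

# 3 speakers, at most 2 simultaneously active -> 7 classes:
# {}, {0}, {1}, {2}, {0,1}, {0,2}, {1,2}
mapping = powerset_mapping(num_speakers=3, max_set_size=2)

# hard decoding: argmax over powerset classes, then look up the speaker set
powerset_logits = np.array([[0.1, 2.0, 0.2, 0.1, 0.3, 0.1, 0.1]])  # (frames, classes)
multilabel = mapping[powerset_logits.argmax(axis=-1)]
print(multilabel)  # frame assigned to the class {speaker 0} -> [[1. 0. 0.]]
```

A "soft" variant would instead multiply the per-class probabilities (softmax of the logits) by this mapping matrix, yielding continuous per-speaker posteriors rather than a hard lookup.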
First of all, kudos on the great work and thanks for open-sourcing it! I have a question related to the segmentation model's frame resolution. I see that the Inference pipeline provides decisions/posteriors at the frame level. How can I convert the frame-level decisions into temporal decisions? This is mainly needed when operating in the 'whole' mode as below
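For what it's worth, the frame-to-time conversion is simple arithmetic once you know the model's frame duration and step (in pyannote these come from model.example_output.frames, a SlidingWindow). A minimal self-contained sketch, where the numeric values are assumptions for illustration, not the model's real resolution:

```python
# Minimal sketch of the frame -> time mapping that pyannote.core's
# SlidingWindow encodes. The values below are illustrative assumptions;
# the real ones come from model.example_output.frames.
frame_duration = 0.016875   # seconds covered by one output frame (assumed)
frame_step = 0.016875       # hop between consecutive frames (assumed)
window_start = 0.0          # start time of the processed window

def frame_to_segment(i):
    """Time span (start, end) of output frame i, in seconds."""
    start = window_start + i * frame_step
    return start, start + frame_duration

def frame_midpoint(i):
    """Center time of frame i, often used as its timestamp."""
    s, e = frame_to_segment(i)
    return (s + e) / 2

print(frame_to_segment(0))   # span of the very first frame
print(frame_midpoint(100))   # timestamp of frame 100
```

This is exactly what wrapping the posteriors in a SlidingWindowFeature (as in the snippet earlier in the thread) gives you for free, along with crop/align utilities.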