fleek / VADtransciber

Speaker Diarization #2

Open mxzgithub opened 1 year ago

mxzgithub commented 1 year ago

I get a lot of "Speaker?" labels in the final file and I do not know how to improve this. Could you give a few tips on how to work with the pipeline?

fleek commented 1 year ago

Because of my specific requirements, I need speaker identification. Do you need that as well?

mxzgithub commented 1 year ago

This is the main reason I use your transcriber. Thank you for making the pipeline public. I tested various files with 4 to 7 speakers. In one sample I get about 1000 numbered speaker lines and about 200 with a question mark. About 15 to 20% of the speaker lines are not numbered. Do you have any tips on how this can be improved?

Further question: do you change the generic speaker numbers to names later in the process? I considered writing a module to rename the numbered speakers in a file and then rename all the "Speaker?" lines one by one. There are just too many of them to go through at the moment.

fleek commented 1 year ago

Okay, the main reason you get <Speaker?> is overlapping conversation: the program is not able to assign a single speaker to that chunk. The second reason is that when one utterance follows another too closely, without a long enough silence interval between them, the chunk cannot be split in two, and it then becomes a case of reason number one.

My specific requirement is to document even filler words, so I cannot ignore non-speech sounds. Perhaps you can add another layer of SAD (Speech Activity Detection) for further filtering.

For my use case I do not need to change the speaker labels to actual names, but you can definitely write a module to change the speakers to proper names.
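A renaming module along those lines could be sketched as follows. This is only an illustration, not part of the pipeline: it assumes the numbered labels appear in the output text as "Speaker N" and the unassigned ones as "Speaker?", and the function name `rename_speakers` and its arguments are hypothetical.

```python
import re


def rename_speakers(text, names, unknown_label=None):
    """Replace 'Speaker N' labels using `names`; optionally relabel 'Speaker?'.

    Assumes (hypothetically) that labels look like 'Speaker 3' and 'Speaker?'.
    """
    def swap(match):
        number = int(match.group(1))
        # Labels without an entry in `names` are left unchanged.
        return names.get(number, match.group(0))

    renamed = re.sub(r"Speaker\s*(\d+)", swap, text)
    if unknown_label is not None:
        renamed = renamed.replace("Speaker?", unknown_label)
    return renamed


sample = "Speaker 1: hello\nSpeaker?: mm-hm\nSpeaker 2: hi\nSpeaker 10: bye"
print(rename_speakers(sample, {1: "Alice", 2: "Bob"}, unknown_label="Unknown"))
```

The "Speaker?" lines still need a manual pass to get real names, but relabelling the numbered speakers first leaves fewer lines to check by hand.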

    st = get_speech_timestamps(wav, smodel,
                               threshold=0.65,               # 0.5
                               sampling_rate=16000,
                               min_speech_duration_ms=5,     # 250
                               min_silence_duration_ms=100,  # 100
                               window_size_samples=1536,     # this is fixed
                               speech_pad_ms=10,             # 30
                               return_seconds=False,
                               visualize_probs=False
                               )

You can try tuning these parameters: you can lower threshold, min_speech_duration_ms and min_silence_duration_ms.
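To judge whether a parameter change actually helps, it may be useful to measure the share of unassigned lines after each run. The sketch below is a hypothetical helper, not part of the pipeline, and it assumes every speaker line in the output contains either a numbered "Speaker" label or "Speaker?".

```python
def unassigned_share(lines, marker="Speaker?"):
    """Return the fraction of speaker lines that carry the unassigned marker."""
    # Only lines that contain a speaker label at all are counted.
    labelled = [line for line in lines if "Speaker" in line]
    if not labelled:
        return 0.0
    unknown = sum(1 for line in labelled if marker in line)
    return unknown / len(labelled)


# Figures from the sample described above: about 1000 numbered lines, about 200 unassigned.
example = ["Speaker 1: text"] * 1000 + ["Speaker?: text"] * 200
print(f"{unassigned_share(example):.1%}")
```

Comparing this number across runs with different settings gives a rough indication of whether a change reduced the "Speaker?" lines.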

You can use the software Subtitle Edit to further process the final file.