alumae / kaldi-offline-transcriber

Offline transcription system for Estonian using Kaldi

Question about diarization #9

Closed vince62s closed 8 years ago

vince62s commented 8 years ago

Hi Tanel, I have a question about your diarization.sh script.

Are the files referenced in the lines below language independent? Do they come with the LIUM package? If not, how do we generate them? Thanks.

# define the directory where the results will be saved
datadir=`dirname $uem`

# define where the UBM GMM is
ubm=models/ubm.gmm

# define where the speech / non-speech set of GMMs is
pmsgmm=./model/sms.gmms
pmsgmm=models/sms.gmms

# define where the silence set of GMMs is
sgmm=models/s.gmms

# define where the gender and bandwidth set of GMMs (4 models) is
# (female studio, male studio, female telephone, male telephone)
ggmm=models/gender.gmms

alumae commented 8 years ago

The speaker diarization models are technically language dependent and they have been trained on Estonian broadcast data. But they could work with moderate accuracy for other languages.

vince62s commented 8 years ago

Thank you for your answer. I must be missing some concepts.

Basically, did you follow the steps at http://www-lium.univ-lemans.fr/diarization/doku.php/gaussian_gmm_training with your own data? I don't really understand how to feed the training with audio and SID info to obtain the gender.gmms, s.gmms, sms.gmms and ubm.gmm files.

Also, in the Makefile I see some references to the sre08 recipe. How is this linked to the diarization? I don't really understand how to create the "models" that would say this speaker is Bill Gates.

Sorry if this sounds too basic, but I don't see any tutorial on diarization or SID.

alumae commented 8 years ago

Yes, I followed the above steps on Estonian broadcast data. Note that the gender.gmms and ubm.gmm models are actually not used anywhere in the diarization process. I think they were used before, but they are not needed any more. Sorry about the confusion.

The steps from the SRE08 recipe are used for speaker ID. Diarization clusters the speech segments, and SID assigns the clusters a name. You need SID models to do that. My SID models are trained on the roughly 500 most common speakers from my Estonian broadcast data -- I can do that because the transcriptions contain speaker names.
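
The division of labour described above (diarization produces anonymous clusters, SID then names them) can be sketched roughly as follows. This is only an illustration, not the code in this repository: the input format (one line per cluster and enrolled speaker with a similarity score), the function name `name_clusters`, the speaker labels and the threshold are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical input on stdin: "<cluster> <speaker> <score>", one line per pair.
# For each cluster, keep the best-scoring enrolled speaker; if even the best
# score is below the threshold, the cluster stays anonymous ("unknown").
name_clusters() {
  local threshold=$1
  awk -v thr="$threshold" '
    { score = $3 + 0 }
    !($1 in best) || score > best[$1] { best[$1] = score; name[$1] = $2 }
    END {
      for (c in best)
        print c, (best[c] >= thr + 0 ? name[c] : "unknown")
    }' | sort
}

# Made-up scores for two diarization clusters and two enrolled speakers.
printf '%s\n' \
  "S0 speaker_a 12.5" \
  "S0 speaker_b 3.1" \
  "S1 speaker_a -4.0" \
  "S1 speaker_b 1.2" | name_clusters 5
```

The threshold is what lets a cluster remain unnamed when the person speaking is not among the enrolled speakers.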

vince62s commented 8 years ago

I think I understand the whole process now (well, in theory...). For the diarization training as well as for the sre08 recipe, do you know how much audio data is needed?

In fact, what confuses me is that during the diarization process, given adequate speaker info, the speaker ID could be done at that time, right?

alumae commented 8 years ago

I'm using around 100 hours of audio. It's a good amount for diarization. For speaker ID, I think it's enough if you have around 20 minutes for each speaker who you want to ID (but I'm no expert on SID). Yes, speaker ID could be done using LIUM tools, I think (haven't tried).

vince62s commented 8 years ago

Another question on diarization. I see in the code that segments longer than 20 seconds are split ("useful for transcription"). Why is this useful for transcription? Also, what exactly is the rule? Where is the cut made when a segment is, for instance, 35 seconds long?

Related question: besides a change of speaker, what causes a segment cut? A silence of "x ms"? I don't see that clearly in the code.

Thanks. Vincent

alumae commented 8 years ago

Speech recognition is done by first generating lattices and then rescoring them using a bigger language model. If the segments were very long, the lattices would be very large (larger than several smaller lattices), and rescoring would blow up memory. But you can test, 35 secs might still be OK.

I don't know all the details about how the diarization works, I pretty much use it as a black box.
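
To make the idea of capping segment length concrete, here is a naive sketch that cuts an over-long segment into equal pieces. It is not the rule SSplitSeg applies (how that tool picks its cut points is not described in this thread); the function name `split_segment` and its arguments are hypothetical, and lengths are in whatever unit the segment file uses.

```shell
#!/usr/bin/env bash
# Naive illustration only: cut a segment into near-equal pieces, none longer
# than max_len. Arguments: start, length, max_len (all in the same unit).
# Prints one "start length" line per piece.
split_segment() {
  local start=$1 length=$2 max_len=$3
  if (( length <= 0 || max_len <= 0 )); then
    return 0
  fi
  # Smallest number of pieces such that each piece fits in max_len.
  local pieces=$(( (length + max_len - 1) / max_len ))
  local base=$(( length / pieces ))
  local extra=$(( length % pieces ))
  local i piece_len
  for (( i = 0; i < pieces; i++ )); do
    # Spread the remainder over the first pieces so lengths differ by at most 1.
    piece_len=$base
    if (( i < extra )); then
      piece_len=$(( base + 1 ))
    fi
    echo "$start $piece_len"
    start=$(( start + piece_len ))
  done
}

# A segment of length 3500 with a cap of 2000 becomes two pieces of 1750.
split_segment 0 3500 2000
```

A real splitter would more plausibly cut at pauses instead of at fixed offsets, so that words are not chopped in half before recognition.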

vince62s commented 8 years ago

Thanks for your feedback, I see. I tested by replacing 2000 with 6000 in the code below. Do you know what these two parameters mean: --sSegMaxLen=2000 and --sSegMaxLenModel=2000? The first one seems obvious (even though I don't see why 2000 means 20 s; I would have thought 20000 ms). What is the second one? Maybe I need to ask someone at LIUM.

# Split segments longer than 20s (useful for transcription)
splseg=./$datadir/$show.spl.seg
$java -Xmx1024m -classpath "$LOCALCLASSPATH" fr.lium.spkDiarization.tools.SSplitSeg --help \
 --sFilterMask=$pmsseg --sFilterClusterName=iS,iT,j --sInputMask=$adjseg --sSegMaxLen=2000 --sSegMaxLenModel=2000 \
 --sOutputMask=$splseg --fInputMask=$features --fInputDesc=audio2sphinx,1:3:2:0:0:0,13,0:0:0 --tInputMask=$sgmm $show
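
One possible reading of the 2000 value, offered as an assumption and not confirmed anywhere in this thread: if segment lengths are counted in feature frames with a 10 ms frame shift (100 frames per second), then 2000 frames correspond to 20 s. The helper names below are hypothetical and only show the arithmetic.

```shell
#!/usr/bin/env bash
# Assumption: lengths are counted in feature frames, 100 frames per second
# (10 ms frame shift). This is not stated in the thread.
FRAMES_PER_SECOND=100

# Convert a frame count to whole seconds (integer division).
frames_to_seconds() {
  echo $(( $1 / FRAMES_PER_SECOND ))
}

# Convert a duration in whole seconds to a frame count.
seconds_to_frames() {
  echo $(( $1 * FRAMES_PER_SECOND ))
}

frames_to_seconds 2000   # the value used in the script
frames_to_seconds 6000   # the value tried in the test above
seconds_to_frames 35     # the 35-second example from earlier in the thread
```

Under that assumption, raising --sSegMaxLen from 2000 to 6000 would raise the cap from 20 s to 60 s.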