alumae / kaldi-offline-transcriber

Offline transcription system for Estonian using Kaldi

About the model files: gender.gmms, s.gmms, sms.gmms, ubm.gmm #15

Closed shunfeichen closed 7 years ago

shunfeichen commented 7 years ago

Hello. I want to use LIUM and Kaldi for ASR. However, the original LIUM models (gender.gmms, s.gmms, sms.gmms, ubm.gmm) were trained on French data. How can I train these four models on my English corpus?

alumae commented 7 years ago

See http://www-lium.univ-lemans.fr/diarization/doku.php/gaussian_gmm_training. The documentation is quite sparse, I know.

shunfeichen commented 7 years ago

Yes, I have seen it, but it is very brief. I don't understand how to train on many wav files. Do I need to merge them into a single wav file?

alumae commented 7 years ago

No, the first column in the seg file is the file ID, something like:

file1 1 2934 743 U U U speech
file2 1 3679 379 U U U speech

And you can use the --fInputMask argument of e.g. fr.lium.spkDiarization.programs.MTrainInit to specify the location of the feature files (--sInputMask does the same for the seg file).
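
To make the multi-file seg format concrete, here is a minimal Python sketch that writes one line per audio file, each covering the whole file. The column meanings beyond the file ID (channel, start, length, gender, band, environment, label) and the 100 frames per second rate are assumptions based on the example lines above, not something confirmed in this thread, and the function names are hypothetical.

```python
# Assumption: segment start and length are counted in feature frames,
# with a 10 ms frame shift (100 frames per second).
FRAMES_PER_SECOND = 100


def seg_line(file_id, start_frame, num_frames, label="speech"):
    """Format one seg line: file ID, channel, start, length, then the
    three 'U' (unknown) fields and the label, as in the example above."""
    return f"{file_id} 1 {start_frame} {num_frames} U U U {label}"


def build_seg(durations_sec):
    """Build seg file text with one whole-file segment per file ID.

    durations_sec maps a file ID (hypothetical names) to its length in seconds.
    """
    lines = []
    for file_id, duration in durations_sec.items():
        num_frames = round(duration * FRAMES_PER_SECOND)
        lines.append(seg_line(file_id, 0, num_frames))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_seg({"file1": 7.43, "file2": 3.79}), end="")
```

So the wav files stay separate: each one gets its own file ID in the first column, and a single seg file lists them all.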

I'll copy/paste my Makefile that I use to train the sms.gmms file. It's out of scope of this project, so I'm not going to give you any support on this:

build/sms_seg/all.seg: build/sms_seg/labels/jutusaated.trainset.seg build/sms_seg/labels/intervjuud.trainset.seg build/sms_seg/labels/ak.trainset.seg build/sms_seg/labels/paevakaja.trainset.seg
        mkdir -p `dirname $@`
        for f in $^; do \
                domain=`basename $$f .trainset.seg`; \
          cat $$f | perl -npe "s/^/$$domain\//"; \
         done > $@

build/data/unsegmented_liumfeat: build/sms_seg/all.seg
        rm -rf $@
        for d in `cat build/sms_seg/all.seg | cut -f 1 -d " "| uniq | perl -npe 's/\/.*//' | uniq`; do \
                mkdir -p $@/$$d; \
        done
        cat build/sms_seg/all.seg | cut -f 1 -d " "| uniq > tmp.fileids
        $(SPHINXTRAIN_BIN)/wave2feat -verbose yes -c tmp.fileids -mswav yes -di build/data/unsegmented_audio -ei wav -do $@ -eo feat 
        rm tmp.fileids

#java -cp ~/tools/LIUM_SpkDiarization.season_2/bin:/home/tanel/tools/LIUM_SpkDiarization.season_2/lib/java-getopt-1.0.13.jar:/home/tanel/tools/LIUM_SpkDiarization.season_2/lib/sphinx4.jar:/home/tanel/tools/LIUM_SpkDiarization.season_2/lib/ejml-0.23.jar -Xmx1024m fr.lium.spkDiarization.programs.MTrainInit 
build/sms_seg/sms.init.gmms: build/data/unsegmented_liumfeat
        java -cp ~/tools/LIUM_SpkDiarization-4.2.jar -Xmx1024m fr.lium.spkDiarization.programs.MTrainInit \
        --fInputMask=./build/data/unsegmented_liumfeat/%s.feat \
        --emInitMethod=split_all --emCtrl=1,5,0.05 \
        --nbComp=64 --kind=DIAG --fInputDesc="sphinx,1:3:2:0:0:0,13,0:0:0" \
        --sInputMask=./build/sms_seg/%s.seg --tOutputMask=$@ all

build/sms_seg/sms.gmms: build/sms_seg/sms.init.gmms
        java -cp ~/tools/LIUM_SpkDiarization-4.2.jar -Xmx1024m fr.lium.spkDiarization.programs.MTrainEM \
        --fInputMask=./build/data/unsegmented_liumfeat/%s.feat  \
        --nbComp=16 --kind=DIAG --fInputDesc="sphinx,1:3:2:0:0:0,13,0:0:0" \
        --emCtrl=1,20,0.01 \
        --sInputMask=./build/sms_seg/%s.seg \
        --tOutputMask=$@ --tInputMask=$^ all

shunfeichen commented 7 years ago

Thank you very much! I can't express my gratitude in words.

shunfeichen commented 7 years ago

Hi alumae, in your Makefile for training sms.gmms, the file "all.feat" does not exist. How do you generate it?

alumae commented 7 years ago

Actually, you don't need all.feat, but build/data/unsegmented_liumfeat/%s.feat, where the LIUM program replaces %s with the file ID. The individual feat files are generated by this rule:

build/data/unsegmented_liumfeat: build/sms_seg/all.seg
        rm -rf $@
        for d in `cat build/sms_seg/all.seg | cut -f 1 -d " "| uniq | perl -npe 's/\/.*//' | uniq`; do \
                mkdir -p $@/$$d; \
        done
        cat build/sms_seg/all.seg | cut -f 1 -d " "| uniq > tmp.fileids
        $(SPHINXTRAIN_BIN)/wave2feat -verbose yes -c tmp.fileids -mswav yes -di build/data/unsegmented_audio -ei wav -do $@ -eo feat 
        rm tmp.fileids
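
The %s substitution described above can be illustrated with a short Python sketch. It is not LIUM's code, only a model of the behaviour as explained in this thread: the trailing `all` argument fills the %s in --sInputMask, and each file ID found in the first column of that seg file fills the %s in --fInputMask. The helper names and the sample file IDs are hypothetical.

```python
def resolve_mask(mask, name):
    """Replace the %s placeholder in a LIUM-style mask with a name."""
    return mask.replace("%s", name)


def file_ids(seg_text):
    """Return the distinct file IDs (first column) in order of appearance."""
    ids = []
    for line in seg_text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        file_id = line.split()[0]
        if file_id not in ids:
            ids.append(file_id)
    return ids


if __name__ == "__main__":
    # The positional argument "all" selects the seg file.
    print(resolve_mask("./build/sms_seg/%s.seg", "all"))

    # Hypothetical contents of all.seg; IDs carry the domain prefix
    # that the all.seg rule prepends.
    seg_text = (
        "jutusaated/file1 1 2934 743 U U U speech\n"
        "ak/file2 1 3679 379 U U U speech\n"
    )
    # Each file ID selects its own feature file.
    for fid in file_ids(seg_text):
        print(resolve_mask("./build/data/unsegmented_liumfeat/%s.feat", fid))
```

Under this reading there is never a single all.feat: only all.seg is shared, and the features are read per file.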

shunfeichen commented 7 years ago

Oh, thank you very much! Actually, I still don't know how to train a UBM using a lot of audio files, and I am really confused. I know how to generate the feat files, but I don't know how to use them to train the model. As I understand it, I should generate all.seg and all.feat and pass them as parameters to MTrainInit and MTrainEM. I really don't know how to do this.