Hi @abramovi,
The lexicon file is used to map words to token sequences for the target transcription, and with CTC we learn a probability for each token at each frame. If we meet a word (from the train or valid lists) that is not listed in the lexicon, we don't know how to map it to the token set. In w2l we fall back to the word's letter sequence, but in that case all of these letters must be in the tokens set. So my guess is that you have some word which is absent from the lexicon and whose letters are not all in the tokens file. Please check this.
Usually the lexicon is constructed from all words in the train and valid transcriptions.
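For instance, something along these lines (a minimal sketch, not the recipe's `prepare-data.py`; it assumes the standard wav2letter list format `<id> <audio path> <duration> <transcription...>` and letter tokens ending with the `|` word separator; file names are placeholders):

```python
# build_lexicon.py -- hypothetical helper, a minimal sketch.
# Collects every word from the train and valid lists and writes a
# letter-based lexicon entry for each.

words = set()
for lst in ("train.lst", "valid.lst"):  # placeholder paths
    with open(lst, encoding="utf-8") as f:
        for line in f:
            # assumed list format: <id> <audio path> <duration> <transcription...>
            words.update(line.split()[3:])

with open("lexicon.txt", "w", encoding="utf-8") as f:
    for word in sorted(words):
        # spell each word out letter by letter, ending with the word separator "|"
        f.write(word + "\t" + " ".join(word) + " |\n")
```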
Thank you @tlikhomanenko for your answer.
I rebuilt my lexicon; it now has the words in it.
Here is my current full log:

```
root@8170d4db421f:~/wav2letter/build# ./Train train --flagsfile /data/train.cfg
I0224 21:35:40.914047 84 Train.cpp:141] Gflags after parsing --flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=network.arch; --archdir=/data; --attention=content; --attentionthreshold=2147483647; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=4; --beamsize=2500; --beamsizetoken=250000; --beamthreshold=25; --blobdata=false; --channels=1; --criterion=asg; --critoptim=sgd; --datadir=/data/data/output; --dataorder=input; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=false; --encoderdim=0; --eosscore=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=40; --flagsfile=/data/train.cfg; --framesizems=25; --framestridems=10; --gamma=1; --gumbeltemperature=1; --input=wav; --inputbinsize=100; --inputfeeding=false; --isbeamdump=false; --iter=25; --itersave=false; --labelsmooth=0; --leftWindowSize=50; --lexicon=/data/data/output/am/lexicon.txt; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0; --localnrmlleftctx=0; --localnrmlrightctx=0; --logadd=false; --lr=0.10000000000000001; --lrcosine=false; --lrcrit=0; --maxdecoderoutputlen=200; --maxgradnorm=1; --maxisz=9223372036854775807; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=10485760; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=0; --minrate=3; --minsil=0; --mintsz=0; --momentum=0; --netoptim=sgd; --noresample=false; --nthread=4; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=100; --pow=false; --pretrainWindow=0; --replabel=1; --reportiters=0; --rightWindowSize=50; --rndv_filepath=; --rundir=/data/data/; --runname=librispeech_clean_trainlogs; --samplerate=16000; --sampletarget=0; --samplingstrategy=rand; --sclite=; --seed=0; --show=false; --showletters=false; --silscore=0; --smearing=none; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=1000000; --surround=|; --tag=; --target=tkn; --test=; --tokens=tokens.txt; --tokensdir=/data/data/output/am; --train=lists/train.lst.fix; --trainWithWindow=false; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=false; --valid=lists/valid.lst.fix; --weightdecay=0; --wordscore=0; --wordseparator=|; --world_rank=0; --world_size=1; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I0224 21:35:40.914216 84 Train.cpp:142] Experiment path: /data/data/librispeech_clean_trainlogs
I0224 21:35:40.914271 84 Train.cpp:143] Experiment runidx: 1
I0224 21:35:40.922464 84 Train.cpp:187] Number of classes (network): 29
I0224 21:35:41.045508 84 Train.cpp:194] Number of words: 54054
I0224 21:35:41.068939 84 Train.cpp:208] Loading architecture file from /data/network.arch
I0224 21:35:41.077917 84 Train.cpp:240] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> output]
    (0): View (-1 1 40 0)
    (1): Conv2D (40->256, 8x1, 2,1, SAME,SAME, 1, 1) (with bias)
    (2): ReLU
    (3): Conv2D (256->256, 8x1, 1,1, SAME,SAME, 1, 1) (with bias)
    (4): ReLU
    (5): Conv2D (256->256, 8x1, 1,1, SAME,SAME, 1, 1) (with bias)
    (6): ReLU
    (7): Conv2D (256->256, 8x1, 1,1, SAME,SAME, 1, 1) (with bias)
    (8): ReLU
    (9): Conv2D (256->256, 8x1, 1,1, SAME,SAME, 1, 1) (with bias)
    (10): ReLU
    (11): Conv2D (256->256, 8x1, 1,1, SAME,SAME, 1, 1) (with bias)
    (12): ReLU
    (13): Conv2D (256->256, 8x1, 1,1, SAME,SAME, 1, 1) (with bias)
    (14): ReLU
    (15): Conv2D (256->256, 8x1, 1,1, SAME,SAME, 1, 1) (with bias)
    (16): ReLU
    (17): Reorder (2,0,3,1)
    (18): Linear (256->512) (with bias)
    (19): ReLU
    (20): Linear (512->29) (with bias)
I0224 21:35:41.077971 84 Train.cpp:241] [Network Params: 3900445]
I0224 21:35:41.077986 84 Train.cpp:242] [Criterion] AutoSegmentationCriterion
I0224 21:35:41.078006 84 Train.cpp:250] [Network Optimizer] SGD
I0224 21:35:41.078037 84 Train.cpp:251] [Criterion Optimizer] SGD
Aborted at 1582580142 (unix time) try "date -d @1582580142" if you are using GNU date
PC: @ 0x5ca492 w2l::wrd2Target()
SIGSEGV (@0x8) received by PID 84 (TID 0x7f56544bdbc0) from PID 8; stack trace:
    @ 0x7f5649f6c390 (unknown)
    @ 0x5ca492 w2l::wrd2Target()
    @ 0x5cbfeb w2l::wrd2Target()
    @ 0x61ddc3 w2l::W2lListFilesDataset::loadListFile()
    @ 0x61e859 w2l::W2lListFilesDataset::W2lListFilesDataset()
    @ 0x62fb1a w2l::createDataset()
    @ 0x41ac91 main
    @ 0x7f56490e1830 __libc_start_main
    @ 0x48ce49 _start
    @ 0x0 (unknown)
Segmentation fault
```
I am not sure that it is related to my lexicon, as I use prepare-data.py to build it from all my data.
Any idea how to find out which word is missing?
Could you attach your tokens set file and your generated lexicon? I want to have a look at them to make sure they are fine.
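In the meantime, one way to hunt for the offending word yourself is a quick cross-check like the sketch below (hypothetical file paths taken from your flags; it assumes the standard list format `<id> <audio path> <duration> <transcription...>`):

```python
# check_coverage.py -- hypothetical diagnostic sketch: prints every word from
# the train/valid lists that is missing from the lexicon AND whose letters are
# not all covered by the tokens set (the case that breaks wrd2Target).

def load_lexicon_words(path):
    with open(path, encoding="utf-8") as f:
        return {line.split()[0] for line in f if line.strip()}

def load_tokens(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

lexicon = load_lexicon_words("lexicon.txt")  # placeholder paths
tokens = load_tokens("tokens.txt")

for lst in ("lists/train.lst.fix", "lists/valid.lst.fix"):
    with open(lst, encoding="utf-8") as f:
        for line in f:
            for word in line.split()[3:]:  # transcription starts at column 4
                if word not in lexicon:
                    bad = [c for c in word if c not in tokens]
                    if bad:
                        print(f"{lst}: '{word}' not in lexicon, letters {bad} not in tokens")
```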
Thank you!!!! The issue was, as you said, related to digits in my lexicon and tokens files.
> Hi @abramovi,
>
> The lexicon file is used to map words to token sequences for the target transcription, and with CTC we learn a probability for each token at each frame. If we meet a word (from the train or valid lists) that is not listed in the lexicon, we don't know how to map it to the token set. In w2l we fall back to the word's letter sequence, but in that case all of these letters must be in the tokens set. So my guess is that you have some word which is absent from the lexicon and whose letters are not all in the tokens file. Please check this.
>
> Usually the lexicon is constructed from all words in the train and valid transcriptions.
What should I do when some words are not in the lexicon/tokens but appear in the train/val text, and I don't want to put them into the lexicon/tokens files?
Do you want to skip these words during training at all?
> Do you want to skip these words during training at all?
Yes, should I replace these words by
> Do you want to skip these words during training at all?
I skip all samples that have unknown words, but now I have another problem.
We have a fallback to letters for unknown words, so with letter tokens only, any word will be present in the training/val transcriptions anyway (only unknown letters are skipped). So you need to preprocess your lists to skip the necessary words before running the Train binary.
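Such a preprocessing pass could look like this sketch (hypothetical file names; it strips words whose letters are not covered by the tokens set, and drops samples whose transcription becomes empty):

```python
# filter_lists.py -- hypothetical preprocessing sketch: removes words that are
# not fully covered by the tokens set before the list is fed to the Train binary.

tokens = {line.strip() for line in open("tokens.txt", encoding="utf-8") if line.strip()}

with open("train.lst", encoding="utf-8") as fin, \
     open("train.filtered.lst", "w", encoding="utf-8") as fout:
    for line in fin:
        parts = line.split()
        header, transcript = parts[:3], parts[3:]  # <id> <path> <duration> | words
        kept = [w for w in transcript if all(c in tokens for c in w)]
        if kept:  # skip samples whose transcription becomes empty
            fout.write(" ".join(header + kept) + "\n")
```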
Hi.
I am trying to train on my own small dataset using the librispeech recipe.
I am using the following config:
and I am getting the following: