Closed — phecda-xu closed this issue 5 years ago
Hi,

* When using the `-am` flag, the program is killed in the trie-construction step. Can you show me the lexicon file you are using? Is `七十二岁` a valid word in it with spelling `七 十 二 岁`? If yes, is `七十二岁` a valid word in your LM?
* When using the `-emission_dir` flag, you should not run into the branch `I0221 11:26:01.354352 6313 Decode.cpp:145] [Serialization] Running forward pass ...`, because the forward pass is only called when you want to use the acoustic model instead of the emission set. So please check your input flags again.
Thanks for your reply!
For the `-am` flag: my lexicon.txt looks like this:

```
一幅 一 幅
观赏 观 赏
时节 时 节
七十二岁 七 十 二 岁
弟弟 弟 弟
时贴 时 贴
启事 启 事
哥哥 哥 哥
找回 找 回
三十多年 三 十 多 年
不便 不 便
...
```
And the LM looks like this:

```
-5.996684 种出 -0.080535516
-4.73482 一幅 -0.08924663
-4.6979203 观赏 -0.12459301
-4.9250655 时节 -0.23023589
-5.537371 七十二岁 -0.1275499
-4.3757763 弟弟 -0.16132338
-5.996684 时贴 -0.080535516
-5.6808887 启事 -0.080535516
...
```
The corpus files used to generate the LM look like this:

```
十几根 鎏金 石雕 圆柱 承托 着 两层 圆形 屋顶 远看 颇似 我国 传统 的 重檐 圆亭
几个 月 来 哈利法 埃米尔 一直 未 前往 埃米尔 宫 治理 朝政 王宫 事务 实际上 已 由 哈马德 主持
开发区 以 行业 规划 分为 工业区 商贸 区 旅游区 生活区 和 公共建筑 区
七十二岁 的 黄浦区 第一 饮食 公司 退休职工 沈 治平 身板 硬朗 因为 老伴 卧床 养病 便成 了 高龄 马 大嫂
据 认为 阿育王 于 公元前 二 四九年 建 在 蓝毗尼 的 这根 石柱 是 他 用来 标志 释迦牟尼 诞生 处 的
硬骨头六连 历任 主管 战斗英雄 和 转业 退伍 干部战士 代表 应邀 坐在 嘉宾 席上
阿拉法特 的 夫人 苏哈抱 着 不满 五个 月 的 女儿 扎哈瓦 在 耶稣 诞生 的 马槽 旁 留影
```
If a word is unavailable (not in the LM or lexicon.txt), the program warns `skip unknown tokens ##`. I noticed this phenomenon before and rebuilt the `.wrd` and `.tkn` files to avoid it; the unknown-token warnings have disappeared now. So I think maybe something else is causing this error. Do you have any ideas?
Anyway, I'd like to rebuild the LM and lexicon.txt to check again!
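One way to rule out lexicon/LM mismatches before rebuilding everything is a quick consistency check. The sketch below (my own helper, not part of wav2letter) collects the first column of each lexicon line and compares it against the unigram section of an ARPA-format LM; the file contents here are tiny inline samples, not real paths:

```python
# Hypothetical sanity check: verify lexicon words appear as unigrams in
# the ARPA LM, so the decoder's trie construction never sees a word the
# LM cannot score.

def load_lexicon_words(lines):
    # In a wav2letter lexicon, the first column is the word and the
    # remaining columns are its token spelling.
    return {line.split()[0] for line in lines if line.split()}

def load_arpa_unigrams(lines):
    # ARPA unigram lines look like: "<logprob>\t<word>\t<backoff>".
    unigrams = set()
    in_unigrams = False
    for line in lines:
        line = line.strip()
        if line == "\\1-grams:":
            in_unigrams = True
            continue
        if in_unigrams:
            if line.startswith("\\"):  # next section, e.g. \2-grams:
                break
            parts = line.split()
            if len(parts) >= 2:
                unigrams.add(parts[1])
    return unigrams

lex = load_lexicon_words(["七十二岁 七 十 二 岁", "弟弟 弟 弟"])
lm_lines = ["\\1-grams:", "-5.537371\t七十二岁\t-0.1275499", "\\2-grams:"]
missing = lex - load_arpa_unigrams(lm_lines)
print(missing)  # lexicon words absent from the LM's unigrams
```

Any word printed in `missing` would be exactly the kind of entry that triggers the `skip unknown tokens` warning.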
For the `-emission_dir` flag: I am sure that I changed the `-am` flag to the `-emission_dir` flag. However, it is strange that once I run the decode operation, the `-am` flag gets set again. I then explicitly set `-am=` and `-emission_dir=/../../`, and this time the flag output shows they are set correctly.
But, just like with `-am`, it is killed again.
```
./wav2letter/build/Decoder decoder --flagsfile ./wav2letter/tutorials/THCH/decode.cfg
I0222 01:47:14.797297 6401 Decode.cpp:111] Gflags after parsing
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=; --arch=network.arch; --archdir=./wav2letter/tutorials/THCH/; --attention=content; --attnWindow=no; --batchsize=16; --beamscore=15; --beamsize=100; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=/data/wav2letter++/THCH/; --dataorder=input; --devwin=0; --emission_dir=/data/wav2letter++/THCH/; --enable_distributed=false; --encoderdim=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=40; --flagsfile=./wav2letter/tutorials/THCH/decode.cfg; --forceendsil=false; --gamma=1; --garbage=false; --input=wav; --inputbinsize=100; --inputfeeding=false; --iter=100; --itersave=false; --labelsmooth=0; --leftWindowSize=50; --lexicon=/data/wav2letter++/THCH/lm/lexicon.txt; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=/data/wav2letter++/THCH/lm/5-gram.bin; --lmtype=kenlm; --lmweight=4; --localnrmlleftctx=0; --localnrmlrightctx=0; --logadd=false; --lr=0.10000000000000001; --lrcrit=0; --maxdecoderoutputlen=200; --maxgradnorm=1; --maxisz=9223372036854775807; --maxload=50; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=10485760; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=0; --minrate=3; --minsil=0; --mintsz=0; --momentum=0; --netoptim=sgd; --noresample=false; --nthread=8; --nthread_decoder=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=100; --pow=false; --replabel=2; --reportiters=1000; --rightWindowSize=50; --rndv_filepath=; --rundir=/data/wav2letter++/THCH/; --runname=thch_trainlogs; --samplerate=16000; --samplingstrategy=rand; --sclite=/data/wav2letter++/THCH/logs/; --seed=0; --show=true; --showletters=true; 
--silweight=-1; --skipoov=false; --smearing=max; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=1000000; --surround=|; --tag=; --target=tkn; --targettype=video; --test=data/test; --tokens=data/tokens.txt; --tokensdir=/data/wav2letter++/THCH/; --train=data/train; --trainWithWindow=false; --transdiag=0; --unkweight=-inf; --valid=data/dev; --weightdecay=0; --wordscore=2.2000000000000002; --world_rank=0; --world_size=1; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I0222 01:47:14.815636 6401 Decode.cpp:117] Number of classes (network): 6319
loadWords from /data/wav2letter++/THCH/lm/lexicon.txt
I0222 01:47:14.959055 6401 Utils.cpp:339] [Words] 173604 tokens loaded.
I0222 01:47:15.017617 6401 Decode.cpp:121] Number of words: 173604
I0222 01:47:15.067452 6401 Decode.cpp:181] [Dataset] Number of samples per thread: 7
I0222 01:47:17.783012 6409 Decode.cpp:262] [Decoder] LM constructed.
I0222 01:47:17.783017 6405 Decode.cpp:262] [Decoder] LM constructed.
I0222 01:47:17.783030 6407 Decode.cpp:262] [Decoder] LM constructed.
I0222 01:47:17.783030 6402 Decode.cpp:262] [Decoder] LM constructed.
I0222 01:47:17.783041 6404 Decode.cpp:262] [Decoder] LM constructed.
I0222 01:47:17.783012 6406 Decode.cpp:262] [Decoder] LM constructed.
I0222 01:47:17.783066 6408 Decode.cpp:262] [Decoder] LM constructed.
I0222 01:47:17.783017 6403 Decode.cpp:262] [Decoder] LM constructed.
Killed
```
I guess it hits the same problem as the acoustic-model path did. Apart from the LM and lexicon.txt, are there any other reasons it could be killed? Thank you very much!
Hi, for the `-am` reappearing issue, I think it's a bug. I will send out a fix tomorrow.
Your lexicon file and LM look good to me. I believe the program is killed due to memory usage. We need to build a trie before decoding, and in your case I can imagine it will be huge. You have 173604 words and 6319 tokens, so each node in the trie will have a children-pointer vector of 6319 elements. In my experience, 200K words with 5K tokens gives a trie of about 15 GB. It looks like you are using 8 threads, so about 120 GB of memory would be required for the tries alone.
One possible solution is to use fewer threads. The other is to make a code change, if you want. Hint: move https://github.com/facebookresearch/wav2letter/blob/master/Decode.cpp#L252-L308 outside the decoding function so that the LM and trie are shared among all the threads.
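The arithmetic above can be sketched as a back-of-the-envelope estimate. This is my own rough model, not wav2letter's actual allocator: I assume each trie node holds one 8-byte pointer per token, and I fold prefix sharing into an "effective nodes per word" factor (about 2 reproduces the 15 GB figure quoted above):

```python
# Rough trie-memory estimate. Assumptions (mine, not from the code):
# each node stores a children vector of one 8-byte pointer per token,
# and prefix sharing reduces the node count to ~nodes_per_word per word.
POINTER_BYTES = 8

def trie_memory_gb(num_words, num_tokens, nodes_per_word, num_threads=1):
    nodes = num_words * nodes_per_word          # total trie nodes
    per_trie = nodes * num_tokens * POINTER_BYTES  # bytes for one trie
    return per_trie * num_threads / 1024**3     # GiB across all threads

# 200K words, 5K tokens, ~2 effective nodes per word:
print(round(trie_memory_gb(200_000, 5_000, 2, 1), 1))  # ≈ 14.9 (one thread)
print(round(trie_memory_gb(200_000, 5_000, 2, 8), 1))  # ≈ 119.2 (eight threads)
```

With each thread building its own copy, the per-trie cost multiplies by the thread count, which is why sharing the trie (or dropping to one thread) matters so much here.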
Hello, thanks for your reply! I set the thread count to 1 (with `--beamsize=1000, --beamscore=100`), and it works!
My computer only has 15.6 GB of memory, and it takes almost 15.4 GB to build the trie and do the other operations, so decoding is very slow.
```
I0222 06:12:12.423243 6539 Utils.cpp:339] [Words] 175353 tokens loaded.
I0222 06:12:12.484014 6539 Decode.cpp:121] Number of words: 175353
I0222 06:12:12.559342 6539 NumberedFilesLoader.cpp:29] Adding dataset /data/wav2letter++/THCH/data/test ...
I0222 06:12:12.741766 6539 NumberedFilesLoader.cpp:68] 669 files found.
I0222 06:12:19.028533 6539 Utils.cpp:102] Filtered 0/669 samples
I0222 06:12:19.028617 6539 W2lNumberedFilesDataset.cpp:57] Total batches (i.e. iters): 669
I0222 06:12:19.028681 6539 Decode.cpp:145] [Serialization] Running forward pass ...
I0222 06:12:27.173827 6539 Decode.cpp:181] [Dataset] Number of samples per thread: 50
I0222 06:12:28.619987 6579 Decode.cpp:262] [Decoder] LM constructed.
I0222 06:27:08.930788 6579 Decode.cpp:296] [Decoder] Trie planted.
I0222 06:30:53.158293 6579 Decode.cpp:308] [Decoder] Trie smeared.
I0222 06:30:53.177062 6579 Decode.cpp:314] [Decoder] Decoder loaded in thread: 0
|T|: 而 此时 正赶上 咸阳 地 市 机构 变化 原 咸阳市 改为 秦都区 咸阳 地区 改为 咸阳市
|P|: 而且 是 站 咸阳 地 机构 变化 原 为 其中 咸阳 地区 改为 咸阳市
|t|: 而|此时|正赶上|咸阳|地|市|机构|变化|原|咸阳市|改为|秦都区|咸阳|地区|改为|咸阳市
|p|: 而且|是|站|咸阳|地|机构|变化|原|为|其中|咸阳|地区|改为|咸阳市
```
It takes almost 18 minutes to prepare the trie and everything else before printing a decode result, and 9 minutes to decode one sample. Anyway, it finally works! Maybe changing some parameters can speed it up; I'll try. Thanks for your help!
As you know, a large search space (token-set size) will definitely hurt the performance of any beam-search engine. We only optimized our decoder on English, whose token set has about 30 entries. Suggestion: reduce `-nthread_decoder`.
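For reference, a decode config along these lines follows the suggestions in this thread (the paths are placeholders, not taken from the logs above; the flag names match the gflags dump earlier in the thread):

```
# decode.cfg sketch — single decoder thread to keep trie memory down
--nthread_decoder=1
--beamsize=1000
--beamscore=100
--lexicon=/path/to/lexicon.txt
--lm=/path/to/5-gram.bin
--lmtype=kenlm
```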
Hello, I have trained on the THCH30 dataset. The train and test stages are OK, but decoding fails!
Failure information: the process is killed both with the `-am` flag and with the `-emission_dir` flag.
I have tried changing some parameters, but it doesn't help. Any ideas about this problem? Looking forward to your reply! Thank you very much!