flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Different results when converting the model #954

Closed tranmanhdat closed 3 years ago

tranmanhdat commented 3 years ago

Bug Description

I use branch v0.2 to train and convert a streaming convnet model. After 65 epochs I stopped training on my custom data and tested the trained model. When I use the Decoder binary (beam-search decoder), the result is really bad (attached as transcript.lst.log), but when I use the converted model, the result is acceptable (attached as converted_model.txt).

Reproduction Steps

Trained 65 epochs on about 1000 hours of audio. The training config is attached as train_asr.cfg, the decoding config as decode_500ms_right_future_ngram_other.cfg, and the config for the converted model as decoder_options.json.

Platform and Hardware

Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz, Tesla V100 16 GB, Ubuntu 18.04, running in nvidia-docker 2. attach.zip

tlikhomanenko commented 3 years ago

Hey!

tranmanhdat commented 3 years ago

Hi, here is the result when using the Test binary:

[Test /root/src/test_decoder/transcript.lst (14 samples) in 3.23151s (actual decoding time 0.231s/sample) -- WER: 196.721, LER: 128.406]

tranmanhdat commented 3 years ago

Below is the result when I use the Test binary with --uselexicon=false, but the result is still bad. I think the --uselexicon flag is ignored because the decodertype flag is wrd (as explained here: decode flags). An example of such a Test invocation is sketched after the output below.

|T|: t ỷ l ệ n ư ớ c t h ấ t t h o á t g i ả m t ừ b a m ư ơ i p h ầ n t r ă m n ă m h a i n g à n m ư ờ i b ố n n a y h ạ x u ố n g c ò n h a i m ư ơ i b a p h ẩ y n ă m p h ầ n t r ă m |P|: l o n g l à t h u ậ n l à s a o l à m à y l à c ả i l à đ ả m l à c ò n l à v à o l à s ự l à d â n l à h ọ c l à n ó i l à t a l à t r ờ i l à đ ộ n g l à h à n g l à đ i ề u l à đ o à n l à l i n h l à b ạ n l à t a l à s ự l à v à o l à v à i l à n ó i l à d â n l à h ọ c _ l à [sample: train337326, WER: 200%, LER: 128.205%, total WER: 200%, total LER: 128.205%, progress (thread 0): 10%] |T|: t ô i k h ô n g b a o g i ờ b u ồ n t r ư ớ c n h ữ n g n ỗ i b u ồ n n g ư ờ i k h á c m a n g đ ế n |P|: a n h l à m ộ t l à t ố t l à n h â n l à m ắ c l à t h ứ l à đ ó l à d ị l à m ắ c l à m à l à c h ư a l à t u ổ i l à t h ể l à [sample: train13719, WER: 200%, LER: 119.048%, total WER: 200%, total LER: 125%, progress (thread 0): 20%] |T|: t h ờ i đ i ể m h i ệ n n a y l à v à o đ ầ u t h ờ i k ỳ c a o đ i ể m d ị c h t r ù n g v ớ i d ị p t ế t n g u y ê n đ á n n h i ề u k h ả n ă n g d ị c h c ú m g i a c ầ m s ẽ t i ế p t ụ c c ò n l â y l a n t r o n g c á c n g à y t ớ i |P|: a n h l à n g h ĩ l à m ư ơ i l à đ i ề u l à c ó l à t ừ l à đ â y l à g i a l à k h í l à v i ệ t l à n g h ĩ l à m ô n l à h o a l à c ũ n g l à b ã i l à m ệ n h l à s ử l à n ấ p l à b a l à n g à n h l à v ụ l à m ô n l à ẫ n l à v ị l à d ầ n l à đ i l à t h ô n g l à t ư ợ n g l à b ạ n l à c o l à c ơ n l à e m l à đ ư ợ c l à g i ờ l à t r ư ờ n g _ l à [sample: train48808, WER: 194.286%, LER: 121.935%, total WER: 197.333%, total LER: 123.582%, progress (thread 0): 30%] |T|: đ i ể m q u a m ộ t v à i t á c p h ẩ m đ ã v à đ a n g t h ự c h i ệ n t h ế g i ớ i g i ả i t r í v ấ n đ ề v ố n n h ạ y c ả m v à l u ô n n ằ m t r o n g s ự t h u h ú t c ủ a c ô n g c h ú n g đ ư ợ c c á c đ ạ o d i ễ n k h a i t h á c t r i ệ t đ ể |P|: c á i l à c ô l à c á i l à q u a n h l à t h ư ờ n g l à h ồ n l à đ ể l à t h ì l à đ ị n h l à đ ố i l à m ư ơ i l à n h i ề u l à h o à n l à c h i ế n l à t ổ n g l à đ á n h l à n g h i ệ p l à r i ê n g l à m ẻ l à ý l à t h ì l à t á c l à m o n g l à e m l à t h ấ y l à l ờ i l à h ề l à v à l à r ấ t l à l à c h o l à đ ư ợ c l à l à s ĩ l à m ư a l à c h ú c l à b ấ y l à r a _ l à [sample: train49111, WER: 197.368%, LER: 123.636%, total WER: 197.345%, total LER: 123.6%, progress (thread 0): 40%] |T|: t ạ i t r ụ s ở c ơ q u a n đ i ề u t r a u y ê n v à n h i đ ã t h ú n h ậ n h à n h v i p h ạ m t ộ i c ủ a m ì n h |P|: t r ê n l à m ế n l à c h i a l à l ú c l à n h à l à c ù n g l à c h ơ i l à s ử l à t h ì l à t h í l à đ ể l à đ ố i l à h ợ p l à x u ấ t l à x e m l à v ư ơ n g l à n g h i ê n l à v à l à đ ã l à [sample: train31028, WER: 200%, LER: 138.961%, total WER: 197.727%, total LER: 125.65%, progress (thread 0): 50%] |T|: b ạ n n ê n đ ổ i t i ề n t ạ i n g â n h à n g v ì đ ổ i k h á c h s ạ n s ẽ c h ị u t h ê m p h ụ p h í |P|: đ ầ u l à đ a n g l à v i l à k ế t l à t r ê n l à b i ê n l à đ ư ờ n g l à c h í n h l à v i l à l à m l à h o ặ c l à l ồ l à đ i l à c o i l à l ậ p l à đ o ạ n l à t r ò l à [sample: train23813, WER: 200%, LER: 135.714%, total WER: 197.987%, total LER: 126.739%, progress (thread 0): 60%] |T|: c ó n h ữ n g t i n đ ồ n v ô h ạ i c ũ n g c ó n h ữ n g t i n ả n h h ư ở n g đ ế n đ ờ i s ố n g s i n h h o ạ t đ ế n s ự b ì n h y ê n c ủ a g i a đ ì n h m ì n h |P|: a n h l à đ ó l à k i n h l à á m l à g i ả i l à n h ạ c l à n h ữ n 
g l à k h ô n g l à đ ó l à k i n h l à k h u l à h ò a l à t h ể l à b à i l à t h á n g l à q u a l à t r u y ề n l à t h ể l à t h ấ y l à q u ý l à p h ó n g l à v à l à v ị l à b i ệ t l à đ ã l à [sample: train48618, WER: 196%, LER: 130.556%, total WER: 197.701%, total LER: 127.285%, progress (thread 0): 70%] |T|: n h ữ n g h ộ c â u t r ộ m đ i ệ n n à y l u ô n c ử n g ư ờ i g á c n g a y đ i ể m c â u t r ộ m đ i ệ n |P|: đ ó l à k i ể m l à đ ạ i l à k i ệ t l à c h ế l à n h ư l à m ẹ l à m ã l à m à l à v ờ l à t h í c h l à n g h ĩ l à đ ạ i l à k i ệ t l à c h ế l à [sample: train37515, WER: 200%, LER: 120.588%, total WER: 197.884%, total LER: 126.731%, progress (thread 0): 80%] |T|: n h ờ c â y c ầ u ấ y b â y g i ờ n g ư ờ i d â n q u ê t ô i m u ố n đ i m u ố n v ề l ú c n à o c ũ n g đ ư ợ c |P|: g i ó l à v i ệ n l à t ỉ n h l à m ọ i l à b ả n l à n h â n l à m à l à h ọ l à l ắ n g l à a n h l à n ữ a l à v ề l à n ữ a l à t ô i l à n ế u l à l à n h ữ n g l à c h o _ l à [sample: train13290, WER: 188.889%, LER: 129.73%, total WER: 197.101%, total LER: 126.979%, progress (thread 0): 90%] |T|: h a g i ả t ỉ n h ư b ữ a n a y m ì n h đ i l ạ i c á i h ộ n à y |P|: m l à g i ả l à l o n g l à v ớ i l à t ù y l à đ i ề u l à đ ã l à v ề l à p h ả i l à c ủ a l à k i ể m l à n h ư _ l à [sample: train_379226, WER: 191.667%, LER: 138.636%, total WER: 196.804%, total LER: 127.524%, progress (thread 0): 100%] I0223 07:59:57.078758 58 Test.cpp:317] ------ I0223 07:59:57.078769 58 Test.cpp:318] [Test transcript.lst (10 samples) in 2.76306s (actual decoding time 0.276s/sample) -- WER: 196.804, LER: 127.524]
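
(For reference, a Test run along these lines is invoked roughly as sketched below. The binary and file paths are placeholders and the flag spellings should be double-checked against your wav2letter build, so treat this as an assumption-laden sketch rather than the exact command used.)

# Viterbi/greedy evaluation with the lexicon-based word mapping turned off;
# most flags are read back from the serialized acoustic model and only overridden here.
wav2letter/build/Test \
  --am=/path/to/acoustic_model.bin \
  --test=/root/src/test_decoder/transcript.lst \
  --tokens=/path/to/tokens.txt \
  --lexicon=/path/to/lexicon.txt \
  --uselexicon=false \
  --show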

tlikhomanenko commented 3 years ago

The Test binary does not use the decoder params; it only uses uselexicon to decide whether the lexicon is used to map tokens back into words when computing WER.

Can you send your token file and lexicon (the head of the files), and also your training log? Do you have a normal WER in the training log? Can you confirm that the same commit was used for both the training and test binaries?

tranmanhdat commented 3 years ago

I updated the result with the Test binary. I use the same commit (the same container for the training and test binaries). Here is the token file: Screenshot from 2021-02-24 16-01-34, and here is the lexicon: Screenshot from 2021-02-24 16-01-45

And below is the last log from training. I wonder how WER is calculated; I don't understand why the train WER is high (~30) while the WER of each part is low (2-20). (I merged all the data for training and evaluate on the combined data.)

epoch: 71 | nupdates: 1990000 | lr: 0.400000 | lrcriterion: 0.000000 | runtime: 00:44:39 | bch(ms): 267.99 | smp(ms): 126.03 | fwd(ms): 48.85 | crit-fwd(ms): 7.76 | bwd(ms): 68.27 | optim(ms): 22.99 | loss: 5.67196 | train-TER: 19.11 | train-WER: 27.83 | /root/src/data/dat/lst/command.lst-loss: 0.50689 | /root/src/data/dat/lst/command.lst-TER: 4.39 | /root/src/data/dat/lst/command.lst-WER: 7.18 | /root/src/data/dat/lst/data_record_disabled.lst-loss: 0.65055 | /root/src/data/dat/lst/data_record_disabled.lst-TER: 1.88 | /root/src/data/dat/lst/data_record_disabled.lst-WER: 3.07 | /root/src/data/dat/lst/data_record_web.lst-loss: 1.78174 | /root/src/data/dat/lst/data_record_web.lst-TER: 6.02 | /root/src/data/dat/lst/data_record_web.lst-WER: 10.71 | /root/src/data/dat/lst/self_prepare_data.lst-loss: 3.21540 | /root/src/data/dat/lst/self_prepare_data.lst-TER: 6.47 | /root/src/data/dat/lst/self_prepare_data.lst-WER: 10.47 | /root/src/data/dat/lst/speech_zalo_data.lst-loss: 1.86290 | /root/src/data/dat/lst/speech_zalo_data.lst-TER: 9.14 | /root/src/data/dat/lst/speech_zalo_data.lst-WER: 15.07 | /root/src/data/dat/lst/vin_big_data.lst-loss: 2.08646 | /root/src/data/dat/lst/vin_big_data.lst-TER: 5.65 | /root/src/data/dat/lst/vin_big_data.lst-WER: 10.05 | /root/src/data/dat/lst/vlsp_2019_100k.lst-loss: 0.54659 | /root/src/data/dat/lst/vlsp_2019_100k.lst-TER: 1.61 | /root/src/data/dat/lst/vlsp_2019_100k.lst-WER: 2.51 | /root/src/data/dat/lst/zalo_contribute_to_vlsp.lst-loss: 1.35663 | /root/src/data/dat/lst/zalo_contribute_to_vlsp.lst-TER: 5.17 | /root/src/data/dat/lst/zalo_contribute_to_vlsp.lst-WER: 8.91 | avg-isz: 645 | avg-tsz: 065 | max-tsz: 548 | hrs: 143.53 | thrpt(sec/sec): 192.80 epoch: 72 | nupdates: 2000000 | lr: 0.400000 | lrcriterion: 0.000000 | runtime: 00:44:14 | bch(ms): 265.50 | smp(ms): 128.11 | fwd(ms): 46.93 | crit-fwd(ms): 7.42 | bwd(ms): 66.05 | optim(ms): 22.87 | loss: 5.66910 | train-TER: 18.90 | train-WER: 27.38 | /root/src/data/dat/lst/command.lst-loss: 0.51974 | /root/src/data/dat/lst/command.lst-TER: 4.47 | /root/src/data/dat/lst/command.lst-WER: 7.27 | /root/src/data/dat/lst/data_record_disabled.lst-loss: 0.65098 | /root/src/data/dat/lst/data_record_disabled.lst-TER: 1.86 | /root/src/data/dat/lst/data_record_disabled.lst-WER: 3.08 | /root/src/data/dat/lst/data_record_web.lst-loss: 1.80543 | /root/src/data/dat/lst/data_record_web.lst-TER: 6.17 | /root/src/data/dat/lst/data_record_web.lst-WER: 10.91 | /root/src/data/dat/lst/self_prepare_data.lst-loss: 3.08982 | /root/src/data/dat/lst/self_prepare_data.lst-TER: 7.24 | /root/src/data/dat/lst/self_prepare_data.lst-WER: 10.89 | /root/src/data/dat/lst/speech_zalo_data.lst-loss: 1.83693 | /root/src/data/dat/lst/speech_zalo_data.lst-TER: 9.37 | /root/src/data/dat/lst/speech_zalo_data.lst-WER: 15.22 | /root/src/data/dat/lst/vin_big_data.lst-loss: 2.06547 | /root/src/data/dat/lst/vin_big_data.lst-TER: 5.83 | /root/src/data/dat/lst/vin_big_data.lst-WER: 10.12 | /root/src/data/dat/lst/vlsp_2019_100k.lst-loss: 0.52092 | /root/src/data/dat/lst/vlsp_2019_100k.lst-TER: 1.56 | /root/src/data/dat/lst/vlsp_2019_100k.lst-WER: 2.50 | /root/src/data/dat/lst/zalo_contribute_to_vlsp.lst-loss: 1.36247 | /root/src/data/dat/lst/zalo_contribute_to_vlsp.lst-TER: 5.38 | /root/src/data/dat/lst/zalo_contribute_to_vlsp.lst-WER: 9.16 | avg-isz: 625 | avg-tsz: 063 | max-tsz: 568 | hrs: 138.93 | thrpt(sec/sec): 188.38 epoch: 72 | nupdates: 2010000 | lr: 0.400000 | lrcriterion: 0.000000 | runtime: 00:42:16 | bch(ms): 253.65 | smp(ms): 118.47 | fwd(ms): 45.94 | 
crit-fwd(ms): 7.27 | bwd(ms): 64.83 | optim(ms): 22.92 | loss: 5.72439 | train-TER: 18.47 | train-WER: 27.14 | /root/src/data/dat/lst/command.lst-loss: 0.53452 | /root/src/data/dat/lst/command.lst-TER: 5.58 | /root/src/data/dat/lst/command.lst-WER: 8.72 | /root/src/data/dat/lst/data_record_disabled.lst-loss: 0.65339 | /root/src/data/dat/lst/data_record_disabled.lst-TER: 1.88 | /root/src/data/dat/lst/data_record_disabled.lst-WER: 3.14 | /root/src/data/dat/lst/data_record_web.lst-loss: 1.83295 | /root/src/data/dat/lst/data_record_web.lst-TER: 6.25 | /root/src/data/dat/lst/data_record_web.lst-WER: 11.10 | /root/src/data/dat/lst/self_prepare_data.lst-loss: 3.11749 | /root/src/data/dat/lst/self_prepare_data.lst-TER: 6.45 | /root/src/data/dat/lst/self_prepare_data.lst-WER: 10.37 | /root/src/data/dat/lst/speech_zalo_data.lst-loss: 1.86981 | /root/src/data/dat/lst/speech_zalo_data.lst-TER: 9.28 | /root/src/data/dat/lst/speech_zalo_data.lst-WER: 15.19 | /root/src/data/dat/lst/vin_big_data.lst-loss: 2.05404 | /root/src/data/dat/lst/vin_big_data.lst-TER: 5.71 | /root/src/data/dat/lst/vin_big_data.lst-WER: 10.12 | /root/src/data/dat/lst/vlsp_2019_100k.lst-loss: 0.53748 | /root/src/data/dat/lst/vlsp_2019_100k.lst-TER: 1.53 | /root/src/data/dat/lst/vlsp_2019_100k.lst-WER: 2.44 | /root/src/data/dat/lst/zalo_contribute_to_vlsp.lst-loss: 1.34302 | /root/src/data/dat/lst/zalo_contribute_to_vlsp.lst-TER: 5.26 | /root/src/data/dat/lst/zalo_contribute_to_vlsp.lst-WER: 9.10 | avg-isz: 603 | avg-tsz: 061 | max-tsz: 454 | hrs: 134.15 | thrpt(sec/sec): 190.40 epoch: 72 | nupdates: 2020000 | lr: 0.400000 | lrcriterion: 0.000000 | runtime: 00:42:48 | bch(ms): 256.83 | smp(ms): 122.36 | fwd(ms): 45.80 | crit-fwd(ms): 7.22 | bwd(ms): 64.33 | optim(ms): 22.79 | loss: 5.65235 | train-TER: 18.52 | train-WER: 27.79 | /root/src/data/dat/lst/command.lst-loss: 0.52221 | /root/src/data/dat/lst/command.lst-TER: 4.87 | /root/src/data/dat/lst/command.lst-WER: 7.68 | /root/src/data/dat/lst/data_record_disabled.lst-loss: 0.66648 | /root/src/data/dat/lst/data_record_disabled.lst-TER: 1.86 | /root/src/data/dat/lst/data_record_disabled.lst-WER: 3.06 | /root/src/data/dat/lst/data_record_web.lst-loss: 1.82404 | /root/src/data/dat/lst/data_record_web.lst-TER: 6.13 | /root/src/data/dat/lst/data_record_web.lst-WER: 10.89 | /root/src/data/dat/lst/self_prepare_data.lst-loss: 3.21298 | /root/src/data/dat/lst/self_prepare_data.lst-TER: 6.28 | /root/src/data/dat/lst/self_prepare_data.lst-WER: 10.15 | /root/src/data/dat/lst/speech_zalo_data.lst-loss: 1.86696 | /root/src/data/dat/lst/speech_zalo_data.lst-TER: 9.12 | /root/src/data/dat/lst/speech_zalo_data.lst-WER: 14.90 | /root/src/data/dat/lst/vin_big_data.lst-loss: 2.08728 | /root/src/data/dat/lst/vin_big_data.lst-TER: 5.68 | /root/src/data/dat/lst/vin_big_data.lst-WER: 9.89 | /root/src/data/dat/lst/vlsp_2019_100k.lst-loss: 0.53488 | /root/src/data/dat/lst/vlsp_2019_100k.lst-TER: 1.52 | /root/src/data/dat/lst/vlsp_2019_100k.lst-WER: 2.40 | /root/src/data/dat/lst/zalo_contribute_to_vlsp.lst-loss: 1.38036 | /root/src/data/dat/lst/zalo_contribute_to_vlsp.lst-TER: 5.11 | /root/src/data/dat/lst/zalo_contribute_to_vlsp.lst-WER: 8.80 | avg-isz: 604 | avg-tsz: 061 | max-tsz: 386 | hrs: 134.37 | thrpt(sec/sec): 188.35 epoch: 72 | nupdates: 2030000 | lr: 0.400000 | lrcriterion: 0.000000 | runtime: 00:43:13 | bch(ms): 259.32 | smp(ms): 123.36 | fwd(ms): 46.47 | crit-fwd(ms): 7.17 | bwd(ms): 65.06 | optim(ms): 22.78 | loss: 5.66278 | train-TER: 19.22 | train-WER: 28.22 | 
/root/src/data/dat/lst/command.lst-loss: 0.50380 | /root/src/data/dat/lst/command.lst-TER: 4.56 | /root/src/data/dat/lst/command.lst-WER: 7.41 | /root/src/data/dat/lst/data_record_disabled.lst-loss: 0.63531 | /root/src/data/dat/lst/data_record_disabled.lst-TER: 1.85 | /root/src/data/dat/lst/data_record_disabled.lst-WER: 3.02 | /root/src/data/dat/lst/data_record_web.lst-loss: 1.78039 | /root/src/data/dat/lst/data_record_web.lst-TER: 6.21 | /root/src/data/dat/lst/data_record_web.lst-WER: 10.98 | /root/src/data/dat/lst/self_prepare_data.lst-loss: 3.07265 | /root/src/data/dat/lst/self_prepare_data.lst-TER: 6.50 | /root/src/data/dat/lst/self_prepare_data.lst-WER: 10.40 | /root/src/data/dat/lst/speech_zalo_data.lst-loss: 1.83221 | /root/src/data/dat/lst/speech_zalo_data.lst-TER: 9.28 | /root/src/data/dat/lst/speech_zalo_data.lst-WER: 15.06 | /root/src/data/dat/lst/vin_big_data.lst-loss: 2.02611 | /root/src/data/dat/lst/vin_big_data.lst-TER: 5.68 | /root/src/data/dat/lst/vin_big_data.lst-WER: 10.01 | /root/src/data/dat/lst/vlsp_2019_100k.lst-loss: 0.52211 | /root/src/data/dat/lst/vlsp_2019_100k.lst-TER: 1.55 | /root/src/data/dat/lst/vlsp_2019_100k.lst-WER: 2.44 | /root/src/data/dat/lst/zalo_contribute_to_vlsp.lst-loss: 1.31002 | /root/src/data/dat/lst/zalo_contribute_to_vlsp.lst-TER: 5.35 | /root/src/data/dat/lst/zalo_contribute_to_vlsp.lst-WER: 9.13 | avg-isz: 611 | avg-tsz: 062 | max-tsz: 548 | hrs: 135.95 | thrpt(sec/sec): 188.74

tlikhomanenko commented 3 years ago

During training, augmentation is used, so the train loss and WER are reported on the augmented data. That is why, if you re-evaluate on the train set separately, you get a lower WER: augmentation is not applied during evaluation. For the training data, we evaluate on augmented data whenever augmentation is used during training.
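
(For reference, since the question came up: WER is the standard edit-distance rate, WER = (substitutions + deletions + insertions) / number of reference words × 100, and TER is the same quantity computed over tokens instead of words. The per-.lst columns in the log are the validation lists evaluated without augmentation, which is presumably why they sit well below train-WER.)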

Regarding WER computation with the Test binary: could you run it on the whole /root/src/data/dat/lst/vin_big_data.lst, for example, and compare its WER with the one you see in the log?

tranmanhdat commented 3 years ago

I tested with another part instead of /root/src/data/dat/lst/vin_big_data.lst; the result shows that the WER is much higher than the WER during training (7-8 vs. 81).

Screenshot from 2021-02-25 10-18-32

tlikhomanenko commented 3 years ago

OK, with this you can see that the LER is exactly the same as in the log, so for now the problem is with the WER from the Test binary.

Did you use --uselexicon=false when running this test? If not, can you rerun with it and post the same output?

tranmanhdat commented 3 years ago

Oh, now the WER is good; I wonder why? Screenshot from 2021-02-25 11-51-31

tranmanhdat commented 3 years ago

I tested the Decoder binary with the options --uselexicon false --decodertype wrd --beamsize 100 --beamsizetoken 100 --beamthreshold 20 --lmweight 0.6 --wordscore 0.6 and the result is acceptable. So how do I change the decode flags file to get the same result? (A flags-file sketch follows the output below.)

[sample: train26463, WER: 78.9474%, LER: 80%, slice WER: 83.3333%, slice LER: 83.9779%, decoded samples (thread 0): 2] |T|: thời điểm hiện nay là vào đầu thời kỳ cao điểm dịch trùng với dịp tết nguyên đán nhiều khả năng dịch cúm gia cầm sẽ tiếp tục còn lây lan trong các ngày tới |P|: tôi có thế thời thời điểm hiện nay là là vào đầu đầu thời kỳ kỳ cao điểm điểm ạ cũng trước chiến dịch dịch vụ chung chung với dịp tết nguyên nguyên đán thì và và nhiều khả năng dịch dịch cúm gia các cầm sẽ tiếp tục bị việc hơn những không còn ở lây lang trong trong các ngày tới tới tới đến đến và này này |t|: t h ờ i | đ i ể m | h i ệ n | n a y | l à | v à o | đ ầ u | t h ờ i | k ỳ | c a o | đ i ể m | d ị c h | t r ù n g | v ớ i | d ị p | t ế t | n g u y ê n | đ á n | n h i ề u | k h ả | n ă n g | d ị c h | c ú m | g i a | c ầ m | s ẽ | t i ế p | t ụ c | c ò n | l â y | l a n | t r o n g | c á c | n g à y | t ớ i |p|: t ô i | c ó | t h ế | t h ờ i | t h ờ i | đ i ể m | h i ệ n | n a y | l à | l à | v à o | đ ầ u | đ ầ u | t h ờ i | k ỳ | k ỳ | c a o | đ i ể m | đ i ể m | ạ | c ũ n g | t r ư ớ c | c h i ế n | d ị c h | d ị c h | v ụ | c h u n g | c h u n g | v ớ i | d ị p | t ế t | n g u y ê n | n g u y ê n | đ á n | t h ì | v à | v à | n h i ề u | k h ả | n ă n g | d ị c h | d ị c h | c ú m | g i a | c á c | c ầ m | s ẽ | t i ế p | t ụ c | b ị | v i ệ c | h ơ n | n h ữ n g | k h ô n g | c ò n | ở | l â y | l a n g | t r o n g | t r o n g | c á c | n g à y | t ớ i | t ớ i | t ớ i | đ ế n | đ ế n | v à | n à y | n à y | x u [sample: train48808, WER: 105.714%, LER: 102.116%, slice WER: 105.714%, slice LER: 102.116%, decoded samples (thread 2): 1] |T|: điểm qua một vài tác phẩm đã và đang thực hiện thế giới giải trí vấn đề vốn nhạy cảm và luôn nằm trong sự thu hút của công chúng được các đạo diễn khai thác triệt để |P|: một điểm điểm điểm điểm qua một một vài tác tác tác phẩm các đã áp và đang đang được thực hiện trên thế giới giới giải trí và và vấn vấn đề vốn vấn nhạy đại cảm và và luôn luôn nằm trong trong sự sự thu hút của của công công chúng chúng và được được các đại đạo đạo diễn hai khai thác triển để để học này |t|: đ i ể m | q u a | m ộ t | v à i | t á c | p h ẩ m | đ ã | v à | đ a n g | t h ự c | h i ệ n | t h ế | g i ớ i | g i ả i | t r í | v ấ n | đ ề | v ố n | n h ạ y | c ả m | v à | l u ô n | n ằ m | t r o n g | s ự | t h u | h ú t | c ủ a | c ô n g | c h ú n g | đ ư ợ c | c á c | đ ạ o | d i ễ n | k h a i | t h á c | t r i ệ t | đ ể |p|: m ộ t | đ i ể m | đ i ể m | đ i ể m | đ i ể m | q u a | m ộ t | m ộ t | v à i | t á c | t á c | t á c | p h ẩ m | c á c | đ ã | á p | v à | đ a n g | đ a n g | đ ư ợ c | t h ự c | h i ệ n | t r ê n | t h ế | g i ớ i | g i ớ i | g i ả i | t r í | v à | v à | v ấ n | v ấ n | đ ề | v ố n | v ấ n | n h ạ y | đ ạ i | c ả m | v à | v à | l u ô n | l u ô n | n ằ m | t r o n g | t r o n g | s ự | s ự | t h u | h ú t | c ủ a | c ủ a | c ô n g | c ô n g | c h ú n g | c h ú n g | v à | đ ư ợ c | đ ư ợ c | c á c | đ ạ i | đ ạ o | đ ạ o | d i ễ n | h a i | k h a i | t h á c | t r i ể n | đ ể | đ ể | h ọ c | _ n à y | [sample: train49111, WER: 89.4737%, LER: 87.1287%, slice WER: 89.4737%, slice LER: 87.1287%, decoded samples (thread 5): 1] |T|: bạn không thể đòi hỏi bệnh phải được điều trị khỏi ngay mà cần có thời gian để loại bỏ những tế bào quái ác đó |P|: một bạn bạn không không thể thể đòi hỏi được bệnh phải được điều trị khỏi ngay ngay và và mà mà cần cần có thời gian để để loại bỏ những tế bào của ngoài ác đó đó đó đó đó và đó nhỉ |t|: b ạ n | k h ô n g | t h ể | đ ò i | h 
ỏ i | b ệ n h | p h ả i | đ ư ợ c | đ i ề u | t r ị | k h ỏ i | n g a y | m à | c ầ n | c ó | t h ờ i | g i a n | đ ể | l o ạ i | b ỏ | n h ữ n g | t ế | b à o | q u á i | á c | đ ó |p|: m ộ t | b ạ n | b ạ n | k h ô n g | k h ô n g | t h ể | t h ể | đ ò i | h ỏ i | đ ư ợ c | b ệ n h | p h ả i | đ ư ợ c | đ i ề u | t r ị | k h ỏ i | n g a y | n g a y | v à | v à | m à | m à | c ầ n | c ầ n | c ó | t h ờ i | g i a n | đ ể | đ ể | l o ạ i | b ỏ | n h ữ n g | t ế | b à o | c ủ a | n g o à i | á c | đ ó | đ ó | đ ó | đ ó | đ ó | v à | đ ó | _ n h ỉ | [sample: train48756, WER: 76.9231%, LER: 70.3704%, slice WER: 73.1707%, slice LER: 72.3502%, decoded samples (thread 6): 2] |T|: bảo châu đã tạo được bản sắc của một cô chủ không ưa cánh đàn ông mà chỉ thích kinh doanh nhà hàng |P|: bảo bảo bảo châu châu đã tạo được bản sắc của một một cô chủ chủ và không không không ưa thích thì cánh đàn ông mà mà chỉ thích thích kinh doanh nhà hàng hàng thì và và và và |t|: b ả o | c h â u | đ ã | t ạ o | đ ư ợ c | b ả n | s ắ c | c ủ a | m ộ t | c ô | c h ủ | k h ô n g | ư a | c á n h | đ à n | ô n g | m à | c h ỉ | t h í c h | k i n h | d o a n h | n h à | h à n g |p|: b ả o | b ả o | b ả o | c h â u | c h â u | đ ã | t ạ o | đ ư ợ c | b ả n | s ắ c | c ủ a | m ộ t | m ộ t | c ô | c h ủ | c h ủ | v à | k h ô n g | k h ô n g | k h ô n g | ư a | t h í c h | t h ì | c á n h | đ à n | ô n g | m à | m à | c h ỉ | t h í c h | t h í c h | k i n h | d o a n h | n h à | h à n g | h à n g | t h ì | v à | v à | v à | v à | [sample: train_48188, WER: 78.2609%, LER: 80%, slice WER: 61.1111%, slice LER: 60%, decoded samples (thread 4): 3] I0225 05:06:20.489257 420 Decode.cpp:721] ------ [Decode /root/src/test_decoder/transcript.lst (14 samples) in 3.85672s (actual decoding time 0.484s/sample) -- WER: 75.082, LER: 76.1635]
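
(For reference, the same options can be put into a gflags-style flags file, one --flag=value per line, and passed to the Decoder binary via --flagsfile. The file name and paths below are placeholders, so this is a sketch of the shape of the file rather than a verified config.)

# decode_beam.cfg (hypothetical name; adjust paths to your setup)
--am=/path/to/acoustic_model.bin
--tokens=/path/to/tokens.txt
--lexicon=/path/to/lexicon.txt
--lm=/path/to/lm.bin
--test=/root/src/test_decoder/transcript.lst
--uselexicon=false
--decodertype=wrd
--beamsize=100
--beamsizetoken=100
--beamthreshold=20
--lmweight=0.6
--wordscore=0.6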

tlikhomanenko commented 3 years ago

In the Test binary we use the lexicon to map the sequence of tokens into words to compute WER if uselexicon is true. Otherwise, the wordseparator is used to build words from tokens. So if your words are not in the lexicon, the LER will be low while the WER will be high.

For beam search, if decodertype is word, the uselexicon flag is ignored and the lexicon restricts the words you can infer. It is better to include your LM and train+dev data words in the lexicon for the final decoding (we often use the top 200k words if you have a large LM corpus).
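
(To illustrate the lexicon/wordseparator point: a wav2letter lexicon is plain text with one entry per line, the word followed by its space-separated token spelling. A sketch, assuming character tokens and "_" as the wordseparator token, as the outputs above suggest; whether the separator appears in the spelling depends on how your tokens were built.)

xanh    x a n h _
nước    n ư ớ c _
trăm    t r ă m _

With uselexicon=true the predicted token sequence is mapped back to words through these entries, so out-of-lexicon words inflate WER even when LER looks fine; with uselexicon=false words are simply cut at the separator token.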

The Test binary gives you an upper bound on the WER you can get, since it is the greedy path. Beam search with an LM can improve this value, but you need to do a parameter search. For details on how we usually do the parameter optimization, please check the paper https://arxiv.org/abs/1911.08460, Appendix A.
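
(In practice that search is usually a sweep over --lmweight and --wordscore with everything else fixed; a rough sketch below, reusing the hypothetical flags file from earlier. The value ranges are arbitrary, and you would pick the pair giving the best dev-set WER.)

# sweep LM weight and word score, keeping the other decoder settings fixed
for lmweight in 0.2 0.5 1.0 1.5 2.0; do
  for wordscore in -1.0 -0.5 0.0 0.5 1.0; do
    wav2letter/build/Decoder --flagsfile decode_beam.cfg \
      --lmweight=$lmweight --wordscore=$wordscore \
      > decode_lm${lmweight}_ws${wordscore}.log 2>&1
  done
done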

Closing issue for now, but feel free to reach out again if there are any problems.

tranmanhdat commented 3 years ago

Thank you for your help. When I decode I have a problem: the train WER is good, but when decoding with the trained model the WER is bad, with many duplicated words predicted. Below is the training log.

epoch: 44 | nupdates: 121000 | lr: 0.400000 | lrcriterion: 0.000000 | runtime: 00:03:59 | bch(ms): 239.59 | smp(ms): 1.17 | fwd(ms): 33.94 | crit-fwd(ms): 2.77 | bwd(ms): 188.98 | optim(ms): 15.48 | loss: 3.79079 | train-TER: 14.49 | train-WER: 21.37 | /root/src/data/dat/lst/vlsp_2019_100k.lst-loss: 0.48067 | /root/src/data/dat/lst/vlsp_2019_100k.lst-TER: 1.62 | /root/src/data/dat/lst/vlsp_2019_100k.lst-WER: 2.56 | avg-isz: 465 | avg-tsz: 042 | max-tsz: 079 | hrs: 41.40 | thrpt(sec/sec): 622.04 epoch: 44 | nupdates: 122000 | lr: 0.400000 | lrcriterion: 0.000000 | runtime: 00:03:57 | bch(ms): 237.49 | smp(ms): 0.45 | fwd(ms): 32.97 | crit-fwd(ms): 2.60 | bwd(ms): 188.12 | optim(ms): 15.50 | loss: 3.89036 | train-TER: 14.77 | train-WER: 21.74 | /root/src/data/dat/lst/vlsp_2019_100k.lst-loss: 0.46747 | /root/src/data/dat/lst/vlsp_2019_100k.lst-TER: 1.55 | /root/src/data/dat/lst/vlsp_2019_100k.lst-WER: 2.48 | avg-isz: 454 | avg-tsz: 041 | max-tsz: 080 | hrs: 40.37 | thrpt(sec/sec): 611.90 epoch: 44 | nupdates: 123000 | lr: 0.400000 | lrcriterion: 0.000000 | runtime: 00:04:01 | bch(ms): 241.55 | smp(ms): 0.45 | fwd(ms): 34.18 | crit-fwd(ms): 2.68 | bwd(ms): 191.17 | optim(ms): 15.32 | loss: 3.76995 | train-TER: 14.20 | train-WER: 21.16 | /root/src/data/dat/lst/vlsp_2019_100k.lst-loss: 0.47600 | /root/src/data/dat/lst/vlsp_2019_100k.lst-TER: 1.63 | /root/src/data/dat/lst/vlsp_2019_100k.lst-WER: 2.55 | avg-isz: 477 | avg-tsz: 043 | max-tsz: 079 | hrs: 42.41 | thrpt(sec/sec): 631.99 epoch: 45 | nupdates: 124000 | lr: 0.400000 | lrcriterion: 0.000000 | runtime: 00:03:59 | bch(ms): 239.16 | smp(ms): 0.62 | fwd(ms): 33.44 | crit-fwd(ms): 2.65 | bwd(ms): 188.99 | optim(ms): 15.47 | loss: 3.78246 | train-TER: 17.46 | train-WER: 24.14 | /root/src/data/dat/lst/vlsp_2019_100k.lst-loss: 0.46924 | /root/src/data/dat/lst/vlsp_2019_100k.lst-TER: 1.58 | /root/src/data/dat/lst/vlsp_2019_100k.lst-WER: 2.55 | avg-isz: 458 | avg-tsz: 042 | max-tsz: 080 | hrs: 40.76 | thrpt(sec/sec): 613.47 epoch: 45 | nupdates: 125000 | lr: 0.400000 | lrcriterion: 0.000000 | runtime: 00:03:58 | bch(ms): 238.58 | smp(ms): 0.55 | fwd(ms): 33.48 | crit-fwd(ms): 2.68 | bwd(ms): 188.58 | optim(ms): 15.55 | loss: 3.79294 | train-TER: 17.61 | train-WER: 25.35 | /root/src/data/dat/lst/vlsp_2019_100k.lst-loss: 0.45276 | /root/src/data/dat/lst/vlsp_2019_100k.lst-TER: 1.53 | /root/src/data/dat/lst/vlsp_2019_100k.lst-WER: 2.42 | avg-isz: 461 | avg-tsz: 042 | max-tsz: 080 | hrs: 41.01 | thrpt(sec/sec): 618.75 epoch: 46 | nupdates: 126000 | lr: 0.400000 | lrcriterion: 0.000000 | runtime: 00:04:02 | bch(ms): 242.96 | smp(ms): 0.67 | fwd(ms): 34.33 | crit-fwd(ms): 2.72 | bwd(ms): 192.03 | optim(ms): 15.33 | loss: 3.69775 | train-TER: 16.78 | train-WER: 25.13 | /root/src/data/dat/lst/vlsp_2019_100k.lst-loss: 0.46400 | /root/src/data/dat/lst/vlsp_2019_100k.lst-TER: 1.74 | /root/src/data/dat/lst/vlsp_2019_100k.lst-WER: 2.62 | avg-isz: 475 | avg-tsz: 043 | max-tsz: 079 | hrs: 42.26 | thrpt(sec/sec): 626.23 epoch: 46 | nupdates: 127000 | lr: 0.400000 | lrcriterion: 0.000000 | runtime: 00:03:57 | bch(ms): 237.62 | smp(ms): 0.53 | fwd(ms): 33.14 | crit-fwd(ms): 2.64 | bwd(ms): 188.17 | optim(ms): 15.55 | loss: 3.79142 | train-TER: 10.85 | train-WER: 16.66 | /root/src/data/dat/lst/vlsp_2019_100k.lst-loss: 0.45343 | /root/src/data/dat/lst/vlsp_2019_100k.lst-TER: 1.51 | /root/src/data/dat/lst/vlsp_2019_100k.lst-WER: 2.39 | avg-isz: 456 | avg-tsz: 041 | max-tsz: 080 | hrs: 40.57 | thrpt(sec/sec): 614.64

And the decode result; note that I reverted a commit and rebuilt, as mentioned in the issues about duplicated words. Screenshot from 2021-03-04 08-57-55

And the options for decoding: --uselexicon true --decodertype wrd --beamsize 100 --beamsize 100 --beamthreshold 20 --lmweight 0.6 --wordscore 0.6 --eosscore 0 --silscore 0 --unkscore 0 --smearing max

tlikhomanenko commented 3 years ago

The fix you reverted is only related to online inference, so please put it back. For the duplication problem you need this fix: https://github.com/facebookresearch/flashlight/commit/9ef0d588b3fbcae11b65c12b689787964b1e8b90, see issue https://github.com/facebookresearch/flashlight/issues/265, which is exactly what you have here. (The fix is in the current fl master, so you can either git pull or apply the commit I mentioned.)
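
(If you do not want to move to master entirely, one way to pick up just that fix on your current checkout is a cherry-pick; a sketch, assuming your flashlight clone still has the upstream remote configured:)

cd flashlight
git fetch origin
git cherry-pick 9ef0d588b3fbcae11b65c12b689787964b1e8b90   # or simply: git pull origin master
# then rebuild flashlight and the wav2letter binaries against it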

tranmanhdat commented 3 years ago

I think that fix could lead to another error. For example, my language (Vietnamese) has words like "từ từ", "ngang ngang", "xanh xanh", ... and the model would then drop the repeated word from its prediction.

tlikhomanenko commented 3 years ago

Do you have the word separator between them, i.e. "xanh _ xanh"? If yes, it will be processed correctly. Otherwise, I am curious to see your lexicon. Please try the fix, and if you see some discrepancy between the word and token predictions, post an example here; we need to see more use cases to know for sure how to fix the problem!

tranmanhdat commented 3 years ago

I mean, I have the word "_xanh", but in a sentence it appears as "...acb_xanh_xanh_xyz...". So when predicting, does it predict "_xanh_xanh...", or does the word appear only once?

tlikhomanenko commented 3 years ago

No, no. If you have a repetition of two words in the sentence, it will work fine. The issue was caused by wrongly processing the path when following the lexicon. So please use the commit I pointed to; it should fix your issue and not introduce another one.
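
(Concretely, with "_" as the word separator, a decoded token sequence like the one below still maps back to both repeated words; the bug was only in how the lexicon path was followed during the beam search.)

tokens: x a n h _ x a n h _
words:  xanh xanh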

tranmanhdat commented 3 years ago

Thanks for your help.