Closed: davyuan closed this issue 2 years ago
Hi David,
What doesn't seem to work with the ngram? Are you encountering errors when running the train_kenlm.py script? Or is it that the trained ngram doesn't reduce the WER during decoding?
Best, Maxime
Hi Maxime,
It's the WER that I'm not seeing any reduction in. I'm attaching the full logs below.
(nemo) david@DL1:~/dev/data2/NeMo/scripts/asr_language_modeling/ngram_lm$ python train_kenlm.py --nemo_model_file stt_en_conformer_ctc_medium.nemo --train_file MyCorpus.txt --kenlm_bin_path ../../../../kenlm/build/bin --kenlm_model_file ./isd.arpa --ngram_length 6
[NeMo W 2022-07-24 12:13:40 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo I 2022-07-24 12:13:47 train_kenlm:87] Loading nemo model 'stt_en_conformer_ctc_medium.nemo' ...
Using eos_token, but it is not set yet.
Using bos_token, but it is not set yet.
[NeMo I 2022-07-24 12:13:50 mixins:166] Tokenizer AutoTokenizer initialized with 128 tokens
[NeMo W 2022-07-24 12:13:50 modelPT:142] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader. Train config :
    manifest_filepath: /data/librispeech_withsp2/manifests/librispeech_train_withSP.json
    sample_rate: 16000
    batch_size: 64
    shuffle: true
    num_workers: 8
    pin_memory: true
    use_start_end_token: false
    trim_silence: false
    max_duration: 16.7
    min_duration: 0.1
    shuffle_n: 2048
    is_tarred: false
    tarred_audio_filepaths: /data/asr_set_1.4/train_no_appen/train_no_appen__OP_0..1023CL.tar
[NeMo W 2022-07-24 12:13:50 modelPT:149] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). Validation config : manifest_filepath:
[NeMo W 2022-07-24 12:13:50 modelPT:155] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s). Test config : manifest_filepath:
[NeMo I 2022-07-24 12:13:50 features:200] PADDING: 0
[NeMo I 2022-07-24 12:13:50 save_restore_connector:243] Model EncDecCTCModelBPE was successfully restored from /home/david/dev/data2/NeMo/scripts/asr_language_modeling/ngram_lm/stt_en_conformer_ctc_medium.nemo.
[NeMo I 2022-07-24 12:13:50 train_kenlm:105] Encoding the train file 'Mycorpus.txt' ...
Read 300000 lines: : 304005 lines [00:00, 1069207.14 lines/s]
Chunking 304005 rows into 37.1100 tasks (each chunk contains 8192 elements)
[Parallel(n_jobs=-2)]: Using backend LokyBackend with 11 concurrent workers.
[NeMo W 2022-07-24 12:13:54 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2022-07-24 12:13:54 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2022-07-24 12:13:54 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2022-07-24 12:13:54 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2022-07-24 12:13:55 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2022-07-24 12:13:55 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2022-07-24 12:13:55 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2022-07-24 12:13:55 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2022-07-24 12:13:55 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2022-07-24 12:13:55 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2022-07-24 12:13:55 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[Parallel(n_jobs=-2)]: Done 3 tasks | elapsed: 13.2s
[Parallel(n_jobs=-2)]: Done 10 tasks | elapsed: 14.2s
[Parallel(n_jobs=-2)]: Done 21 out of 38 | elapsed: 23.6s remaining: 19.1s
[Parallel(n_jobs=-2)]: Done 25 out of 38 | elapsed: 28.1s remaining: 14.6s
[Parallel(n_jobs=-2)]: Done 29 out of 38 | elapsed: 30.1s remaining: 9.4s
[Parallel(n_jobs=-2)]: Done 33 out of 38 | elapsed: 32.0s remaining: 4.9s
[Parallel(n_jobs=-2)]: Done 38 out of 38 | elapsed: 34.0s finished
Chunk : 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 38/38 [00:00<00:00, 59.52 chunks/s]
Finished writing 38 chunks to ./isd.arpa.tmp.txt. Current chunk index = 38
=== 1/5 Counting and sorting n-grams ===
Reading /home/david/dev/data2/NeMo/scripts/asr_language_modeling/ngram_lm/isd.arpa.tmp.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Unigram tokens 27851496 types 127
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:1524 2:1236037376 3:2317570048 4:3708112128 5:5407663616 6:7416224256
Substituting fallback discounts for order 0: D1=0.5 D2=1 D3+=1.5
Statistics:
1 127 D1=0.5 D2=1 D3+=1.5
2 11395 D1=0.446337 D2=0.961741 D3+=1.56052
3 295320 D1=0.513996 D2=0.998088 D3+=1.56068
4 2249977 D1=0.613141 D2=1.10596 D3+=1.59228
5 6909117 D1=0.733855 D2=1.18769 D3+=1.55571
6 12486979 D1=0.750966 D2=1.15849 D3+=1.46866
Memory estimate for binary LM:
type     MB
probing 431 assuming -p 1.5
probing 485 assuming -r models -p 1.5
trie    162 without quantization
trie     75 assuming -q 8 -b 8 quantization
trie    144 assuming -a 22 array pointer compression
trie     57 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:1524 2:182320 3:5906400 4:53999448 5:193455276 6:399583328
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:1524 2:182320 3:5906400 4:53999448 5:193455276 6:399583328
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Name:lmplz VmPeak:19807740 kB VmRSS:12340 kB RSSMax:3515160 kB user:18.5131 sys:4.38854 CPU:22.9017 real:18.23
[NeMo I 2022-07-24 12:14:44 train_kenlm:145] Running binary_build command
../../../../kenlm/build/bin/lmplz -o 6 --text ./isd.arpa.tmp.txt --arpa ./isd.arpa.tmp.arpa --discount_fallback
Reading ./isd.arpa.tmp.arpa ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Identifying n-grams omitted by SRI ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Writing trie ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
SUCCESS
[NeMo I 2022-07-24 12:14:58 train_kenlm:158] Deleted the temporary encoded training file './isd.arpa.tmp.txt'.
[NeMo I 2022-07-24 12:14:58 train_kenlm:160] Deleted the arpa file './isd.arpa.tmp.arpa'.
Hi David,
I think it's because you are not using the correct tokenizer to encode the corpus. The script seems to be using the default tokenizer of the stt_en_conformer_ctc_medium.nemo model (the one "initialized with 128 tokens" in your log). Once trained, the ngram statistics will not correspond to the Efficient Conformer outputs, which use a different tokenizer.
To resolve this, I loaded the tokenizer directly with sentencepiece and modified the tokenize_str function in kenlm_utils.py.
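For reference, the change is roughly the following (a minimal sketch, not the exact content of the attached files; the TOKEN_OFFSET value and the default path are assumptions, so adapt them to whatever kenlm_utils.py uses):

import sentencepiece as spm

# Assumed offset used when mapping token ids to single characters so that
# KenLM treats every subword as a separate "word"; check kenlm_utils.py.
TOKEN_OFFSET = 100

def tokenize_str(texts, tokenizer_path="LibriSpeech_bpe_256.model"):
    # Load the Efficient Conformer sentencepiece model directly instead of
    # the tokenizer bundled in the .nemo checkpoint.
    sp = spm.SentencePieceProcessor(model_file=tokenizer_path)
    encoded = []
    for line in texts:
        ids = sp.encode(line, out_type=int)
        encoded.append("".join(chr(i + TOKEN_OFFSET) for i in ids))
    return encoded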
Here are the modified files:
train_kenlm.txt kenlm_utils.txt
You simply need to replace '--nemo_model_file' with '--tokenizer_path' to load the Efficient Conformer tokenizer (LibriSpeech_bpe_256.model).
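With your command from above, the call would then look something like this (same arguments otherwise):

python train_kenlm.py --tokenizer_path LibriSpeech_bpe_256.model --train_file MyCorpus.txt --kenlm_bin_path ../../../../kenlm/build/bin --kenlm_model_file ./isd.arpa --ngram_length 6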
That indeed turns out to be the case! Thanks for sharing your customized train_kenlm, kenlm_utils, etc. The WER now shows a significant improvement.
thanks! David
Nice, happy to hear that!
Best, Maxime
Hello,
Thanks for sharing your ngram model; it works fine with the LibriSpeech corpus in my tests. Now I want to take it one step further and train your model on my own corpus, so I'd like to train my own ngram as well. I'm using the NeMo script, but it doesn't seem to work. Here is the command I'm using:
python train_kenlm.py --nemo_model_file stt_en_conformer_ctc_medium.nemo --train_file MyCorpus.txt --kenlm_bin_path ../../../../kenlm/build/bin --kenlm_model_file ./isd.arpa --ngram_length 6
Here, stt_en_conformer_ctc_medium.nemo is the pretrained model downloaded with NeMo, and MyCorpus.txt is the collection of ground-truth text. I didn't encode the corpus myself, as I assume the script does that.
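For what it's worth, my understanding of what the script does with --nemo_model_file is roughly this (just my own sketch of the expected behaviour, not the actual NeMo code; the offset value is a guess):

import nemo.collections.asr as nemo_asr

# Restore the pretrained checkpoint and reuse the tokenizer bundled inside it.
model = nemo_asr.models.ASRModel.restore_from("stt_en_conformer_ctc_medium.nemo")
tokenizer = model.tokenizer

# Guessed offset; the script maps token ids to characters with some fixed
# offset so that KenLM sees one "word" per subword token.
TOKEN_OFFSET = 100

def encode_line(line: str) -> str:
    ids = tokenizer.text_to_ids(line)
    return "".join(chr(i + TOKEN_OFFSET) for i in ids)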
What am I missing?
thanks! David