flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Data preparation for Language Modeling #674


samin9796 commented 4 years ago

I have some questions regarding language models.

  1. I want to build a character-level LM. Previously I built a word-level LM using KenLM; for that I created a text file with one sentence per line, and there are thousands of such lines in that file. How do I prepare the dataset for a char-level LM? Is there anything else that is different from building a word-level LM?

  2. To prepare data for ConvLM, this command is supposed to be used:

source prepare_fairseq_data.sh [DATA_DST] [MODEL_DST] [FAIRSEQ PATH]

Can you please elaborate on what DATA_DST means here? Is it the path of the input text file that I prepared for the word-level LM?

  3. I am working with another language, not English. Does convert_convlm.sh work for my case as well?

Thank you.

tlikhomanenko commented 4 years ago

Hi @samin9796,

> 1. I want to build a character-level LM. Previously I built a word-level LM using KenLM; for that I created a text file with one sentence per line, and there are thousands of such lines in that file. How do I prepare the dataset for a char-level LM? Is there anything else that is different from building a word-level LM?

You additionally need to split each word into a sequence of characters separated by spaces. For example, if you had the sentence "hello world" to build a word-level LM, then to build a letter-based LM you need "h e l l o | w o r l d |": the space in KenLM is treated as a separator between characters, and we add the special token | to mark word boundaries. You can also do this a bit differently, e.g. "_h e l l o _w o r l d", where you distinguish word-initial letters from letters inside a word. How to define the character vocabulary is up to you.
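
A minimal sketch of this conversion (the file names here are placeholders), assuming a word-level corpus with one sentence per line:

# Sketch: turn a word-level LM corpus (one sentence per line) into the
# character-level format above, where each word is spelled out letter by
# letter and "|" marks a word boundary.
with open("corpus.word.txt", encoding="utf-8") as fin, \
     open("corpus.char.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        words = line.strip().split()
        # "hello world" -> "h e l l o | w o r l d |"
        fout.write(" ".join(" ".join(w) + " |" for w in words) + "\n")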

> 2. To prepare data for ConvLM, this command is supposed to be used:
>
> source prepare_fairseq_data.sh [DATA_DST] [MODEL_DST] [FAIRSEQ PATH]
>
> Can you please elaborate on what DATA_DST means here? Is it the path of the input text file that I prepared for the word-level LM?

Here is what you need to call: https://github.com/facebookresearch/wav2letter/blob/master/recipes/models/lexicon_free/librispeech/prepare_fairseq_data.sh#L9 - this prepares the word-based data (the next command in the script is for characters; only the data differs, and you can use the same parameters, with or without --padding-factor 1). $FAIRSEQ is the path to the cloned fairseq repo, $DATA_DST is the prefix of the text corpus, and $MODEL_DST is the path where the processed data will be stored. In your case, for another language, just call:

python "path/to/fairseq/preprocess.py" --only-source \
--trainpref "path/to/your/text/corpus" \
--validpref "path/to/your/valid/text/corpus" \
--testpref "path/to/your/test/text/corpus" \
--destdir "path/to/the/folder/to/store/binarized/fairseq/data" \
--thresholdsrc 10 \
--padding-factor 1 \
--workers 16

Here thresholdsrc means that words occurring in the text <=10 times will be mapped to unk.
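
If it helps, a quick sketch (the corpus path is a placeholder) to see how many words a given threshold would map to unk:

# Sketch: count how many vocabulary words occur <= threshold times in the
# corpus and would therefore be mapped to unk by --thresholdsrc.
from collections import Counter

threshold = 10
counts = Counter()
with open("path/to/your/text/corpus", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())
rare = sum(1 for c in counts.values() if c <= threshold)
print(rare, "of", len(counts), "words would be mapped to unk")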

> 3. I am working with another language, not English. Does convert_convlm.sh work for my case as well?

For converting your model, call save_pytorch_model.py in the same way as in https://github.com/facebookresearch/wav2letter/blob/master/recipes/models/lexicon_free/librispeech/convert_convlm.sh, and then you need to fix the vocabulary size (in case you use the same fairseq architecture): 221452 is the vocab size + 4, and likewise for the character LM 40 = vocab size + 4 (the 4 is for the special fairseq tokens). So if you have a 100k vocab during training, then you need to put 100004. If you modify the arch, then you also need to fix https://github.com/facebookresearch/wav2letter/blob/master/recipes/models/lexicon_free/librispeech/lm_librispeech_convlm_word_14B.arch. Info about the parameters for converting the model is here: https://github.com/facebookresearch/wav2letter/blob/master/recipes/models/utilities/convlm_serializer/Serialize.cpp#L22. Just let me know how you plan to change the fairseq training command, and then I can tell you which command to use.
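
For example, a small sketch (the dict.txt path is a placeholder) to get the number you need to put into the conversion:

# Sketch: vocab size to use for conversion = number of entries in fairseq's
# dict.txt + 4 special fairseq tokens.
with open("path/to/binarized/fairseq/data/dict.txt", encoding="utf-8") as f:
    n_words = sum(1 for line in f if line.strip())
print("vocab size for conversion:", n_words + 4)  # e.g. 100000 entries -> 100004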

samin9796 commented 4 years ago

Hi @tlikhomanenko

This is the command I used for training:

python3 train.py /data/ahnaf/fairseq_folder/models/first_try --save-dir /data/ahnaf/fairseq_folder/models/decoder/convlm_models/word_14B --task=language_modeling --arch=fconv_lm --fp16 --max-epoch=48 --optimizer=nag --lr=0.5 --lr-scheduler=fixed --decoder-embed-dim=128 --clip-norm=0.1 --decoder-layers='[(512, 5)] + [(128, 1, 0), (128, 5, 0), (512, 1, 3)] * 3 + [(512, 1, 0), (512, 5, 0), (1024, 1, 3)] * 3 + [(512, 1, 0), (512, 5, 0), (1024, 1, 3)] * 6 + [(1024, 1, 0), (1024, 5, 0), (2048, 1, 3)]' --dropout=0.1 --weight-decay=1e-07 --max-tokens=1024 --tokens-per-sample=1024 --sample-break-mode=none --criterion=adaptive_loss --adaptive-softmax-cutoff='100,500,2000' --seed=42 --log-format=json --log-interval=100 --save-interval-updates=10000 --keep-interval-updates=10 --ddp-backend="no_c10d" --distributed-world-size=1 > /data/ahnaf/fairseq_folder/models/decoder/convlm_models/word_14B/train.log

After the pre-processing step, a dict.txt file was created and there are 49218 words in that file. So the total is 49222? What should I do next?

tlikhomanenko commented 4 years ago

@samin9796

1) Run saving of the PyTorch model into a txt file:

python3 wav2letter/recipes/models/utilities/convlm_serializer/save_pytorch_model.py /data/ahnaf/fairseq_folder/models/decoder/convlm_models/word_14B/checkpoint_best.pt /data/ahnaf/fairseq_folder/models/decoder/convlm_models/word_14B/checkpoint_best.weights

2) Then convert to the bin format with:

wav2letter/build/recipes/models/utilities/convlm_serializer/SerializeConvLM \
  model.arch \
  /fairseq_folder/models/decoder/convlm_models/word_14B/checkpoint_best.weights \
  /fairseq_folder/models/decoder/convlm_models/word_14B/checkpoint_best.bin \
  49222 0 1 100,500,2000 2048
// The last numbers mean: total vocab size with special tokens included, adaptive softmax used for training, save adaptive softmax as an activation (we don't need the criterion), the adaptive-softmax-cutoff, and the embedding size before the adaptive softmax.

I modified my arch with respect to your changes:

--decoder-layers='[(512, 5)] + [(128, 1, 0), (128, 5, 0), (512, 1, 3)] * 3 + [(512, 1, 0), (512, 5, 0), (1024, 1, 3)] * 3 + [(512, 1, 0), (512, 5, 0), (1024, 1, 3)] * 6 + [(1024, 1, 0), (1024, 5, 0), (2048, 1, 3)]' 

So create model.arch, put the arch below into it, and use it in the command to convert to bin.

# input in format (t, b, 1, 1)
V -1 0 1 1
# after emb (c, t, b, 1)
E 128 NLABEL
DO 0.1
# after fc (c, t, b, 1)
WN 0 L 128 512
RO 1 3 0 2
# shape (t, 1, c, b)
# res conv1
RES 3 1 1
DO 0.1
WN 3 AC 512 1024 5 1 -1 0
GLU 2
SKIP 0 4 0.7071
# res conv2
RO 2 0 3 1
# shape (c, t, b, 1)
RES 11 1 3
DO 0.1
WN 0 L 512 256
GLU 0
RO 1 3 0 2
DO 0.1
WN 3 AC 128 256 5 1 -1 0
GLU 2
RO 2 0 3 1
DO 0.1
WN 0 L 128 1024
GLU 0
SKIP 0 12 0.7071
# res conv3-1
RES 11 1 1
DO 0.1
WN 0 L 512 1024
GLU 0
RO 1 3 0 2
DO 0.1
WN 3 AC 512 1024 5 1 -1 0
GLU 2
RO 2 0 3 1
DO 0.1
WN 0 L 512 2048
GLU 0
SKIPL 0 12 1 0.7071
WN 0 L 512 1024
# res conv3-(2,3)
RES 11 1 2
DO 0.1
WN 0 L 1024 1024
GLU 0
RO 1 3 0 2
DO 0.1
WN 3 AC 512 1024 5 1 -1 0
GLU 2
RO 2 0 3 1
DO 0.1
WN 0 L 512 2048
GLU 0
SKIP 0 12 0.7071
# res conv4-(2-6)
RES 11 1 5
DO 0.1
WN 0 L 1024 1024
GLU 0
RO 1 3 0 2
DO 0.1
WN 3 AC 512 1024 5 1 -1 0
GLU 2
RO 2 0 3 1
DO 0.1
WN 0 L 512 2048
GLU 0
SKIP 0 12 0.7071
# res conv5
RES 11 1 1
DO 0.1
WN 0 L 1024 2048
GLU 0
RO 1 3 0 2
DO 0.1
WN 3 AC 1024 2048 5 1 -1 0
GLU 2
RO 2 0 3 1
DO 0.1
WN 0 L 1024 4096
GLU 0
SKIPL 0 12 1 0.7071
WN 0 L 1024 2048
# shape (c, t, b, 1)
samin9796 commented 4 years ago

@tlikhomanenko I created a model.arch file as you mentioned and then ran the command. It shows the following error:

F0605 06:28:50.317502 32772 Utils.cpp:291] mismatch between the number of parameters in the arch file and the weight file 137 model states vs 121 nn params + 7 criterion params
Check failure stack trace:
    @ 0x7f59e8b8781d google::LogMessage::Fail()
    @ 0x7f59e8b899d1 google::LogMessage::SendToLog()
    @ 0x7f59e8b8734d google::LogMessage::Flush()
    @ 0x7f59e8b8a389 google::LogMessageFatal::~LogMessageFatal()
    @ 0x559647adf411 loadConvLM()
    @ 0x559647a24fe5 main
    @ 0x7f59c2921b97 __libc_start_main
    @ 0x559647a7534a _start
Aborted (core dumped)

This is the train.log file after training with fairseq: train.log

tlikhomanenko commented 4 years ago

Sorry, the arch was a bit wrong. Could you try this one?

# input in format (t, b, 1, 1)
V -1 0 1 1
# after emb (c, t, b, 1)
E 128 NLABEL
DO 0.1
# after fc (c, t, b, 1)
WN 0 L 128 512
RO 1 3 0 2
# shape (t, 1, c, b)
# res conv1
RES 3 1 1
DO 0.1
WN 3 AC 512 1024 5 1 -1 0
GLU 2
SKIP 0 4 0.7071
# res conv2
RO 2 0 3 1
# shape (c, t, b, 1)
RES 11 1 3
DO 0.1
WN 0 L 512 256
GLU 0
RO 1 3 0 2
DO 0.1
WN 3 AC 128 256 5 1 -1 0
GLU 2
RO 2 0 3 1
DO 0.1
WN 0 L 128 1024
GLU 0
SKIP 0 12 0.7071
# res conv3-1
RES 11 1 1
DO 0.1
WN 0 L 512 1024
GLU 0
RO 1 3 0 2
DO 0.1
WN 3 AC 512 1024 5 1 -1 0
GLU 2
RO 2 0 3 1
DO 0.1
WN 0 L 512 2048
GLU 0
SKIPL 0 12 1 0.7071
WN 0 L 512 1024
# res conv3-(2,3)
RES 11 1 2
DO 0.1
WN 0 L 1024 1024
GLU 0
RO 1 3 0 2
DO 0.1
WN 3 AC 512 1024 5 1 -1 0
GLU 2
RO 2 0 3 1
DO 0.1
WN 0 L 512 2048
GLU 0
SKIP 0 12 0.7071
# res conv4-(1-6)
RES 11 1 6
DO 0.1
WN 0 L 1024 1024
GLU 0
RO 1 3 0 2
DO 0.1
WN 3 AC 512 1024 5 1 -1 0
GLU 2
RO 2 0 3 1
DO 0.1
WN 0 L 512 2048
GLU 0
SKIP 0 12 0.7071
# res conv5
RES 11 1 1
DO 0.1
WN 0 L 1024 2048
GLU 0
RO 1 3 0 2
DO 0.1
WN 3 AC 1024 2048 5 1 -1 0
GLU 2
RO 2 0 3 1
DO 0.1
WN 0 L 1024 4096
GLU 0
SKIPL 0 12 1 0.7071
WN 0 L 1024 2048
# shape (c, t, b, 1)
samin9796 commented 4 years ago

@tlikhomanenko Thanks a lot. It worked. However, when I try to decode 6500 samples (20 hours), it gets stuck. This is from the terminal:

Skipping unknown entry: 'কম্পিটিশনে'
Skipping unknown entry: 'যুগাযুগ'
Skipping unknown entry: 'ভজ্য'
Skipping unknown entry: 'সেইফ'
Skipping unknown entry: '৩..১'
Skipping unknown entry: 'শুতেই'
Skipping unknown entry: 'যুযুৎসু'
Skipping unknown entry: 'টিটিতে'
Skipping unknown entry: 'হাই্ডোজেন'
Skipping unknown entry: 'বেম্বা'
Skipping unknown entry: 'ধুলা-ময়লা'
Skipping unknown entry: 'প্রাচীরগুলি'
Skipping unknown entry: 'আঁঠালো'
I0606 00:36:57.688994 336 Decode.cpp:247] [Decoder] LM constructed.
I0606 00:37:04.938005 336 Decode.cpp:271] [Decoder] Trie planted.
I0606 00:37:04.955726 336 Decode.cpp:283] [Decoder] Trie smeared.
I0606 00:37:05.138589 336 W2lListFilesDataset.cpp:141] 6533 files found.
I0606 00:37:05.138716 336 Utils.cpp:102] Filtered 0/6533 samples
I0606 00:37:05.139793 336 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 6533

Then I waited for almost 1 hour but nothing changed. I used 8 GPUs for decoding. If I reduce the test set to only 70 samples, then no problem occurs during decoding. This is the decode.cfg:

--am=/data/ahnaf/wav2letter/dataset_prep/all_models/wer_19/here/001_model_validation.lst.bin
--lexicon=/data/ahnaf/wav2letter/dataset_prep/lexicon.txt
--datadir=/data/ahnaf/wav2letter/dataset_prep/
--tokensdir=/data/ahnaf/wav2letter/dataset_prep/
--tokens=tokens_normal.txt
--test=test.lst
--decodertype=wrd
--lm=/data/ahnaf/fairseq_folder/models/decoder/convlm_models/word_14B/checkpoint_best.bin
--lmweight=1.5670620758180553
--wordscore=2.4475127898444944
--beamsize=50
--beamthreshold=10
--silweight=-1.1415291788128683
--nthread_decoder=8
--smearing=max
--show=true
--showletters=true
--lmtype=convlm
--lm_memory=300
--lm_vocab=/data/ahnaf/fairseq_folder/models/first_try/dict.vocab
--sclite=/data/ahnaf/wav2letter/dataset_prep/all_models/failure/

tlikhomanenko commented 4 years ago

What is the output for 70 samples (I mean the decoding time that should be printed on the screen)? How many tokens do you have in the tokens file? Which criterion did you train with: CTC or s2s?

samin9796 commented 4 years ago

@tlikhomanenko For 70 samples, the actual decoding time is 270.871 sec and it took 3.8 sec per sample with nthread_decoder=1. I have 178 tokens in the tokens file. I used the CTC criterion to train the acoustic model.

tlikhomanenko commented 4 years ago

Looks fine. You can try to set beamsize=10, beamsizetoken=10 and check on the full dataset. There could be some delay before output is added to the log. When you say it gets stuck, for how long does the log go without being updated?

samin9796 commented 4 years ago

I waited for more or less an hour but saw nothing at the terminal. I only checked the log file once or twice. One more thing: I first built a small language model to see how it goes. Now that I am building a large LM with a vocabulary of more than 400000 words, it is taking 13 hours to train with only one GPU and it is still in epoch 1. I couldn't increase the number of GPUs because I have power supply unit issues when the arch is large. Could you please elaborate on this argument so that I can modify the arch myself and try out different models?

--decoder-layers='[(512, 5)] + [(128, 1, 0), (128, 5, 0), (512, 1, 3)] * 3 + [(512, 1, 0), (512, 5, 0), (1024, 1, 3)] * 3 + [(512, 1, 0), (512, 5, 0), (1024, 1, 3)] * 6 + [(1024, 1, 0), (1024, 5, 0), (2048, 1, 3)]'

tlikhomanenko commented 4 years ago

Still strange. You can try to run with beamthreshold=5 or even less to see what happens. Do you have very large inputs? I would like to see the histogram of input sizes (it seems like a problem with some particular input - here you could try to add additional logging to debug which sample is slow).

About the LM: [..] * 3 means a block that is repeated 3 times, and (out_features, kernel, residual) is a conv block with that many output features, that kernel size, and which residual connection it has (the layer at [-residual]); see https://github.com/pytorch/fairseq/blob/89a2a0ccdebd0943b2878ff2150f8a5f836cc4aa/fairseq/models/fconv.py#L389.

> --decoder-layers='[(512, 5)] + [(128, 1, 0), (128, 5, 0), (512, 1, 3)] * 3 + [(512, 1, 0), (512, 5, 0), (1024, 1, 3)] * 3 + [(512, 1, 0), (512, 5, 0), (1024, 1, 3)] * 6 + [(1024, 1, 0), (1024, 5, 0), (2048, 1, 3)]'

so here all blocks are residual.

I would start by reducing the number of layers, e.g. [..] * 3 -> [..] * 1, and then the feature map sizes. The sketch below shows how the --decoder-layers expression expands.
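
A small sketch of the expansion (as far as I remember, fairseq evaluates this string as a Python expression):

# Sketch: the --decoder-layers value builds a list of
# (out_channels, kernel_size[, residual]) conv blocks; "[...] * 3" simply
# repeats the bracketed blocks 3 times.
layers = eval(
    "[(512, 5)]"
    " + [(128, 1, 0), (128, 5, 0), (512, 1, 3)] * 3"
    " + [(512, 1, 0), (512, 5, 0), (1024, 1, 3)] * 3"
    " + [(512, 1, 0), (512, 5, 0), (1024, 1, 3)] * 6"
    " + [(1024, 1, 0), (1024, 5, 0), (2048, 1, 3)]"
)
print(len(layers), "conv blocks")  # 1 + 3*3 + 3*3 + 3*6 + 3 = 40
for out_channels, kernel, *residual in layers:
    print(out_channels, kernel, residual)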

samin9796 commented 4 years ago

@tlikhomanenko
I have only 10-15 sec inputs. I am going to test with beamthreshold=5 and will let you know.

About LM: I trained a new LM using the following command:

python3 train.py /data/ahnaf/fairseq_folder/models/full --save-dir /data/ahnaf/fairseq_folder/models/decoder/convlm_models/word_16B --task=language_modeling --arch=fconv_lm --max-epoch=48 --optimizer=nag --lr=0.4 --lr-scheduler=fixed --decoder-embed-dim=128 --clip-norm=0.1 --decoder-layers='[(512, 5)] + [(128, 1, 0), (128, 5, 0), (512, 1, 3)] * 3 + [(512, 1, 0), (512, 5, 0), (1024, 1, 3)] * 2 + [(512, 1, 0), (512, 5, 0), (1024, 1, 3)] * 1 + [(1024, 1, 0), (1024, 5, 0), (2048, 1, 3)]' --dropout=0.1 --weight-decay=1e-07 --max-tokens=512 --tokens-per-sample=512 --sample-break-mode=none --criterion=adaptive_loss --adaptive-softmax-cutoff='10000,50000,200000' --seed=42 --log-format=json --log-interval=1000 --save-interval-updates=10000 --keep-interval-updates=-1 --no-epoch-checkpoints --ddp-backend="no_c10d" --distributed-world-size=6 > /data/ahnaf/fairseq_folder/models/decoder/convlm_models/word_16B/train.log

> I would start by reducing the number of layers, e.g. [..] * 3 -> [..] * 1, and then the feature map sizes.

Is this the same arch as in your above comment? I am still having difficulty writing the model.arch file. Could you please modify the previous arch file? Also, what was the perplexity or loss value of the language model presented in your paper? What loss or perplexity value is needed to get a competitive WER?

tlikhomanenko commented 4 years ago

@samin9796

> Is this the same arch as in your above comment? I am still having difficulty writing the model.arch file. Could you please modify the previous arch file?

Yes, this is one of the archs you can try. Or this one: "--decoder-layers='[(512, 5)] + [(128, 1, 0), (128, 5, 0), (512, 1, 3)] * 1 + [(512, 1, 0), (512, 5, 0), (1024, 1, 3)] * 1 + [(512, 1, 0), (512, 5, 0), (1024, 1, 3)] * 3 + [(1024, 1, 0), (1024, 5, 0), (2048, 1, 3)]'".

For your arch, the w2l arch file will look like the one below (please try to understand it and ask me more questions about the arch; then you can simply adapt it to all your experiments. First, check what I changed for your latest modification):

# input in format (t, b, 1, 1)
V -1 0 1 1
# after emb (c, t, b, 1)
E 128 NLABEL
DO 0.1
# after fc (c, t, b, 1)
WN 0 L 128 512
RO 1 3 0 2
# shape (t, 1, c, b)
# res conv1
RES 3 1 1
DO 0.1
WN 3 AC 512 1024 5 1 -1 0
GLU 2
SKIP 0 4 0.7071
# res conv2
RO 2 0 3 1
# shape (c, t, b, 1)
RES 11 1 3
DO 0.1
WN 0 L 512 256
GLU 0
RO 1 3 0 2
DO 0.1
WN 3 AC 128 256 5 1 -1 0
GLU 2
RO 2 0 3 1
DO 0.1
WN 0 L 128 1024
GLU 0
SKIP 0 12 0.7071
# res conv3-1
RES 11 1 1
DO 0.1
WN 0 L 512 1024
GLU 0
RO 1 3 0 2
DO 0.1
WN 3 AC 512 1024 5 1 -1 0
GLU 2
RO 2 0 3 1
DO 0.1
WN 0 L 512 2048
GLU 0
SKIPL 0 12 1 0.7071
WN 0 L 512 1024
# res conv3-(2)
RES 11 1 1
DO 0.1
WN 0 L 1024 1024
GLU 0
RO 1 3 0 2
DO 0.1
WN 3 AC 512 1024 5 1 -1 0
GLU 2
RO 2 0 3 1
DO 0.1
WN 0 L 512 2048
GLU 0
SKIP 0 12 0.7071
# res conv4-(1)
RES 11 1 1
DO 0.1
WN 0 L 1024 1024
GLU 0
RO 1 3 0 2
DO 0.1
WN 3 AC 512 1024 5 1 -1 0
GLU 2
RO 2 0 3 1
DO 0.1
WN 0 L 512 2048
GLU 0
SKIP 0 12 0.7071
# res conv5
RES 11 1 1
DO 0.1
WN 0 L 1024 2048
GLU 0
RO 1 3 0 2
DO 0.1
WN 3 AC 1024 2048 5 1 -1 0
GLU 2
RO 2 0 3 1
DO 0.1
WN 0 L 1024 4096
GLU 0
SKIPL 0 12 1 0.7071
WN 0 L 1024 2048
# shape (c, t, b, 1)

> Also, what was the perplexity or loss value of the language model presented in your paper? What loss or perplexity value is needed to get a competitive WER?

You can find the perplexity in Table 2 of https://research.fb.com/wp-content/uploads/2019/09/Who-Needs-Words-Lexicon-Free-Speech-Recognition.pdf. Compared to an ngram whose perplexity is around 150, you can expect around 60 with a GCNN on Librispeech. So you can try to get half the perplexity of the 4-gram LM on your data (or at least significantly better than the 4-gram). In my experience, improving perplexity mostly leads to improving the WER; for example, you can have a look at Figure 2 in https://arxiv.org/pdf/1812.06864.pdf, which studied this dependence. If you are already at a very low perplexity for the chosen architecture, such that it is hard to improve further without tricks or a lot of time spent tuning parameters, then improving perplexity by 2-5 points will probably not lead to a WER improvement.
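
If you want a baseline number to compare against, something like this computes the 4-gram perplexity on your validation text (assuming the kenlm Python bindings are installed; the paths are placeholders):

# Sketch: perplexity of the KenLM 4-gram on a held-out set, to compare with
# the convLM's validation perplexity. kenlm's score() returns a log10 prob.
import kenlm

model = kenlm.Model("path/to/4gram.bin")
log10_prob, n_tokens = 0.0, 0
with open("path/to/your/valid/text/corpus", encoding="utf-8") as f:
    for line in f:
        sent = line.strip()
        if not sent:
            continue
        log10_prob += model.score(sent, bos=True, eos=True)
        n_tokens += len(sent.split()) + 1  # words + </s>
print("4-gram perplexity:", 10 ** (-log10_prob / n_tokens))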

samin9796 commented 4 years ago

@tlikhomanenko Using ConvLM I get 12.3% WER, while the 4-gram LM results in 6.3% WER. Without applying any LM, I get 20% WER. My training ppl is around 70 and validation ppl is 125. I ran 5 epochs and the ppl has not been decreasing significantly for a long time. The vocab size is 716903. I tried out different lmweights and found that lmweight affects the result significantly, but 12.3% WER seems to be the best that I can get. Here is the decode.cfg:

--am=/data/ahnaf/wav2letter/dataset_prep/all_models/different_layers/layers_18/001_model_validation.lst.bin
--lexicon=/data/ahnaf/wav2letter/dataset_prep/lexicon.txt
--datadir=/data/ahnaf/wav2letter/dataset_prep/
--tokensdir=/data/ahnaf/wav2letter/dataset_prep/
--tokens=tokens_normal.txt
--test=small_test.lst
--decodertype=wrd
--lm=/data/ahnaf/fairseq_folder/models/decoder/convlm_models/word_16B/checkpoint_best.bin
--lmweight=0.4
--wordscore=0.461496
--beamsize=30
--beamsizetoken=30
--beamthreshold=10
--eosscore=2
--silweight=-1.231907
--nthread_decoder=6
--smearing=max
--show=true
--showletters=true
--lmtype=convlm
--lm_memory=3000
--lm_vocab=/data/ahnaf/fairseq_folder/models/full/dict.vocab

When I try to decode 6500 samples, the problem is still there even if I set beamthreshold to 5. I waited for 15 minutes but saw nothing in the log file.

This is the input file:

test1 /data/ahnaf/speech_dataset/our_dataset/part1/wav1/a0/TNF01_oishi_p_1_S_1-01.wav 12390.0 ০১কথোপকথন তুই নাকি আরব দেশে যাচ্ছিস বাদশা কিছু বলে না বোন তার হাত ধরে আবার বলে সত্যি বাদশাকে ভাত বেড়ে দিয়ে বলে তলে তলে কখন তুই ঠিক করলি
test2 /data/ahnaf/speech_dataset/our_dataset/part1/wav1/a1/TNF01_OISHI_p_2_10_S_1-09.wav 13470.0 গত মঙ্গলবার বিকেলে সোনালি রোদ্দুরে একঝাঁক মেয়ের একাগ্র অনুশীলন দেখেই বোঝা যায় এটাই সেই মাঠ যেখানে ফুটবলার হওয়ার প্রথম দীক্ষা পেয়েছে তহুরা মারিয়া শামসুন্নাহার মার্জিয়ারা
test3 /data/ahnaf/speech_dataset/our_dataset/part1/wav1/a1/TNF01_OISHI_p_2_10_S_1-30.wav 6560.0 নাম প্রকাশে অনিচ্ছুক এক শিক্ষক বললেন সবাই সুনামের ভাগীদার হতে চায় এ থেকেই তৈরি হয়েছে বিভেদ দ্বন্দ্ব
test4 /data/ahnaf/speech_dataset/our_dataset/part1/wav1/a1/TNF01_OISHI_p_2_10_S_1-31.wav 10880.0 এই দ্বন্দ্বে যে ফুটবলাররা ক্ষতিগ্রস্ত হচ্ছে পিছিয়ে পড়ছে কলসিন্দুরের ফুটবল সেটির প্রমাণ গত দুই বছরে বঙ্গমাতা ফজিলাতুন্নেছা গোল্ড কাপে তাদের ফলাফল ভালো ছিল না
test5 /data/ahnaf/speech_dataset/our_dataset/part1/wav1/a1/TNF01_OISHI_p_2_10_S_1-36.wav 9980.0 তিনি বলেন মফিজ উদ্দিনকে প্রধান কোচ ও জুয়েল মিয়াকে সহকারী কোচ করা হয়েছিল কিন্তু এরপর তিনি দীর্ঘদিন দেশের বাইরে থাকায় সর্বশেষ অবস্থা জানতে পারেননি

tlikhomanenko commented 4 years ago

Several questions:

tlikhomanenko commented 4 years ago

And yes, all decoder parameters are sensitive when switching from ngram to convlm, e.g. lmweight, wordscore, eosscore (for s2s), silweight (for ASG).

Possibly the problem with decoding is exactly the vocab size you used to train the convlm. Can you train a model with, say, a 200k vocab (choosing the top words) and try to decode with it? A sketch of one way to restrict the vocab is below.
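
A sketch (paths are placeholders) that keeps the top 200k words from fairseq's frequency-sorted dict.txt; if I remember correctly, the truncated dict can then be passed back to preprocess.py via --srcdict:

# Sketch: keep the 200k most frequent words. fairseq's dict.txt is
# "word count" per line, sorted by count, so truncating it should be enough.
TOP_K = 200000
with open("path/to/full/dict.txt", encoding="utf-8") as fin, \
     open("path/to/dict.200k.txt", "w", encoding="utf-8") as fout:
    for i, line in enumerate(fin):
        if i >= TOP_K:
            break
        fout.write(line)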

samin9796 commented 4 years ago

@tlikhomanenko Thank you for your reply. I trained the AM with CTC and trained a word-based LM. I don't have answers to the other questions yet, but I will let you know. I can train a convlm with a 200k vocab and see how it goes.

tlikhomanenko commented 4 years ago

Then set --silweight=0 and optimise the wordscore in the range (-3, 3) as well as the lmweight. They will be totally different from the ngram model decoding.

samin9796 commented 3 years ago

@tlikhomanenko Hi! I trained a convLM with a 273k vocab and found that the n-gram is still better by a large margin. Then I experimented with a smaller text corpus (only the text files from my speech corpus, 70k vocab) and built an n-gram LM and a convlm. This time as well, the WER using the n-gram is around 3.63%, while using the convlm I get around 4.2% WER. Finally, I used the same beam size, beam size token, and beam threshold for both language models and set smearing to none, and only then did the convlm outperform the n-gram LM. So, to get a better result from the convlm, I had to reduce those parameters for the n-gram LM. Is it possible that for certain languages an n-gram LM is better than a convLM?

tlikhomanenko commented 3 years ago

For me this is really very suspicious. There could be some corner cases; e.g. for Librivox models, ngram and convlm give close results. Could you send the perplexity of your models, for both the ngram and the convlm?

Do you have a word-piece AM or a letter-based one?

We often use a beam half the size for convlm, and a smaller beamthreshold and beamsizetoken, for efficiency, and in all cases we got better results with the convlm, but the parameter search was extensive. Could you post your decoder settings for both cases, the ranges of your random search, and how many tries you did?