alphacep / vosk-android-demo

Offline speech recognition for Android with Vosk library.
Apache License 2.0

Tuning an existing model #110

Closed. OscarVanL closed this issue 3 years ago.

OscarVanL commented 3 years ago

Hi! Firstly, thank you for this demo; it works very well.

I hope to create an ASR model for a type of atypical speech: British-accented patients who speak using an electrolarynx.

My plan was to do the following:

  1. Take an existing trained American-accent English model
  2. Tune this with a British speakers dataset, a subset of LibriTTS, to capture the British accent.
  3. Tune this again, this time on a smaller dataset of recordings like the sample above. The size of this dataset is approx. 5 hours.

I have been exploring the mini_librispeech Kaldi example, which you say is the proper way to train a compatible model for Vosk, but I am not sure how to tune an existing model.

Do you have any recommendations or scripts for tuning an existing Kaldi model so that it is compatible with Vosk?

Thank you!

nshmyrev commented 3 years ago

In general, you can simply train from scratch instead of all that fine-tuning. The American model is going to be useless for you anyway.

From your sample I didn't understand: is there a single voice in the end, or do all the patients sound different, just a bit metallic?

OscarVanL commented 3 years ago

Some unique characteristics of each patient will come through, but mostly they will have a metallic voice.

The reason I don't just train from scratch on the patient data is that I feared there would be too little data (5 hours), so I thought that with tuning I could get better generalisation. Given that, what would you suggest?

Perhaps I could combine some of my "normal" British speakers dataset with the patient speech and train on that, but I feared the speech might be too different for this to be effective.

nshmyrev commented 3 years ago

Since the model is tiny it doesn't need that much data; something like 200-300 hours is enough. I would get a British dataset first (take one from YouTube, or filter the TED-LIUM or LibriSpeech speakers), where something like 100 hours is enough, then mix in your specific data and run augmentation.

You may need to spend more time on augmentation to create voices similar to your samples; that will help. For training you can take the mini_librispeech recipe.
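
For example, combining a prepared British data directory with the patient data and applying basic speed perturbation could look roughly like this (a sketch; the data directory names are placeholders, and electrolarynx speech will probably need additional, more targeted augmentation on top of this):

```bash
# Combine two prepared Kaldi data directories (each must already contain
# wav.scp, text, utt2spk, spk2utt, ...):
utils/combine_data.sh data/train_combined data/train_british data/train_patients

# Basic augmentation: 3-way speed perturbation (0.9x / 1.0x / 1.1x), as the
# standard chain recipes do, followed by a cleanup of the resulting dir:
utils/data/perturb_data_dir_speed_3way.sh data/train_combined data/train_combined_sp
utils/fix_data_dir.sh data/train_combined_sp
```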

OscarVanL commented 3 years ago

I have already filtered LibriTTS (LibriSpeech) to get about 30 hours of British speakers; there is also the ARU speech corpus.

It looks like I will still need a bit more data though. Thanks for the suggestions.

On the models documentation page it says: _Latest minilibrispeech uses online cmvn which we do not support yet. Use this script to train nnet3 model._

Is this still necessary?

nshmyrev commented 3 years ago

Is this still necessary?

Yes

OscarVanL commented 3 years ago

Is this still necessary?

Yes

Thanks, I presume this just means changing this line:

https://github.com/kaldi-asr/kaldi/blob/403de7da872529122b5f64cc8dec55410223b171/egs/mini_librispeech/s5/run.sh#L144

To call run_tdnn_1j.sh?
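
Concretely, I assume the edit would look something like this (a sketch; the surrounding arguments, if any, depend on the Kaldi revision you check out):

```bash
# egs/mini_librispeech/s5/run.sh, chain training stage:
# call the 1j tuning recipe (nnet3 chain without online cmvn) directly
# instead of the default chain training script.
local/chain/tuning/run_tdnn_1j.sh
```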

nshmyrev commented 3 years ago

Correct.

OscarVanL commented 3 years ago

I've trained the mini_librispeech example with the above change.

I want to test this model on the app to make sure it works.

I need to build the model structure and am following the instructions here, but there are some ambiguities about which files I need to use.

I've listed all the ambiguous files from my mini_librispeech folder and bolded the ones I think I should use. Could you please advise which I'm supposed to use? :)

am/final.mdl

conf/mfcc.conf

conf/model.conf

ivector/final.dubm

ivector/final.ie

ivector/final.mat

ivector/splice.conf (splice.conf is not present in the ivector_extractor folder.)

ivector/global_cmvn.stats

ivector/online_cmvn.conf

graph/phones/word_boundary.int (I have no idea...)

graph/HCLG.fst

graph/HCLr.fst

graph/Gr.fst

graph/phones.txt (no idea, there are so many...)

graph/words.txt (Also no idea...)

rescore/G.carpa

rescore/G.fst (no idea)

Thank you!

nshmyrev commented 3 years ago

```bash
#!/bin/bash

# Path to the directory where the model will be placed.

dir="${1:-$HOME}"
echo "$dir"
if [ ! -d "$dir/model" ]
then
  mkdir -p "$dir/model/ivector"
fi

cp exp/chain/tdnn1*_sp_online/ivector_extractor/final.dubm "$dir/model/ivector"
cp exp/chain/tdnn1*_sp_online/ivector_extractor/final.ie "$dir/model/ivector"
cp exp/chain/tdnn1*_sp_online/ivector_extractor/final.mat "$dir/model/ivector"
cp exp/chain/tdnn1*_sp_online/ivector_extractor/global_cmvn.stats "$dir/model/ivector"
cp exp/chain/tdnn1*_sp_online/ivector_extractor/online_cmvn.conf "$dir/model/ivector"
cp exp/chain/tdnn1*_sp_online/ivector_extractor/splice_opts "$dir/model/ivector"
cp exp/chain/tdnn1*_sp_online/conf/splice.conf "$dir/model/ivector"

cp exp/chain/tdnn1*_sp_online/conf/mfcc.conf "$dir/model"
cp exp/chain/tdnn1*_sp_online/final.mdl "$dir/model"
cp exp/chain/tree_sp/graph_tgsmall/HCLG.fst "$dir/model"
cp exp/chain/tree_sp/graph_tgsmall/words.txt "$dir/model"
cp exp/chain/tree_sp/graph_tgsmall/phones/word_boundary.int "$dir/model"
```

OscarVanL commented 3 years ago

Thank you!

OscarVanL commented 3 years ago

Great, I tried the trained model on your Android example app and it works. I'm excited to train the full model next :D Thanks for the help!

nshmyrev commented 3 years ago

Ok, you can read https://alphacephei.com/nsh/2020/03/27/lookahead.html about Gr.fst

OscarVanL commented 3 years ago

Hi, I've adapted the mini_librispeech run.sh to use my own dataset, but when it reaches our modified training line (local/chain/tuning/run_tdnn_1j.sh) it fails.

run.sh output from run_tdnn_1j.sh ``` local/chain/tuning/run_tdnn_1j.sh: creating neural net configs using the xconfig parser tree-info exp/chain/tree_sp/tree steps/nnet3/xconfig_to_configs.py --xconfig-file exp/chain/tdnn1j_sp/configs/network.xconfig --config-dir exp/chain/tdnn1j_sp/configs/ nnet3-init exp/chain/tdnn1j_sp/configs//ref.config exp/chain/tdnn1j_sp/configs//ref.raw LOG (nnet3-init[5.5.851~1-088e9]:main():nnet3-init.cc:80) Initialized raw neural net and wrote it to exp/chain/tdnn1j_sp/configs//ref.raw nnet3-info exp/chain/tdnn1j_sp/configs//ref.raw nnet3-init exp/chain/tdnn1j_sp/configs//ref.config exp/chain/tdnn1j_sp/configs//ref.raw LOG (nnet3-init[5.5.851~1-088e9]:main():nnet3-init.cc:80) Initialized raw neural net and wrote it to exp/chain/tdnn1j_sp/configs//ref.raw nnet3-info exp/chain/tdnn1j_sp/configs//ref.raw 2020-12-20 03:04:33,132 [steps/nnet3/chain/train.py:35 - - INFO ] Starting chain model trainer (train.py) 2020-12-20 03:04:33,136 [steps/nnet3/chain/train.py:281 - train - INFO ] Arguments for the experiment {'alignment_subsampling_factor': 3, 'apply_deriv_weights': False, 'backstitch_training_interval': 1, 'backstitch_training_scale': 0.0, 'chunk_left_context': 0, 'chunk_left_context_initial': -1, 'chunk_right_context': 0, 'chunk_right_context_final': -1, 'chunk_width': '140,100,160', 'cleanup': True, 'cmvn_opts': '--norm-means=false --norm-vars=false', 'combine_sum_to_one_penalty': 0.0, 'command': 'run.pl --mem 4G', 'compute_per_dim_accuracy': False, 'deriv_truncate_margin': None, 'dir': 'exp/chain/tdnn1j_sp', 'do_final_combination': True, 'dropout_schedule': None, 'egs_command': None, 'egs_dir': None, 'egs_nj': 0, 'egs_opts': '--frames-overlap-per-eg 0', 'egs_stage': 0, 'email': None, 'exit_stage': None, 'feat_dir': 'data/Laryngectomy_ASR_Train_sp_hires', 'final_effective_lrate': 0.0002, 'frame_subsampling_factor': 3, 'frames_per_iter': 3000000, 'initial_effective_lrate': 0.002, 'input_model': None, 'l2_regularize': 0.0, 'lat_dir': 'exp/chain/tri3b_Laryngectomy_ASR_Train_sp_lats', 'leaky_hmm_coefficient': 0.1, 'left_deriv_truncate': None, 'left_tolerance': 5, 'lm_opts': '--num-extra-lm-states=2000', 'max_lda_jobs': 10, 'max_models_combine': 20, 'max_objective_evaluations': 30, 'max_param_change': 2.0, 'momentum': 0.0, 'num_chunk_per_minibatch': '128,64', 'num_epochs': 20.0, 'num_jobs_final': 5, 'num_jobs_initial': 2, 'num_jobs_step': 1, 'online_ivector_dir': 'exp/nnet3/ivectors_Laryngectomy_ASR_Train_sp_hires', 'preserve_model_interval': 100, 'presoftmax_prior_scale_power': -0.25, 'proportional_shrink': 0.0, 'rand_prune': 4.0, 'remove_egs': True, 'reporting_interval': 0.1, 'right_tolerance': 5, 'samples_per_iter': 400000, 'shrink_saturation_threshold': 0.4, 'shrink_value': 1.0, 'shuffle_buffer_size': 5000, 'srand': 0, 'stage': -10, 'train_opts': ['--optimization.memory-compression-level=2'], 'tree_dir': 'exp/chain/tree_sp', 'use_gpu': 'wait', 'xent_regularize': 0.1} 2020-12-20 03:04:36,507 [steps/nnet3/chain/train.py:338 - train - INFO ] Creating phone language-model 2020-12-20 03:04:41,872 [steps/nnet3/chain/train.py:343 - train - INFO ] Creating denominator FST copy-transition-model exp/chain/tree_sp/final.mdl exp/chain/tdnn1j_sp/0.trans_mdl LOG (copy-transition-model[5.5.851~1-088e9]:main():copy-transition-model.cc:62) Copied transition model. 
2020-12-20 03:04:42,640 [steps/nnet3/chain/train.py:379 - train - INFO ] Generating egs steps/nnet3/chain/get_egs.sh --frames-overlap-per-eg 0 --cmd run.pl --mem 4G --cmvn-opts --norm-means=false --norm-vars=false --online-ivector-dir exp/nnet3/ivectors_Laryngectomy_ASR_Train_sp_hires --left-context 30 --right-context 30 --left-context-initial -1 --right-context-final -1 --left-tolerance 5 --right-tolerance 5 --frame-subsampling-factor 3 --alignment-subsampling-factor 3 --stage 0 --frames-per-iter 3000000 --frames-per-eg 140,100,160 --srand 0 data/Laryngectomy_ASR_Train_sp_hires exp/chain/tdnn1j_sp exp/chain/tri3b_Laryngectomy_ASR_Train_sp_lats exp/chain/tdnn1j_sp/egs steps/nnet3/chain/get_egs.sh: File data/Laryngectomy_ASR_Train_sp_hires/utt2uniq exists, so ensuring the hold-out set includes all perturbed versions of the same source utterance. steps/nnet3/chain/get_egs.sh: Holding out 300 utterances in validation set and 300 in training diagnostic set, out of total 197415. steps/nnet3/chain/get_egs.sh: creating egs. To ensure they are not deleted later you can do: touch exp/chain/tdnn1j_sp/egs/.nodelete steps/nnet3/chain/get_egs.sh: feature type is raw, with 'apply-cmvn' tree-info exp/chain/tdnn1j_sp/tree feat-to-dim scp:exp/nnet3/ivectors_Laryngectomy_ASR_Train_sp_hires/ivector_online.scp - steps/nnet3/chain/get_egs.sh: working out number of frames of training data steps/nnet3/chain/get_egs.sh: working out feature dim steps/nnet3/chain/get_egs.sh: creating 39 archives, each with 18399 egs, with steps/nnet3/chain/get_egs.sh: 140,100,160 labels per example, and (left,right) context = (30,30) steps/nnet3/chain/get_egs.sh: Getting validation and training subset examples in background. steps/nnet3/chain/get_egs.sh: Generating training examples on disk steps/nnet3/chain/get_egs.sh: Getting subsets of validation examples for diagnostics and combination. 
steps/nnet3/chain/get_egs.sh: recombining and shuffling order of archives on disk run.pl: 16 / 39 failed, log is in exp/chain/tdnn1j_sp/egs/log/shuffle.*.log Traceback (most recent call last): File "steps/nnet3/chain/train.py", line 644, in main train(args, run_opts) File "steps/nnet3/chain/train.py", line 405, in train stage=args.egs_stage) File "steps/libs/nnet3/train/chain_objf/acoustic_model.py", line 118, in generate_chain_egs egs_opts=egs_opts if egs_opts is not None else '')) File "steps/libs/common.py", line 129, in execute_command p.returncode, command)) Exception: Command exited with status 1: steps/nnet3/chain/get_egs.sh --frames-overlap-per-eg 0 --cmd "run.pl --mem 4G" --cmvn-opts "--norm-means=false --norm-vars=false" --online-ivector-dir "exp/nnet3/ivectors_Laryngectomy_ASR_Train_sp_hires" --left-context 30 --right-context 30 --left-context-initial -1 --right-context-final -1 --left-tolerance '5' --right-tolerance '5' --frame-subsampling-factor 3 --alignment-subsampling-factor 3 --stage 0 --frames-per-iter 3000000 --frames-per-eg 140,100,160 --srand 0 data/Laryngectomy_ASR_Train_sp_hires exp/chain/tdnn1j_sp exp/chain/tri3b_Laryngectomy_ASR_Train_sp_lats exp/chain/tdnn1j_sp/egs steps/nnet3/chain/train.py --stage=-10 --cmd=run.pl --mem 4G --feat.online-ivector-dir=exp/nnet3/ivectors_Laryngectomy_ASR_Train_sp_hires --feat.cmvn-opts=--norm-means=false --norm-vars=false --chain.xent-regularize 0.1 --chain.leaky-hmm-coefficient=0.1 --chain.l2-regularize=0.0 --chain.apply-deriv-weights=false --chain.lm-opts=--num-extra-lm-states=2000 --trainer.add-option=--optimization.memory-compression-level=2 --trainer.srand=0 --trainer.max-param-change=2.0 --trainer.num-epochs=20 --trainer.frames-per-iter=3000000 --trainer.optimization.num-jobs-initial=2 --trainer.optimization.num-jobs-final=5 --trainer.optimization.initial-effective-lrate=0.002 --trainer.optimization.final-effective-lrate=0.0002 --trainer.num-chunk-per-minibatch=128,64 --egs.chunk-width=140,100,160 --egs.dir= --egs.opts=--frames-overlap-per-eg 0 --cleanup.remove-egs=true --use-gpu=wait --reporting.email= --feat-dir=data/Laryngectomy_ASR_Train_sp_hires --tree-dir=exp/chain/tree_sp --lat-dir=exp/chain/tri3b_Laryngectomy_ASR_Train_sp_lats --dir=exp/chain/tdnn1j_sp ['steps/nnet3/chain/train.py', '--stage=-10', '--cmd=run.pl --mem 4G', '--feat.online-ivector-dir=exp/nnet3/ivectors_Laryngectomy_ASR_Train_sp_hires', '--feat.cmvn-opts=--norm-means=false --norm-vars=false', '--chain.xent-regularize', '0.1', '--chain.leaky-hmm-coefficient=0.1', '--chain.l2-regularize=0.0', '--chain.apply-deriv-weights=false', '--chain.lm-opts=--num-extra-lm-states=2000', '--trainer.add-option=--optimization.memory-compression-level=2', '--trainer.srand=0', '--trainer.max-param-change=2.0', '--trainer.num-epochs=20', '--trainer.frames-per-iter=3000000', '--trainer.optimization.num-jobs-initial=2', '--trainer.optimization.num-jobs-final=5', '--trainer.optimization.initial-effective-lrate=0.002', '--trainer.optimization.final-effective-lrate=0.0002', '--trainer.num-chunk-per-minibatch=128,64', '--egs.chunk-width=140,100,160', '--egs.dir=', '--egs.opts=--frames-overlap-per-eg 0', '--cleanup.remove-egs=true', '--use-gpu=wait', '--reporting.email=', '--feat-dir=data/Laryngectomy_ASR_Train_sp_hires', '--tree-dir=exp/chain/tree_sp', '--lat-dir=exp/chain/tri3b_Laryngectomy_ASR_Train_sp_lats', '--dir=exp/chain/tdnn1j_sp'] ```

Here are some of the logs from exp/chain/tdnn1j_sp/egs/log/shuffle.*.log:

```
nnet3-chain-normalize-egs --normalization-fst-scale=1.0 exp/chain/tdnn1j_sp/normalization.fst 'ark:cat  exp/chain/td$
nnet3-chain-shuffle-egs --srand=20 ark:- ark:exp/chain/tdnn1j_sp/egs/cegs.20.ark
ERROR: CompactFst write failed: <unknown>
ERROR (nnet3-chain-normalize-egs[5.5.851~1-088e9]:WriteToken():io-funcs.cc:141) Write failure in WriteToken.
bash: line 1: 14346 Aborted                 (core dumped) nnet3-chain-normalize-egs --normalization-fst-scale=1.0 exp/chain/tdnn1j_sp/normalization.fst "ark:cat  exp/chain/tdnn1j_sp/egs/cegs_orig.1.20.ark exp/chain/tdnn1j_sp/egs/cegs_ors_orig.75.20.ark|" ark:-
     14348 Killed                  | nnet3-chain-shuffle-egs --srand=$[20+0] ark:- ark:exp/chain/tdnn1j_sp/egs/cegs.20.ark
nnet3-chain-normalize-egs --normalization-fst-scale=1.0 exp/chain/tdnn1j_sp/normalization.fst 'ark:cat  exp/chain/tdnn1j_sp/egs/cegs_orig.1.25.ark exp/chain/tdnn1j_sp/egs/cegs_orig.2.25.ark exp/chain/tdnn1j_sp/egs/cegs_orig.3.25.ark exp$
nnet3-chain-shuffle-egs --srand=25 ark:- ark:exp/chain/tdnn1j_sp/egs/cegs.25.ark
ERROR (nnet3-chain-normalize-egs[5.5.851~1-088e9]:Write():compressed-matrix.cc:563) Error writing compressed matrix to stream.
nnet3-chain-normalize-egs --normalization-fst-scale=1.0 exp/chain/tdnn1j_sp/normalization.fst 'ark:cat  exp/chain/tdnn1j_sp/egs/cegs_orig.1.29.ark exp/chain/tdnn1j_sp/egs/cegs_orig.2.29.ark exp/chain/tdnn1j_sp/egs/cegs_orig.3.29.ark exp$
nnet3-chain-shuffle-egs --srand=29 ark:- ark:exp/chain/tdnn1j_sp/egs/cegs.29.ark
WARNING (nnet3-chain-normalize-egs[5.5.851~1-088e9]:main():nnet3-chain-normalize-egs.cc:84) For example sp1.1-061360_00840500361-96, FST was empty after composing with normalization FST. This should be extremely rare (a few per corpus, at most)
LOG (nnet3-chain-normalize-egs[5.5.851~1-088e9]:main():nnet3-chain-normalize-egs.cc:94) Added normalization to 21131 egs; had errors on 1
LOG (nnet3-chain-shuffle-egs[5.5.851~1-088e9]:main():nnet3-chain-shuffle-egs.cc:104) Shuffled order of 21131 neural-network training examples
```

Stages 1-8 of run.sh seem to complete without any problems, but the stage 9 training (where we modified the script) fails. Any ideas?

nshmyrev commented 3 years ago

Most likely you have run out of disk space.
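
A quick way to check (the egs directory is usually what fills the disk first; the path below is taken from your log):

```bash
df -h .                          # free space on the filesystem holding the experiment
du -sh exp/chain/tdnn1j_sp/egs   # the egs archives can easily reach hundreds of GB
```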

OscarVanL commented 3 years ago

Oh, wow. I had 250GB before I started this! I will check that 🤭

OscarVanL commented 3 years ago

I freed up some space and tried again. I started training with over 500 GB of free storage and it still failed. I checked the free system space with df at several points while the script was running and there were still hundreds of GB free.

I don't think storage space was the cause of the failure.

nshmyrev commented 3 years ago

Maybe something is wrong with that particular utterance, sp1.1-061360_00840500361-96. Did you check whether it is OK?

Core dump is something that is better reported in Kaldi github.

OscarVanL commented 3 years ago

Thanks for the suggestion. That file is fine in the training dataset. You can download it here.

I have checked the file: it has the same sampling rate and bit depth, and is mono, like the rest of the dataset.

A number of errors like that appeared for other speech samples too (although the number of occurrences was small compared to the total number of samples in the dataset).

Since this is a Kaldi issue rather than a Vosk one, I made a more detailed issue here. It includes my train script, and a zip of all the logs.

OscarVanL commented 3 years ago

Hi,

I'm getting a new error now during training:

run.sh logs:

```
steps/nnet3/chain/get_egs.sh: Finished preparing training examples
2020-12-30 04:40:38,883 [steps/nnet3/chain/train.py:428 - train - INFO ] Copying the properties from exp/chain/tdnn1j_sp/egs to exp/chain/tdnn1j_sp
2020-12-30 04:40:38,908 [steps/nnet3/chain/train.py:451 - train - INFO ] Preparing the initial acoustic model.
2020-12-30 04:40:41,200 [steps/nnet3/chain/train.py:485 - train - INFO ] Training will run for 20.0 epochs = 668 iterations
2020-12-30 04:40:41,204 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 0/667   Jobs: 2   Epoch: 0.00/20.0 (0.0% complete)   lr: 0.004000
run.pl: job failed, log is in exp/chain/tdnn1j_sp/log/compute_prob_train.0.log
run.pl: job failed, log is in exp/chain/tdnn1j_sp/log/compute_prob_valid.0.log
```

exp/chain/tdnn1j_sp/log/compute_prob_train.0.log logs:

```
ERROR (nnet3-chain-compute-prob[5.5.851~1-088e9]:AcceptInput():nnet-compute.cc:561) Num-cols mismatch for input 'ivector': 30 in computation-request, 100 provided.
```

The number 30 immediately caught my attention; in your "Training your own model" documentation you say:

Train ivector of dim 30 instead of standard 100 to save memory of mobile models.

Am I correct in thinking that the value ivector_dim=100 in steps/online/nnet2/train_ivector_extractor.sh needs to be changed to ivector_dim=30 to fix this error?

Edit: Yes, this did fix it :)
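
For anyone hitting the same mismatch, the change was essentially this (a sketch; after changing it, the ivector extractor and the online ivectors have to be regenerated before re-running the chain stage):

```bash
# Reduce the ivector dimension from 100 to 30 in the extractor training script.
sed -i 's/^ivector_dim=100/ivector_dim=30/' steps/online/nnet2/train_ivector_extractor.sh

# Since the script parses its top-level variables with utils/parse_options.pl,
# passing --ivector-dim 30 from local/nnet3/run_ivector_common.sh should
# achieve the same thing without editing the Kaldi script itself.
```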

Thanks :)

OscarVanL commented 3 years ago

Hi,

I've got a model trained successfully and it works, thanks for the help. I have a question.

When there are background noises, <UNK> is sometimes added to the transcription with my models, whereas it's not with the default model. Is this something I can disable? I'd rather no text were added to the transcription when there are unusual or unrecognised noises.

Thanks :)

OscarVanL commented 3 years ago

One more thing: my model is much larger than yours.

Your bundled model is 50.5 MB; mine is 224 MB.

My model seems to perform fine, so this size increase doesn't introduce any practical issues, but is it something I should expect to see? Obviously I would rather the file were smaller for use in a mobile app, but not at the expense of performance.

In addition, how many epochs do you think would be good for a dataset of this size? Someone on the Kaldi Help forum thought 20 was far too many.

RoxanaTapia commented 3 years ago

Hello @OscarVanL, I'm trying to train a model with my own data as well, to use in the Android app. I have prepared my audio data, which consists of public speeches. I split the audio files into chunks and prepared the spk2gender, wav.scp, text, utt2spk, and corpus.txt files from the data-preparation part. I'm unsure which LM I should use with the mini_librispeech recipe. Did you use the LM at http://www.openslr.org/resources/11/ ? It will probably contain a lot of words that are not in my small dataset. How did you adapt run.sh to your own data? Thanks in advance.

OscarVanL commented 3 years ago

I used the same language model as the original run.sh script, so yes, the one you linked.

It shouldn't matter if the LM has words that aren't in your dataset.

My changes were minimal. Obviously I changed the dataset names to my dataset names, and switched the train script at the end to run_tdnn_1j.

RoxanaTapia commented 3 years ago

@OscarVanL may I ask where you took run_tdnn_1j from? I only found one in egs/ami/local/tuning :/

OscarVanL commented 3 years ago

@OscarVanL may I ask where you took run_tdnn_1j from? I only found one in egs/ami/local/tuning :/

https://github.com/kaldi-asr/kaldi/blob/master/egs/mini_librispeech/s5/local/chain/tuning/run_tdnn_1j.sh

RoxanaTapia commented 3 years ago

@OscarVanL Thanks for the help so far. How did you handle the missing files in graph/ (disambig_tid.int, Gr.fst, and HCLr.fst)? Should HCLG.fst replace HCLr.fst? I have generated the Gr.fst as indicated in the Vosk Model Adaptation section before. I'm not sure whether the files that @nshmyrev suggested are enough to try the model on Android. BTW, I didn't encounter the issue with ivector_dim, so I would suggest that other devs first try running run_tdnn_1j without changing the dim to 30. Thanks.

OscarVanL commented 3 years ago

Running the script @nshmyrev posted in https://github.com/alphacep/vosk-android-demo/issues/110#issuecomment-743916144 gathered all of the files I needed to run the model in Android. All I needed to add was model.conf in the conf folder, which I copied from the example pre-trained model.

If you follow the mini_librispeech script with run_tdnn_1j as the training script it should generate all the files you need.
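
For reference, that extra step amounts to something like this (a sketch; `$dir` is the output directory used by the copy script above, and the path to the pre-trained demo model is a placeholder):

```bash
# Copy the decoding options file from the bundled demo model into the
# exported model; the rest of the files come from the copy script above.
mkdir -p "$dir/model/conf"
cp /path/to/pretrained-demo-model/conf/model.conf "$dir/model/conf/"
```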

RoxanaTapia commented 3 years ago

Running the script @nshmyrev posted in #110 (comment) gathered all of the files I needed to run the model in Android. All I needed to add was model.conf in the conf folder, which I copied from the example pre-trained model.

If you follow the mini_librispeech script with run_tdnn_1j as the training script it should generate all the files you need.

Did you restructure the files after running the script @nshmyrev posted? I indeed do not have HCLr.fst and Gr.fst (which can be generated). I managed to find disambig_tid.int under exp/tri3b/graph_tgsmall (there is another one in exp/chain/tree_sp/graph_tgsmall)... I'm trying to replicate the same structure as in model-android :/

OscarVanL commented 3 years ago

I didn't change anything except model.conf. I literally just copy-pasted the output into Android Studio, changed the source code's reference to the model path, and everything worked! :)

Here's a screenshot of all the files and their structure from Android Studio: [image]

OscarVanL commented 3 years ago

I'm going to close this issue as I have successfully trained a model I am satisfied with. Thank you to @nshmyrev for your fantastic assistance, and for your excellent Android example. It has achieved everything I hoped! :)

@RoxanaTapia Please feel free to reply if you have any more questions, I'd be happy to provide my (limited) experience if I can be of any further assistance.

RoxanaTapia commented 3 years ago

I just tried the trained model but it shows nothing :( not a single word. It just activates somehow... :(

OscarVanL commented 3 years ago

Sorry to hear that :(

If there are no errors from Android perhaps it's a problem with the model you trained rather than the selection of files generated during training.

I'd suggest trawling through the Kaldi logs for any suspicious errors.

RoxanaTapia commented 3 years ago

Thanks for caring. I really needed some results for my thesis; I guess I'm going to have to postpone the results until the next presentation :(

Here are some details of my experiment.

Data preparation:

Running the mini_libri recipe:

Android:

Things I'm suspicious about:

I'm going to consult with my supervisor, but I would really appreciate it if someone could help me find out what's wrong :)

OscarVanL commented 3 years ago

I'm also finishing off a fourth-year group dissertation for my degree 👍 Your project sounds interesting.

Some similarities between our experiences:

I also did the same for cmd.sh.

I also did not use .flac files. FYI, if you look at the wav.scp file generated by the default mini_librispeech script, it includes commands that convert the .flac files to .wav on the fly (see the example entry below), so you should not be concerned about this; the format Kaldi expects is .wav.

I also get warnings about too many or too few silence pauses. I think they're safe to ignore.
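
As an illustration of the wav.scp point above, each entry is an utterance ID followed by a command that decodes the .flac to WAV on stdout (the directory name, utterance ID, and path shown here are illustrative, not copied from a real run):

```bash
head -n 1 data/dev_clean_2/wav.scp
# 1272-128104-0000 flac -c -d -s /path/to/LibriSpeech/dev-clean-2/1272/128104/1272-128104-0000.flac |
```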

Here are some points I'd look at:

On this topic, it might be worth merging two datasets. For instance, you could take 100 hours of LibriTTS and merge it with your presidential-speech dataset. This approach was very effective for us, as we were only able to compile 45 minutes of our own speech dataset.

IMO if there are no obvious training errors my first focus would be on the dataset size. Is there any reason why you couldn't train with speeches + open data?

OscarVanL commented 3 years ago

Furthermore, if you do go down the route of adding more data, you'll have to figure out the GPU situation. I'm sure your university provides shared compute resources for students.

RoxanaTapia commented 3 years ago

I found some issues in my mapping files. I'll come back soon :)

OscarVanL commented 3 years ago

Do you mean the spk2gender, spk2utt, text, utt2spk, and wav.scp files?

If you haven't already, check this out. It has lots of detail for how to prepare these files.

Personally, I first had a few mistakes in text because of encoding errors (I had to use UTF-8 encoding when reading and writing with my Python script), and also because some transcriptions contained newlines (\n) which had to be stripped.

Also, sorting is very important.
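
If it helps, Kaldi ships utilities that check and repair exactly these issues (a sketch; the data directory name is a placeholder):

```bash
export LC_ALL=C                                       # Kaldi expects C-locale byte-order sorting
utils/validate_data_dir.sh --no-feats data/my_train   # reports unsorted or mismatched entries
utils/fix_data_dir.sh data/my_train                   # re-sorts and drops inconsistent lines
```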

RoxanaTapia commented 3 years ago

I had the newline issue. I'm debugging now; I wonder if the text entries must match the order of the audio files in wav.scp.

Also, I found that words.txt contains only irrelevant words... So I'm thinking of filtering the LibriSpeech data to somehow match what the politicians say, and enriching it with my data, as you suggested.

OscarVanL commented 3 years ago

Yes, the orders must match according to that page. In theory, if each file contains the same set of utterance IDs, sorting by that field should give the same order in every file.

RoxanaTapia commented 3 years ago

I just checked and they are sorted. The newline was definitely a big issue... I'll train again.

OscarVanL commented 3 years ago

Did that fix it?

RoxanaTapia commented 3 years ago

Hi, sorry for the delay. I fixed the newline bug and tried again with one speaker (a small set)... It didn't fix the error. Some metrics from my data make me think it might be an issue with the length of some sentences (the chunk transcriptions). Here are the metrics (for my entire dataset):

I think I should do:

I also need to check that things are OK on Android, so right now I'm going to try running the mini_librispeech example and see whether I can get it working on Android at all. I'm using kaldi-android5.2 with the vosk-android-demo... There are other things you suggested that I also need to check. Thanks for the tip about nohup ./run.sh &, I will use it next time.

So, basically, I'm going to try running the model on Android with open data, and if that works it means my data needs to be cleaned.

RESULTS: the names of my datasets do not match! :O Maybe that's the error; I will train again.

OscarVanL commented 3 years ago

It looks like you have a bit of a messy dataset: some sentences have 1 word, some have 1505 words?!? None of my utterances were as long as 14 minutes.

As for training your own language model, that's beyond my knowledge. Maybe the Kaldi Help forum could help.

I would suggest checking you've got the fundamentals right. Is your dataset standardised into a single format, is it consistent, is it properly labelled, and is Kaldi training without any glaring errors?
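
On the standardisation point, a quick way to force everything into the format the recipe expects (16 kHz, 16-bit, mono WAV) is something like this, assuming sox is installed; the directory names are placeholders:

```bash
mkdir -p clean_audio
for f in raw_audio/*.wav; do
  # resample to 16 kHz, 16-bit, mono
  sox "$f" -r 16000 -b 16 -c 1 "clean_audio/$(basename "$f")"
done
```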

As for the RESULTS problem, make sure your training and testing dataset names are correctly set in run_tdnn_1j.sh, local/nnet3/run_ivector_common.sh, and run.sh.

RoxanaTapia commented 3 years ago

Yep, I was assuming too much while preparing the data. But I found a nice discussion on how I should proceed with the data preparation.

For now, here are some utterance/sentence metrics that could be useful (from the test set dev-clean-2 and the training set train-clean-5 included in mini_librispeech):

```
Duration (seconds):
 - Mean: 10.141
 - Min:  1.505
 - Max: 31.7
Size:
 - Mean: 158K
 - Min: 23K
 - Max: 495K
Words:
 - Mean: 28
 - Min: 1
 - Max: 88
Vocabulary size: 9138 words.
```

RoxanaTapia commented 3 years ago

Hi, I cleaned my data using create_uniform_segments and the Android app is still not showing any results.

Metrics about my data:

- 1 speaker
- Speaker says 413 sentences
- 413 sentences/utterances in 413 chunk WAV files
- 24703 words. Per sentence words: Min: 1, Max: 102, AVG: 59.814
- Chunk Duration: Total 3.427 hours. Per sentence duration: Min: 0.242 minutes, Max: 0.5 minutes, AVG: 0.498 minutes

My model is 236.1 MB in size (I had to add org.gradle.jvmargs=-Xmx4096m to gradle.properties).

The output of my RESULTS script:

```
%WER 55.29 [ 3195 / 5779, 178 ins, 1445 del, 1572 sub ] exp/tri3b/decode_tglarge_test/wer_13_1.0
%WER 60.61 [ 660 / 1089, 45 ins, 279 del, 336 sub ] [PARTIAL] exp/tri3b/decode_tgmed_test/wer_11_0.5
%WER 58.35 [ 3372 / 5779, 165 ins, 1559 del, 1648 sub ] exp/tri3b/decode_tgsmall_test/wer_16_0.0
%WER 68.97 [ 3986 / 5779, 174 ins, 1884 del, 1928 sub ] exp/tri3b/decode_tgsmall_test.si/wer_11_0.0
%WER 36.37 [ 2102 / 5779, 271 ins, 261 del, 1570 sub ] exp/chain/tdnn1j_sp/decode_tglarge_test/wer_8_1.0
%WER 39.99 [ 2311 / 5779, 275 ins, 301 del, 1735 sub ] exp/chain/tdnn1j_sp/decode_tgsmall_test/wer_10_0.0
%WER 37.00 [ 2138 / 5779, 253 ins, 296 del, 1589 sub ] exp/chain/tdnn1j_sp_online/decode_tglarge_test/wer_9_1.0
%WER 41.17 [ 2379 / 5779, 206 ins, 423 del, 1750 sub ] exp/chain/tdnn1j_sp_online/decode_tgsmall_test/wer_10_0.5
```

I'm running out of ideas :( Did you get anything showing in the Android App @OscarVanL?

Things I need to try:

It seems like I get some results, but nothing shows in the app with the set of files proposed by @nshmyrev that you showed above.

OscarVanL commented 3 years ago

It seems like your model has failed to converge, or overfitted the data.

I don't see any benefit in downsampling to 8 kHz; this script was built for use with 16 kHz. Are your source files only available at 8 kHz? The parameters of this model are presumably tuned for 16 kHz.

For the Gr.fst and words.txt, I recommend using a pre-trained 3-gram language model and vocabulary. The mini_librispeech script downloads and prepares this for you, and uses that language model. I did not touch the language model at all, and I suggest you leave it alone too, as my gut says your problems lie elsewhere.

Your approach is at odds with the advice @nshmyrev gave me at the beginning of this thread, to use 200-300 hours of speech data to train. (Note: You need to turn down the number of epochs in run_tdnn_1j if you use this much data, or it will take forever to train).
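
For context, the epoch count is just a variable near the top of the tuning script (a sketch; 20 is what my own training log showed as the default, and a handful of epochs is more usual once you train on a few hundred hours):

```bash
# local/chain/tuning/run_tdnn_1j.sh
num_epochs=20   # fine for ~5 h of mini_librispeech data; drop to roughly
                # 4-6 when training on a few hundred hours
```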

It seems your dataset is about two orders of magnitude too small, if that one speaker is the only data you're training on.

RoxanaTapia commented 3 years ago

Hello, I trained the models again with only filtered mini_librispeech data. The app still doesn't show anything, not a single word...

Here are my RESULTS:

```
%WER 12.93 [ 1540 / 11910, 188 ins, 181 del, 1171 sub ] exp/tri3b/decode_tglarge_dev_clean_2/wer_17_0.0
%WER 15.40 [ 1834 / 11910, 180 ins, 266 del, 1388 sub ] exp/tri3b/decode_tgmed_dev_clean_2/wer_17_0.0
%WER 16.94 [ 2018 / 11910, 183 ins, 316 del, 1519 sub ] exp/tri3b/decode_tgsmall_dev_clean_2/wer_17_0.0
%WER 23.36 [ 2782 / 11910, 250 ins, 432 del, 2100 sub ] exp/tri3b/decode_tgsmall_dev_clean_2.si/wer_17_0.0
%WER 7.52 [ 896 / 11910, 110 ins, 92 del, 694 sub ] exp/chain/tdnn1j_sp/decode_tglarge_dev_clean_2/wer_11_0.5
%WER 10.96 [ 1305 / 11910, 136 ins, 157 del, 1012 sub ] exp/chain/tdnn1j_sp/decode_tgsmall_dev_clean_2/wer_10_0.0
%WER 7.55 [ 899 / 11910, 107 ins, 87 del, 705 sub ] exp/chain/tdnn1j_sp_online/decode_tglarge_dev_clean_2/wer_10_0.5
%WER 10.98 [ 1308 / 11910, 135 ins, 158 del, 1015 sub ] exp/chain/tdnn1j_sp_online/decode_tgsmall_dev_clean_2/wer_10_0.0
```

I'm quite lost at this point. @OscarVanL, would it be possible for you to share your model, just so I can make sure my app works at all? I would really appreciate it.

OscarVanL commented 3 years ago

Have you tried just cloning this repo, changing nothing and using it with the built-in pre-trained model? Have you considered using a different Android phone?

RoxanaTapia commented 3 years ago

Yes, I did that before starting on my models, and again just now. It's better in the sense that it transcribes nearly in real time, but the content makes little sense, e.g. the word "and" gets transcribed as "aws", so it's not really legible. I would estimate a WER of <50%... The bundled model also looks quite different from the one generated here... The new model only outputs a word like "now" after listening to about 2 minutes of audio... Other than training my own language model, I don't know what else I could do; I'm afraid of investing more time in training the language model and not getting any useful results afterwards.

PS: I haven't checked with another phone. CPU: HiSilicon Kirin 659, 4 GB memory.

OscarVanL commented 3 years ago

I found the built-in example made loads of mistakes for me too with my British accent; my own trained model performed much better.

I think the models you train are very sensitive to accents; LibriTTS/LibriSpeech predominantly consists of American speakers. For this reason I created a subset of LibriTTS containing only British speakers.

When I trained with a 50/50 split of British and American accents, it recognised my British accent very well (I would have trained entirely on British accents, but there is not enough data).

I don't know what your accent is, but are you testing it with your own speech, or speech similar to that which you are training on?