daanzu / kaldi-active-grammar

Python Kaldi speech recognition with grammars that can be set active/inactive dynamically at decode-time
GNU Affero General Public License v3.0

Fine Tuning #33

Open dpny518 opened 4 years ago

dpny518 commented 4 years ago

Do you have the procedure for fine tuning the model with our own data? Would we use this model https://github.com/daanzu/kaldi-active-grammar/releases/download/v1.4.0/kaldi_model_daanzu_20200328_1ep-mediumlm.zip or https://github.com/daanzu/kaldi-active-grammar/releases/download/v1.4.0/vosk-model-en-us-daanzu-20200328.zip

then apply this script and update the paths? https://github.com/kaldi-asr/kaldi/blob/master/egs/rm/s5/local/chain/tuning/run_tdnn_wsj_rm_1c.sh

daanzu commented 4 years ago

I'm working on simplifying and documenting how to perform fine tuning. I would say to use https://github.com/daanzu/kaldi-active-grammar/releases/download/v1.4.0/kaldi_model_daanzu_20200328_1ep-mediumlm.zip

I've had more success with the procedure in egs/aishell2/s5/local/nnet3/tuning/finetune_tdnn_1a.sh adapted for chain.

dpny518 commented 4 years ago

Thanks, I got it working as well. I modified that one, added ivectors, and used the DNN alignment instead of GMM... however, my WER is very high, almost 100% wrong (the graph and language model are correct, because when I switched in the original final.mdl the WER dropped to < 20%). Do you have the parameters for fine tuning for your examples of about 30 hours: number of epochs, initial learning rate, final learning rate, minibatch size?

daanzu commented 4 years ago

I am planning on doing much more experimentation on this soon, but I think I had most success with parameters like this:

num_epochs=5
initial_effective_lrate=.00025
final_effective_lrate=.000025
minibatch_size=1024
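
For reference, in an aishell2-style finetune_tdnn_1a.sh run these would be forwarded to the nnet3 trainer roughly like this (a sketch, not my exact invocation; directory variables such as $dir, $ali_dir, and $train_set are placeholders):

```bash
# Sketch only: how the parameters above are typically passed through in an
# aishell2/finetune_tdnn_1a.sh-style fine-tuning run. Paths are placeholders.
num_epochs=5
initial_effective_lrate=.00025
final_effective_lrate=.000025
minibatch_size=1024

steps/nnet3/train_dnn.py --stage $train_stage \
  --cmd "$train_cmd" \
  --trainer.input-model $dir/input.raw \
  --trainer.num-epochs $num_epochs \
  --trainer.optimization.num-jobs-initial 1 \
  --trainer.optimization.num-jobs-final 1 \
  --trainer.optimization.initial-effective-lrate $initial_effective_lrate \
  --trainer.optimization.final-effective-lrate $final_effective_lrate \
  --trainer.optimization.minibatch-size $minibatch_size \
  --feat-dir data/${train_set}_hires \
  --lang data/lang \
  --ali-dir $ali_dir \
  --dir $dir
```
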
dpny518 commented 4 years ago

Thank you, works beautifully. Dropped WER from 38% to 30% with only 1 hour of training data and an independent test set.

daanzu commented 4 years ago

Great! Thanks for the WER% info.

vasudev-hv commented 4 years ago

@daanzu - Can you please tell me where I can find the following files?

- exp/tri3/final.occs
- data/lang/oov.int
- data/lang/phones.txt (is this a new phones.txt that I need to generate (how?), or the one provided in the daanzu zip file?)
- data/lang/L.fst (symlinked to kaldi_model/L_disambig.fst, but align.1.log says Failed to read token [started at file position -1], expected <TransitionModel>)

Also, I have symlinked the following files. Was that expected? exp/tri3/final.mdl -> kaldi_model/final.mdl, exp/tri3/tree -> kaldi_model/tree

vasudev-hv commented 4 years ago

added ivectors and used the dnn alignment instead of gmm

@yondu22 - Can you please elaborate on how you did this? What changes did you make in the script?

dpny518 commented 4 years ago

I used this script https://github.com/kaldi-asr/kaldi/blob/master/egs/rm/s5/local/chain/tuning/run_tdnn_wsj_rm_1c.sh and changed wsj to daanzu, and rm to the new data.

daanzu commented 4 years ago

@vasudev-hv This is a nnet3 chain model, not tri3. The other files can be generated from the ones included in the download.

JohnDoe02 commented 4 years ago

I am currently also trying to set up a training pipeline. While I recently managed to get run_tdnn_wsj_rm_1c.sh to complete the training, I am not yet able to obtain a final.mdl which outperforms the input model. To give some background, and as it might be useful for others with similar intentions, here are the steps I took.

Training Data

Around 9000 utterances from my day-to-day use, recorded and labeled by KAG in retain.tsv. I manually went through all of them and removed wrongly labeled ones, so the training data should be free of mislabelings; however, it is of course highly imbalanced. E.g., I say super ticky more than anything else, to bring up a Quake-like terminal. On average, each utterance is about 1s long, and after splitting into test and training sets, around 7200 remain for training, which nicely sums up to ~2h.

Language model/Decoding Graph

I built my language model using the original lexicon.txt, additionally merging in the user_lexicon.txt. I further created silence_phones.txt (SIL, SPN, NSN), optional_silence.txt (SIL), and nonsilence_phones.txt (everything else). Together with the phones.txt from the original model and the unknown word <unk>, I passed those to Kaldi's utils/prepare_lang.sh, which creates L.fst, etc.

The only thing remaining from here to build a decoding graph is the grammar (G.fst). As I expected problems with the original G.fst shipped with the recently released models, due to the unknown words I added from user_lexicon.txt, I created a simple bigram model with ngram-count based on the input utterances (see the sketch below).
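
Roughly, the preparation above amounts to the following (a sketch under a few assumptions: SRILM's ngram-count is installed, dict/ holds the merged lexicon.txt plus the phone lists, corpus.txt holds the training transcripts, and the unzipped model lives in kaldi_model/; exact paths differ in my script):

```bash
# Sketch of the lang/graph preparation; paths are assumptions (see above).
utils/prepare_lang.sh dict "<unk>" data/lang_tmp data/lang   # creates L.fst, phones.txt, ...

# Simple bigram grammar estimated from the training transcripts (SRILM).
ngram-count -order 2 -wbdiscount -text corpus.txt -lm lm.arpa
arpa2fst --disambig-symbol="#0" --read-symbol-table=data/lang/words.txt \
  lm.arpa data/lang/G.fst

# Compile the decoding graph against the original model's tree and final.mdl.
utils/mkgraph.sh --self-loop-scale 1.0 data/lang kaldi_model kaldi_model/graph
```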

Verification of the Decoding Graph

I can successfully pair the original final.mdl with my freshly created decoding graph. The WER on my test data is very low, something like 3%. I expected something in this range, because the bigram grammar should have a similar effect to the enabling/disabling of rules in different contexts during day-to-day use.

Getting run_tdnn_wsj_rm_1c.sh to run

Most of the necessary changes were pretty straightforward, e.g., changing paths to their new locations, etc. More interesting are the following changes, which I found necessary:

Step 6

I added cp $lat_dir/ali.*.gz $src_tree_dir before the call to make_weighted_den_fst.sh, as otherwise this call would fail. However, I believe this is not the proper solution. I believe the script expects alignment files from the previous training (i.e., from the training that @daanzu performed when creating the model) to be present. I guess these are then used to present old training data together with new training data when doing the actual transfer learning in step 7. My quick fix simply copies over the alignment files from the new training data, so there is no original training data. This might very well be a problem! @yondu22 How did you go about this?

Step 7

I am using chain_opts=(--chain.alignment-subsampling-factor=3 --chain.left-tolerance=7 --chain.right-tolerance=7) [vs. chain_opts=(--chain.alignment-subsampling-factor=1 --chain.left-tolerance=1 --chain.right-tolerance=1) upstream], as otherwise get_egs.sh would fail, or even worse, throw out large parts of my training data without complaining that something is wrong. However, I don't really know if the increase of said tolerances might have negative impacts on the training. I also set --egs.chunk-width=20 [vs. --egs.chunk-width=150 upstream], as otherwise a lot of utterances shorter than 150 frames get thrown out. To me, it really seems vital to check the logs of get_egs.sh even if you don't see any complaints when starting the training.

I should also note that due to having an AMD card I cannot use CUDA and am restricted to CPU training. The following changes might not be necessary, or might differ, with an Nvidia card: I needed to allow for reduced minibatch sizes, as otherwise a script that combines models from different jobs would fail (I am using --trainer.num-chunk-per-minibatch=128,64,32,16,8). Also, I was so far unable to use more than 3 jobs (I am using --trainer.optimization.num-jobs-initial=3 --trainer.optimization.num-jobs-final=3, where I have no idea about the difference between initial and final jobs). The changed options are summarized in the sketch below.
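
Condensed, the train.py options I changed look roughly like this (sketch only; the rest of the invocation and the directory variables are as in run_tdnn_wsj_rm_1c.sh):

```bash
# Sketch: only the options discussed above; $train_stage, $decode_cmd,
# $train_data_dir, $tree_dir, $lat_dir and $dir come from run_tdnn_wsj_rm_1c.sh.
chain_opts=(--chain.alignment-subsampling-factor=3 \
            --chain.left-tolerance=7 --chain.right-tolerance=7)

steps/nnet3/chain/train.py --stage $train_stage \
  --cmd "$decode_cmd" \
  ${chain_opts[@]} \
  --egs.chunk-width=20 \
  --trainer.num-chunk-per-minibatch=128,64,32,16,8 \
  --trainer.optimization.num-jobs-initial=3 \
  --trainer.optimization.num-jobs-final=3 \
  --feat-dir $train_data_dir \
  --tree-dir $tree_dir \
  --lat-dir $lat_dir \
  --dir $dir
```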

Results

WER 30.40 [ 1109 / 3648, 111 ins, 800 del, 198 sub ] after a single iteration, i.e., a large decrease in accuracy (WER ~3% on the test set with the original model). On the bright side: the model is not complete garbage! I can replace the official final.mdl with my new one and KAG will work (with less accuracy, of course).

Disclaimer

Both speech recognition and Kaldi are rather new to me, so all of my learnings should be taken with a grain of salt. In the worst case, they might plainly be wrong.

Links to the full scripts

Data prep: https://github.com/JohnDoe02/kaldi/blob/private/egs/rm/s5/local/prepare_data.py
Language model / decoding graph: https://github.com/JohnDoe02/kaldi/blob/private/egs/rm/s5/local/prepare_daanzu_lang.sh
Training script: https://github.com/JohnDoe02/kaldi/blob/private/egs/rm/s5/local/chain/tuning/run_tdnn_wsj_rm_1c.sh

CC @vasudev-hv

daanzu commented 4 years ago

@JohnDoe02 Wow, great detailed write up! Thanks for posting it.

As stated earlier, I had more success adapting the aishell2 finetuning script than run_tdnn_wsj_rm_1c, although it has been a while since I compared them and I don't remember the details of how the latter works. In general, for fine tuning you shouldn't need the alignments from the initial training, only the newly generated alignments for the new training data.

Regarding frames_per_eg, you might want to try frames_per_eg=150,110,100,40 instead, so it uses larger chunks when possible and only smaller ones when necessary (in general <100 seem to not work as well, but I hate to throw out too much data).

Regarding number of jobs, I think >1 is only useful for using multiple GPUs, so I would suggest using 1 for both initial and final.

Sorry for being so slow to finish cleaning up my version. I will at least get the basic script posted ASAP. And I hope to get a nice Docker image put together to make it relatively easy.

JohnDoe02 commented 4 years ago

@daanzu frames_per_eg (aka egs.chunk-width) was a good hint; it indeed seems to play an important role. Stating multiple values as suggested gets my WER down to 20%. To me, this strongly suggests that the training files should not be too short. I will proceed with preparing a cleaner dataset, i.e., less imbalanced and >150 frames per utterance.

I am very much looking forward to having a look at your training script. Don't polish too hard, anything helps!

daanzu commented 4 years ago

https://gist.github.com/daanzu/d29e18abb9e21ccf1cddc8c3e28054ff It's not pretty, but maybe it can be of use until I have something better.

Regarding training file length, including some amount of prose dictation of reasonable length can definitely be a big help. I think it is still helpful to include short commands in the training, plus it is so easy to collect a large set of them through normal use, but they have weaknesses.

JohnDoe02 commented 4 years ago

Thanks for posting! I will try to get it to run as well. At first sight, I find it rather interesting that you are using --chain.alignment-subsampling-factor=1, which did not work at all for me. get_egs.sh will immediately fail in this case, after spitting out warnings like the following for each of my training files:

WARNING (nnet3-chain-get-egs[5.5.790-ebb43]:LengthsMatch():nnet-example-utils.cc:584)
Supervision does not have expected length for utterance retain_2020-07-01_15-12-30_019191:
expected length = (74 + 3 - 1) / 3 = 25, got: 74 (note: --frame-subsampling-factor=3)

Also, you apparently had no need to ramp up the tolerances, i.e., you are using --chain.left-tolerance=1 --chain.right-tolerance=1.

Very interesting. I'll investigate.

daanzu commented 4 years ago

It's been a while since I started experimenting, and I can't recall exactly how I ended up with this. I think the chain.alignment-subsampling-factor is because I generate the alignments with the chain model itself, rather than the traditional method of using an earlier-trained GMM model, but I could be wrong. But it seemed to work well.

JohnDoe02 commented 4 years ago

I have experimented some more and prepared a very clean data set, with about 1h of dictation, 1h of command-like speech, and 1h from day-to-day use. However, this did not bring much improvement. For my new (more complex) data set I got a reference WER of 20 with the official model, which degraded to a WER of 60 after training. My conclusion is that the training data set is not the main culprit.

Next, I focused on getting your script to run as well. Interestingly, I ran into pretty much the same problems as with the run_tdnn_wsj_rm_1c.sh script. The training stage would complain about missing alignment files in the tree directory. Again I applied my hack and added cp $lat_dir/ali.*.gz $tree_dir. Additionally, I had to add --generate-ali-from-lats true to the align_lats.sh call to generate the alignment files from the lattice files for my training data. I also had to use --chain.alignment-subsampling-factor=3, as otherwise get_egs.sh would fail with the error mentioned above. The workarounds are sketched below.
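
In shell terms, the workarounds boil down to something like this (a sketch; the directory variables are placeholders taken from the script):

```bash
# Sketch of the workarounds; directory variables are placeholders.
# 1. Have align_lats.sh also write ali.*.gz files for the new data.
steps/nnet3/align_lats.sh --generate-ali-from-lats true \
  --cmd "$decode_cmd" --nj 8 \
  data/${data_set}_hires data/lang $src_dir $lat_dir

# 2. Hack: copy those alignments into the tree dir so the training stage
#    finds the ali.*.gz files it expects there.
cp $lat_dir/ali.*.gz $tree_dir
```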

With this setup, my results were similar as with the run_tdnn_wsj_rm_1c.sh script. A strongly degraded accuracy due to the training (20 WER vs. 75 WER).

I investigated further and found out that Kaldi uses the phone sequences within the alignment files at the training stage to calculate a special 4-gram phone language model (cf. "The denominator FST" at http://www.kaldi-asr.org/doc/chain.html). Apparently, this phone language model is recalculated from scratch as the first prerequisite for the actual training (stage=-6 in train.py), and due to my hack it is of far inferior quality, as it stems from only around 3h of training data (vs. 3000h of training data with the original alignment files). I suspect this to be the reason for the strong decrease in quality of the resulting model after training.

@daanzu Would you mind uploading your original alignment files (I guess their total size should be on the order of 1G), so that I can check whether their presence is the proper fix for the issue?

daanzu commented 4 years ago

@JohnDoe02 I am a bit puzzled by your experience, but I will try to find time to look into it more.

Which alignment files are you looking for? I'm fairly sure I ran my fine tuning experiments on an export very similar to the published package.

JohnDoe02 commented 4 years ago

From looking at your script, the tree directory is defined in the beginning as tree_dir=exp/nnet3_chain/tree_sp. It should contain files like ali.1.gz, ali.2.gz, etc. These are the alignment files I am looking for. Their number depends on the number of jobs that were used to generate them, typically between 20 and 100. For 3000h of training data, I would estimate ~1G in total for all of them.

As far as I can tell from my experiments, these files must have been present from the start when you ran your finetuning script (i.e., they were not generated during execution). While align_lats.sh is capable of generating such files for the finetuning data, your script doesn't do so, since it does not pass the --generate-ali-from-lats true switch.

daanzu commented 4 years ago

@JohnDoe02 Ah, I didn't realize that was there and a dependency. Good find! Attached is (I think) the tree directory for the most recent models. It will be quite interesting to see your results and comparison.

https://github.com/daanzu/kaldi-active-grammar/releases/download/v1.8.0/tree_sp.zip

JohnDoe02 commented 4 years ago

Nice, thanks for uploading! The scripts are now happy, no more missing files. The only remaining confusion is the --chain.alignment-subsampling-factor=3 parameter, as both you and the authors of run_tdnn_wsj_rm_1c.sh had this set to 1, which refuses to work for me, although I am using the same setup (at least as far as I can tell).

In any case, I will run some experiments and report back how things go!

JohnDoe02 commented 4 years ago

So first of all, let's start with the good news. I'm finally able to obtain models which perform better on the test set than the input model (16 WER vs. 20 WER is the best I got so far). Indeed, when plugging it into KAG, I am finally understood when saying eight jury, which gives 8j (instead of pj or aj; I am having a lot of vim frustration with this). However, there are other new issues which I am still investigating, so I cannot really use the model day-to-day yet.

Regarding your alignment files, for which I had really hoped that they would be the magic missing piece in the puzzle (I was dreaming of error rates jumping down to ~3-5 WER; I am so naive ;) ), this turned out not to be the case. Using your alignment files, the error drops to 20-30 WER after training, in comparison to ~20 WER before training, i.e., I end up with a slightly less performant model. Using your files, I was not able to outperform the input model yet.

For the result mentioned in the beginning, I used my alignment hack. The reason the error is so much lower than before is that there was a bug in my hack (:D). There is an innocent-looking file in the tree dir, num_jobs, which contains an integer stating the number of jobs that were used for obtaining the alignment files. This file is used by Kaldi to determine how many ali.*.gz files are to be read. Some time ago, in my first attempts to get the training script running, I had written 1 into this file and never bothered with it again. Well, it turns out Kaldi always used only one of my alignment files, which is why I was seeing such large errors before. A quick check for this is sketched below.
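
If you run into the same thing, the check is trivial once you know about the file (sketch; $tree_dir is a placeholder):

```bash
# num_jobs must match the actual number of alignment archives, otherwise
# only a subset of the ali.*.gz files gets used, silently.
ls $tree_dir/ali.*.gz | wc -l > $tree_dir/num_jobs
cat $tree_dir/num_jobs
```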

For the moment I am now again focusing on gathering more data (currently, I have about 5h).

JohnDoe02 commented 4 years ago

I believe I have found the reason for, and the correct fix for, the --chain.alignment-subsampling-factor=3 surprise. There seems to be another file missing: within $src_dir (which simply holds the unzipped kaldi_model), align_lats.sh checks for a file frame_subsampling_factor. As we are using a chain model, its contents should probably read 3. Indeed, echo 3 > kaldi_model/frame_subsampling_factor allows me to use --chain.alignment-subsampling-factor=1 --chain.left-tolerance=1 --chain.right-tolerance=1 (the tolerances also scale by a factor of three); see the sketch below.
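
For reference, the fix itself (assuming the model was unzipped to kaldi_model/):

```bash
# Tell align_lats.sh that the source chain model subsamples frames by 3;
# with that in place the upstream chain options work unchanged.
echo 3 > kaldi_model/frame_subsampling_factor
chain_opts=(--chain.alignment-subsampling-factor=1 \
            --chain.left-tolerance=1 --chain.right-tolerance=1)
```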

@daanzu Could you please confirm that you have such a file, with such content, in your original src dir? Judging from your script, it should be in src_dir=exp/nnet3_chain/tdnn_f-2.1ep

daanzu commented 4 years ago

@JohnDoe02 Good find! Yes, my source directory has a frame_subsampling_factor file containing 3, and now it makes sense why the other things were happening. It's confusing that missing the file causes it to silently assume 1. Are your accuracy results better after training with this fix? I would think it would help significantly.

dpny518 commented 4 years ago

I commented out the chain_opts line and deleted ${chain_opts[@]} from the train.py call, and went with the defaults:

# we use chain model from source to generate lats for target and the
  # tolerance used in chain egs generation using this lats should be 1 or 2 which is
  # (source_egs_tolerance/frame_subsampling_factor)
  # source_egs_tolerance = 5
  #chain_opts=(--chain.alignment-subsampling-factor=3 --chain.left-tolerance=1 --chain.right-tolerance=1)
  steps/nnet3/chain/train.py --stage $train_stage  \
    --cmd "$decode_cmd" \

The defaults are 3, 5, 5: https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/steps/nnet3/chain/train.py#L98

It seemed to work well for me. Let me know your experience and whether I should change to 3,1,1 or 1,1,1.

JohnDoe02 commented 4 years ago

Against all expectations, my results degraded (a lot!) by using the frame_subsampling_factor file. I still believe it is the correct fix, though. And while my results with 3 7 7 were much better, they were still not good enough to produce a worthwhile model. So apparently, there is still something wrong. I am running out of ideas though.

For now, I have switched strategy and am training from scratch. My first results look very promising, with error rates around 0.75 for my command dataset (basically consisting of utterances using only my words for numbers, letters, and symbols). With the original model I get values >10. Dictation also looks way better after I threw librispeech's train_clean_100 dataset into the mix. My own training data is at around 12h (+3h of various test data), so a total of 112h of training data. I based the whole thing on the librispeech recipe.

The downside is of course that training takes much longer and becomes impossible without a GPU. Also, I have not yet been able to make one of my from-scratch-trained models work with KAG.

daanzu commented 4 years ago

@JohnDoe02 Surprising! FWIW, I trained my first 100%-personal model with just ~4h of audio, and was astounded at how decent the results were, considering. And building up a corpus isn't too hard during general use with my "action corrected" command.

dpny518 commented 4 years ago

@johndoe I would double-check your work; we all got better results. Make sure your new data is lowercase to match daanzu's.

JohnDoe02 commented 4 years ago

@daanzu Yes, I am astounded as well. This works out-of-the-box much better than anticipated.

@yondu22 Good point with the lowercase vs. uppercase. Obviously I ran into this problem, too. Fixed that one about 5 weeks ago.

As mentioned above, for now I am simply training from scratch. My first results are nothing short of amazing, and while I am still facing an integration issue regarding dictation (cf. #39), spelling is already working much better.

However, I am still interested in getting the transfer script running as well. One thing I am curious about is whether aligning with a GMM would make a difference. There is also run_tdnn_wsj_rm_1a.sh, which at least for the RM corpus seems to have given better results than the 1c variant we are using.

daanzu commented 4 years ago

@JohnDoe02 I'm skeptical that aligning with a GMM model would help, but if you want to give it a try, I could upload the GMM model that was used in the training of my published model. Experiments are always interesting!

JohnDoe02 commented 4 years ago

@daanzu Sounds great! I will give it a shot.

daanzu commented 4 years ago

@JohnDoe02 Here's the GMM model. Apologies for the delay!

tri2b.zip

Ashutosh1995 commented 4 years ago

Hi @daanzu, I am trying to fine-tune the Indian English model with my wake word data. I am not able to understand where to give the path for my dataset, as well as for the model downloaded from the VOSK website.

It would really be helpful if you could clarify the path issue in the script you provided.

daanzu commented 3 years ago

@Ashutosh1995 I just pushed an updated version that is a bit more explicit about the inputs. However, the model you are using very well may not include everything necessary, especially to make fine tuning easy.

https://gist.github.com/daanzu/d29e18abb9e21ccf1cddc8c3e28054ff

Ashutosh1995 commented 3 years ago

@daanzu I got the path information. There was one more query I had.

How much data is required to perform fine-tuning for a task like wake word detection on the model file you have provided?

daanzu commented 3 years ago

@Ashutosh1995 Any amount of training data should be helpful, but of course, the more the better. I haven't yet tested as thoroughly and rigorously as I would like. https://github.com/daanzu/kaldi-active-grammar/blob/master/docs/models.md

Ashutosh1995 commented 3 years ago

Hi @daanzu , splice_opts is not present in https://github.com/daanzu/kaldi-active-grammar/releases/download/v1.4.0/kaldi_model_daanzu_20200328_1ep-mediumlm.zip

Could you please provide that ?

daanzu commented 3 years ago

@Ashutosh1995 cp $model/conf/splice.conf extractor/splice_opts

Ashutosh1995 commented 3 years ago

Hi @daanzu, I reached Stage 9 but there I get an exception saying that

Exception: Expected exp/nnet3_chain/tree_sp/ali.1.gz to exist.

I couldn't find a .gz file in the source model folder. Could you please help in this regard?

I am a bit of a novice at Kaldi, hence still figuring things out.

daanzu commented 3 years ago

@Ashutosh1995 See https://github.com/daanzu/kaldi-active-grammar/issues/33#issuecomment-706721159 above, and use that file.

jose-alvessh commented 3 years ago

Hi @daanzu,

I have followed the steps of the file that you have here in this comment. Then, when I replaced only the final.mdl file obtained after fine-tuning on your model vosk-model-en-us-daanzu-20200905-lgraph, I did not get any results when I used it to test my validation set with the Android demo of vosk-api (completely empty results). With the original model I get excellent results.
I have fine-tuned the model with around 4000 sentences (~40 different sentences repeated by 100 users) with ~2 words each, and the process runs without any major error.

Do I need to replace any other file in the Android model in order to get results from the fine-tuned model (I've searched and did not find which files we should change in the original models)? Or is my data just not enough, and I'm getting no errors because of that? I already checked and changed the language model (the Gr.fst file) and still did not get any results. I have also changed the learning rates and all the variables that you referred to here a few times.

Thanks for your awesome work, and I hope you can help me with that. Regards,

Asma-droid commented 3 years ago

Hi @daanzu

I need help fine-tuning the aspire model ("vosk-model-en-us-aspire-0.2"). The model has the following structure (see attached image). I have taken "https://gist.github.com/daanzu/d29e18abb9e21ccf1cddc8c3e28054ff#file-run_finetune_tdnn_1a_daanzu-sh" as a template, but my problem is how to configure it. I don't know where I can find the conf_dir and lang_dir. Also, I am not able to find the exp folder containing nnet3_chain, ${data_set}_ali, and ${data_set}_lats. Are there methods to generate them from a pre-trained model?

Any help please?

xizi642 commented 3 years ago

Hi @daanzu ,

I am fine-tuning the aspire model, and I am also stuck at stage 9, without the tree_sp for aspire. Can you please upload that for aspire as well? Thank you.

wwbnjsace commented 1 year ago

The size of the final.mdl downloaded from https://github.com/daanzu/kaldi-active-grammar/releases/download/v1.4.0/vosk-model-en-us-daanzu-20200328.zip is about 70M, but when I train from scratch using the vosk-api-master/training/local/chain/run_tdnn.sh model parameters, my final.mdl ends up at about 20M. Can anyone tell me why?

wwbnjsace commented 1 year ago

@vasudev-hv This is a nnet3 chain model, not tri3. The other files can be generated from the ones included in the download.

I used the target lexicon.txt to generate phones.txt, L.fst, and so on, but in the steps/nnet3/align_lats.sh stage there is an error: "ERROR (compile-train-graphs[5.5]:GetArc():context-fst.cc:177) ContextFst: CreateArc, invalid ilabel supplied [confusion about phone list or disambig symbols?]: 335"
Any reason?

wwbnjsace commented 1 year ago

@vasudev-hv This is a nnet3 chain model, not tri3. The other files can be generated from the ones included in the download.

Can the L.fst and phones.txt be generated from the download? I used the downloaded lexicon.txt to generate phones.txt, but they are different, so I cannot do the fine-tuning.

moodpanda commented 6 months ago

Is it possible to fine-tune this model using this method? vosk-model-tl-ph-generic-0.6

NimishaPithala commented 2 months ago

I am trying to fine-tune the model [vosk-model-small-en-us-0.15] using a noisy dataset (Librispeech clean augmented with noisy data). I tried following the above steps, but I wanted to know if anyone has tried this before, and whether the performance is similar to other Vosk models.

Also, while changing the hyperparameters for tuning, can we do a grid search here in Kaldi for Vosk to understand which parameters give better performance, or is only empirical experimentation (changing the parameters randomly) possible?