Closed OscarVanL closed 3 years ago
In general, you can simply train from scratch instead of all that fine-tuning. An American model is going to be useless for you anyway.
From your sample I didn't understand: is there a single voice in the end, or do all patients sound different, just a bit metallic?
Some unique characteristics for each patient will come through. But mostly they will have a metallic voice.
The reason I don't just train from scratch with the patient data is that I feared there would be too little data (5 hours), so I thought that with tuning I could get better generalisation. In this case, what would you suggest?
Perhaps I could combine some of my "normal" British speakers dataset with the patient speech and train on that, but I feared the speech might be too different for this to be effective.
Since the model is tiny it doesn't need that much data; something like 200-300 hours is enough. I would get a British dataset first (get one from YouTube, or filter the TED-LIUM or LibriSpeech speakers), something like 100 hours is enough, then I would mix in your specific data and run augmentation.
Maybe you need to spend more time on augmentation to try to create a voice similar to your samples; that will help. For training you can take the mini_librispeech recipe.
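For the augmentation step, one common trick (the same idea Kaldi's `utils/data/perturb_data_dir_speed.sh` implements) is to add speed-perturbed copies of each utterance by wrapping the original file in a sox pipe inside `wav.scp`. A minimal sketch, using a one-line stand-in `wav.scp` (the utterance id and path are hypothetical):

```shell
set -e
work=$(mktemp -d)
mkdir -p "$work/train" "$work/train_sp"
# A one-utterance stand-in wav.scp; a real recipe has thousands of lines.
printf 'utt1 /corpus/utt1.wav\n' > "$work/train/wav.scp"
# For each perturbation factor, prefix the utterance id and wrap the
# original file in a sox pipe (Kaldi accepts trailing-| commands in wav.scp).
for factor in 0.9 1.1; do
  awk -v f="$factor" '{print "sp" f "-" $1, "sox -t wav " $2 " -t wav - speed " f " |"}' \
    "$work/train/wav.scp"
done > "$work/train_sp/wav.scp"
cat "$work/train_sp/wav.scp"
```

Since the perturbed "files" are just pipe commands, no extra audio is stored on disk; Kaldi decodes them on the fly during feature extraction.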
I have already filtered LibriTTS (LibriSpeech) to get about 30 hours of British speakers, and there is also the ARU speech corpus.
It looks like I will still need a bit more data though. Thanks for the suggestions.
On the models documentation page it says: _Latest minilibrispeech uses online cmvn which we do not support yet. Use this script to train nnet3 model._
Is this still necessary?
> Is this still necessary?

Yes
Thanks, I presume this just means changing this line to call `run_tdnn_1j.sh`?
Correct.
I've trained the mini_librispeech example with the above change.
I want to test this model on the app to make sure it works.
I need to build the model structure and am following the instructions here, but there are some ambiguities about which files I need to use.
I've listed all the ambiguities in my mini_librispeech folder and bolded the ones I think I should use. Please could you advise which ones I'm supposed to use? :)
- `am/final.mdl`
- `conf/mfcc.conf`
- `conf/model.conf`
- `ivector/final.dubm`
- `ivector/final.ie`
- `ivector/final.mat`
- `ivector/splice.conf` (splice.conf is not present in the ivector_extractor folder.)
- `ivector/global_cmvn.stats`
- `ivector/online_cmvn.conf`
- `graph/phones/word_boundary.int` (I have no idea...)
- `graph/HCLG.fst`
- `graph/HCLr.fst`
- `graph/Gr.fst`
- `graph/phones.txt` (no idea, there are so many...)
- `graph/words.txt` (Also no idea...)
- `rescore/G.carpa`
- `rescore/G.fst` (no idea)
Thank you!
```shell
#!/bin/bash
# Path to the directory where the model will be placed
dir="${1:-$HOME}"
echo "$dir"
if [ ! -d "$dir/model" ]
then
    mkdir -p "$dir/model/ivector"
fi
# i-vector extractor files
cp exp/chain/tdnn1*_sp_online/ivector_extractor/final.dubm "$dir/model/ivector"
cp exp/chain/tdnn1*_sp_online/ivector_extractor/final.ie "$dir/model/ivector"
cp exp/chain/tdnn1*_sp_online/ivector_extractor/final.mat "$dir/model/ivector"
cp exp/chain/tdnn1*_sp_online/ivector_extractor/global_cmvn.stats "$dir/model/ivector"
cp exp/chain/tdnn1*_sp_online/ivector_extractor/online_cmvn.conf "$dir/model/ivector"
cp exp/chain/tdnn1*_sp_online/ivector_extractor/splice_opts "$dir/model/ivector"
cp exp/chain/tdnn1*_sp_online/conf/splice.conf "$dir/model/ivector"
# acoustic model and feature config
cp exp/chain/tdnn1*_sp_online/conf/mfcc.conf "$dir/model"
cp exp/chain/tdnn1*_sp_online/final.mdl "$dir/model"
# decoding graph
cp exp/chain/tree_sp/graph_tgsmall/HCLG.fst "$dir/model"
cp exp/chain/tree_sp/graph_tgsmall/words.txt "$dir/model"
cp exp/chain/tree_sp/graph_tgsmall/phones/word_boundary.int "$dir/model"
```
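As a sanity check after running the copy script, something like the sketch below can confirm the export contains every file the script places into the model folder. The `touch`ed stand-in files exist only to make the sketch self-contained; in practice you would point `model` at the real export directory:

```shell
set -e
# Hypothetical export directory with stand-in files, so the check is runnable
model=$(mktemp -d)/model
mkdir -p "$model/ivector"
touch "$model/final.mdl" "$model/HCLG.fst" "$model/words.txt" \
      "$model/word_boundary.int" "$model/mfcc.conf"
touch "$model/ivector/final.dubm" "$model/ivector/final.ie" \
      "$model/ivector/final.mat" "$model/ivector/global_cmvn.stats" \
      "$model/ivector/online_cmvn.conf" "$model/ivector/splice_opts" \
      "$model/ivector/splice.conf"
# Every file the copy script above exports
missing=0
for f in final.mdl HCLG.fst words.txt word_boundary.int mfcc.conf \
         ivector/final.dubm ivector/final.ie ivector/final.mat \
         ivector/global_cmvn.stats ivector/online_cmvn.conf \
         ivector/splice_opts ivector/splice.conf; do
  [ -f "$model/$f" ] || { echo "missing: $f"; missing=1; }
done
[ "$missing" -eq 0 ] && echo "model folder looks complete"
```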
Thank you!
Great, I tried the trained model on your Android example app and it works. I'm excited to train the full model next :D Thanks for the help!
Ok, you can read https://alphacephei.com/nsh/2020/03/27/lookahead.html about Gr.fst
Hi, I've adapted the mini_librispeech run.sh to use my own dataset, but when it reaches our modified train line, `local/chain/tuning/run_tdnn_1j.sh`, it fails.
Here are some of the logs from `exp/chain/tdnn1j_sp/egs/log/shuffle.*.log`:
```
nnet3-chain-normalize-egs --normalization-fst-scale=1.0 exp/chain/tdnn1j_sp/normalization.fst 'ark:cat exp/chain/td$
nnet3-chain-shuffle-egs --srand=20 ark:- ark:exp/chain/tdnn1j_sp/egs/cegs.20.ark
ERROR: CompactFst write failed: <unknown>
ERROR (nnet3-chain-normalize-egs[5.5.851~1-088e9]:WriteToken():io-funcs.cc:141) Write failure in WriteToken.
bash: line 1: 14346 Aborted (core dumped) nnet3-chain-normalize-egs --normalization-fst-scale=1.0 exp/chain/tdnn1j_sp/normalization.fst "ark:cat exp/chain/tdnn1j_sp/egs/cegs_orig.1.20.ark exp/chain/tdnn1j_sp/egs/cegs_ors_orig.75.20.ark|" ark:-
14348 Killed | nnet3-chain-shuffle-egs --srand=$[20+0] ark:- ark:exp/chain/tdnn1j_sp/egs/cegs.20.ark
nnet3-chain-normalize-egs --normalization-fst-scale=1.0 exp/chain/tdnn1j_sp/normalization.fst 'ark:cat exp/chain/tdnn1j_sp/egs/cegs_orig.1.25.ark exp/chain/tdnn1j_sp/egs/cegs_orig.2.25.ark exp/chain/tdnn1j_sp/egs/cegs_orig.3.25.ark exp$
nnet3-chain-shuffle-egs --srand=25 ark:- ark:exp/chain/tdnn1j_sp/egs/cegs.25.ark
ERROR (nnet3-chain-normalize-egs[5.5.851~1-088e9]:Write():compressed-matrix.cc:563) Error writing compressed matrix to stream.
nnet3-chain-normalize-egs --normalization-fst-scale=1.0 exp/chain/tdnn1j_sp/normalization.fst 'ark:cat exp/chain/tdnn1j_sp/egs/cegs_orig.1.29.ark exp/chain/tdnn1j_sp/egs/cegs_orig.2.29.ark exp/chain/tdnn1j_sp/egs/cegs_orig.3.29.ark exp$
nnet3-chain-shuffle-egs --srand=29 ark:- ark:exp/chain/tdnn1j_sp/egs/cegs.29.ark
WARNING (nnet3-chain-normalize-egs[5.5.851~1-088e9]:main():nnet3-chain-normalize-egs.cc:84) For example sp1.1-061360_00840500361-96, FST was empty after composing with normalization FST. This should be extremely rare (a few per corpus, at most)
LOG (nnet3-chain-normalize-egs[5.5.851~1-088e9]:main():nnet3-chain-normalize-egs.cc:94) Added normalization to 21131 egs; had errors on 1
LOG (nnet3-chain-shuffle-egs[5.5.851~1-088e9]:main():nnet3-chain-shuffle-egs.cc:104) Shuffled order of 21131 neural-network training examples
```
Stages 1-8 of run.sh seem to complete without any problems, but the stage 9 training (where we modified the script) fails. Any ideas?
Most likely it is out of disk space.
Oh, wow. I had 250GB before I started this! I will check that 🤭
I tried freeing up some space and trying again. I started training with over 500GB of free storage and it still failed. I checked the free system space with `df` at different points while the script was running and there were still hundreds of GB free.
I don't think storage space was the cause of the failure.
Maybe something is wrong with that particular utterance, sp1.1-061360_00840500361-96. Did you check it, is it OK?
A core dump is something that is better reported on the Kaldi GitHub.
Thanks for the suggestion. The file in the training dataset is fine. You can download it here.
I have checked the file, and it has the same sampling rate, number of bits, and mono channel count as the rest of the dataset.
A number of errors like this appeared for other speech samples too (although the number of occurrences was small compared to the number of samples in the dataset).
Since this is a Kaldi issue rather than a Vosk one, I made a more detailed issue here. It includes my train script and a zip of all the logs.
Hi,
I'm getting a new error now during training:
run.sh logs:

```
steps/nnet3/chain/get_egs.sh: Finished preparing training examples
2020-12-30 04:40:38,883 [steps/nnet3/chain/train.py:428 - train - INFO ] Copying the properties from exp/chain/tdnn1j_sp/egs to exp/chain/tdnn1j_sp
2020-12-30 04:40:38,908 [steps/nnet3/chain/train.py:451 - train - INFO ] Preparing the initial acoustic model.
2020-12-30 04:40:41,200 [steps/nnet3/chain/train.py:485 - train - INFO ] Training will run for 20.0 epochs = 668 iterations
2020-12-30 04:40:41,204 [steps/nnet3/chain/train.py:529 - train - INFO ] Iter: 0/667 Jobs: 2 Epoch: 0.00/20.0 (0.0% complete) lr: 0.004000
run.pl: job failed, log is in exp/chain/tdnn1j_sp/log/compute_prob_train.0.log
run.pl: job failed, log is in exp/chain/tdnn1j_sp/log/compute_prob_valid.0.log
```

`exp/chain/tdnn1j_sp/log/compute_prob_train.0.log` logs:

```
ERROR (nnet3-chain-compute-prob[5.5.851~1-088e9]:AcceptInput():nnet-compute.cc:561) Num-cols mismatch for input 'ivector': 30 in computation-request, 100 provided.
```
Immediately the number 30 caught my attention; in your "Training your own model" documentation you say:

> Train ivector of dim 30 instead of standard 100 to save memory of mobile models.

Am I correct in thinking that the value `ivector_dim=100` in `steps/online/nnet2/train_ivector_extractor.sh` needs to be changed to `ivector_dim=30` to fix this error?
Edit: Yes, this did fix it :)
Thanks :)
Hi,
I've got a model trained successfully and it works, thanks for the help. I have a question.
When there are background noises, `<UNK>` is sometimes added to the transcription using my models, whereas it's not with the default model. Is this something I can disable?
I'd rather no text be added to the transcription if there are unusual or unrecognised noises.
Thanks :)
One more thing, my model size is much larger than yours.
Your bundled model is 50.5MB, my model is 224MB.
My model seems to perform fine, so there are no practical issues introduced by this size increase, but is this something I should expect to see? Obviously, I would rather the file size be smaller for use on a mobile app, but not at the expense of performance.
In addition, how many epochs do you think would be good for a dataset of this size? Someone on the Kaldi Help forum thought 20 was far too many.
Hello @OscarVanL, I'm trying to train a model with my own data as well to use in the Android app. I have prepared my audio data, which consists of public speeches. I split the audio files into chunks and prepared the `spk2gender`, `wav.scp`, `text`, `utt2spk`, and `corpus.txt` files of the data preparation part. I'm unsure what LM I could use to apply the mini_libri recipe. Did you use the LM at http://www.openslr.org/resources/11/ ? It will probably contain a lot of words that are not in my small dataset. How did you adapt run.sh to your own data? Thx in advance
I used the same language model as the original run.sh script, so yes, the one you linked.
It shouldn't matter if the LM has words that aren't in your dataset.
My changes were minimal. Obviously I changed the dataset names to my dataset names, and switched the train script at the end to run_tdnn_1j.
@OscarVanL may I ask where you got `run_tdnn_1j` from? I only found one in `egs/ami/local/tuning` :/
@OscarVanL Thanks for the help so far. How did you handle the missing files in `graph/` (`disambig_tid.int`, `Gr.fst`, and `HCLr.fst`)? Should `HCLG.fst` replace `HCLr.fst`? I have generated the `Gr.fst` as indicated in the Vosk Model Adaptation section before. Not sure if the files that @nshmyrev suggested are enough to try the model on Android. BTW, I didn't encounter the issue with `ivector_dim`, so I would suggest that other devs first try running `run_tdnn_1j` without the fix to 30. Thx
Running the script @nshmyrev posted in https://github.com/alphacep/vosk-android-demo/issues/110#issuecomment-743916144 gathered all of the files I needed to run the model on Android. All I needed to add was `model.conf` in the conf folder, which I copied from the example pre-trained model.
If you follow the mini_librispeech script with run_tdnn_1j as the training script, it should generate all the files you need.
Did you restructure the files after running the script @nshmyrev posted? I indeed don't have `HCLr.fst` or `Gr.fst` (which can be generated). I managed to find `disambig_tid.int` under `exp/tri3b/graph_tgsmall` (there is another one in `exp/chain/tree_sp/graph_tgsmall`)... I'm trying to replicate the same structure as in `model-android` :/
I didn't change anything except model.conf. I literally just copy-pasted the output into Android studio and changed the source code's reference to the model path and everything worked! :)
Here's a screenshot of all the files and their structures from Android studio.
I'm going to close this issue as I have successfully trained a model I am satisfied with. Thank you to @nshmyrev for your fantastic assistance, and for your excellent Android example. It has achieved everything I hoped! :)
@RoxanaTapia Please feel free to reply if you have any more questions, I'd be happy to provide my (limited) experience if I can be of any further assistance.
I just tried the trained model but it shows nothing :( not a single word. It just activates somehow... :(
Sorry to hear that :(
If there are no errors from Android perhaps it's a problem with the model you trained rather than the selection of files generated during training.
I'd suggest trawling through the Kaldi logs for any suspicious errors.
Thanks for caring. I really needed some results for my thesis; I guess I'm going to have to postpone the results until the next presentation :(
Here are some details of my experiment.
Data preparation:
- Converted the audio (PCM/16bit/8kHz/mono) using ffmpeg
- Prepared the mapping files (`spk2gender`, `spk2utt`, `text`, `utt2spk`, and `wav.scp`)
- Created a folder in `kaldi/egs` for my example and copied the `mini_librispeech` example into it
- Created `/corpus` and put the audio data there, with matching filepaths in `wav.scp`
- Created a `/data` directory and copied the mapping files for the training and test samples there

Running the mini_libri recipe:
- Pointed the recipe at my own `data`
- Added `--allow-upsample=true` to `mfcc.conf` because the data was 16-bit WAV
- Set `nj=1` (number of jobs: one); this is because the threads are split by speaker, I think
- Tried changing `ivector_dim` to 30, but it kept failing and expecting dim 100, which I could not change even after changing the value 100 to 30 in every place I found, so I kept 100
- Modified `cmd.sh` so it would run, because I don't have GridEngine
- Used `run_tdnn_1j` as suggested above

Android:
Things I'm suspicious about:
I'm gonna consult with my supervisor but I will really appreciate it if someone can help me find out what's wrong :)
I'm also finishing off a fourth-year group dissertation for my degree 👍 Your project sounds interesting.
Some similarities between our experiences:
I also did the same for cmd.sh.
I also did not use .flac files - FYI, if you look at the wav.scp file generated by the default mini_librispeech script, it includes commands to convert all of the .flac files into .wavs. So you should not be concerned about this, as the format Kaldi wants is .wav.
I also get warnings about too many silence pauses or not enough silence pauses. I think they're safe to ignore.
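On the .flac point above: `wav.scp` entries may be either plain paths or shell pipelines ending in `|`, in which case Kaldi runs the command and reads the decoded audio from the pipe. The flac entries mini_librispeech generates look roughly like the first line in this sketch (utterance ids and paths are made up):

```shell
# Two illustrative wav.scp entries: a decode-on-the-fly flac pipe and a
# plain path (utterance ids and paths are hypothetical).
scp='utt1 flac -c -d -s /corpus/utt1.flac |
utt2 /corpus/utt2.wav'
printf '%s\n' "$scp"
```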
Here's some points I'd look at:
I wonder if your approach of using AWS transcribe is giving reliable enough transcriptions. Are you sure the presidential speeches aren't already transcribed somewhere that you can scrape/download? (However, attaching these transcriptions to your 1-minute chunks would still be time-consuming.)
Maybe you could have a verification script that compares scraped speech transcripts and AWS transcribed text and searches for differences for you to manually repair.
If you run ./RESULTS in your `s5` folder, it will calculate the WER on your test subset. What error rate does your model give?
When you run `run.sh`, you say it takes 1 day, so I presume you are not monitoring the console output this whole time. If you are not already, you should write the output to a text file so you can read through the output and search for errors. For example, you might run `nohup ./run.sh &` to start training and write outputs to `nohup.out`.
Perhaps 1-minute chunks are too big? Datasets like LibriTTS usually have much smaller clips. (I doubt this is an issue, however).
How large is your dataset in hours? By the sound of it, your 300 chunks * 1 minute = just 5 hours? I trained my model on 110 hours + 110 hours augmented.
On this topic, it might be worth merging two datasets. For instance, you could take 100 hours of LibriTTS and merge this with your presidential speech dataset. This approach was very effective for us, as we were only able to compile 45-minutes of our speech dataset.
IMO if there are no obvious training errors my first focus would be on the dataset size. Is there any reason why you couldn't train with speeches + open data?
Furthermore, if you do go down this route of adding more data, you'll have to figure out the GPU situation. I'm sure your University provides shared compute solutions for students.
I found some issues in my mapping files. I'll come back soon :)
Do you mean the spk2gender, spk2utt, text, utt2spk, and wav.scp files?
If you haven't already, check this out. It has lots of detail for how to prepare these files.
Personally, I first had a few mistakes in text because of encoding errors (I had to use UTF-8 encoding when reading/writing with my Python script), and also because some transcriptions contained newlines (`\n`) which had to be stripped.
Also sorting is very important.
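Both of those pitfalls (sort order and mismatched utterance ids) can be caught before Kaldi trips over them. The sketch below uses tiny stand-in files in place of `data/train/text` and `data/train/wav.scp`; in a real setup, Kaldi's own `utils/validate_data_dir.sh` is the thorough version of this check:

```shell
set -e
export LC_ALL=C          # Kaldi requires C-locale sort order
# Tiny stand-ins for data/train/text and data/train/wav.scp
dir=$(mktemp -d)
printf 'utt1 hello world\nutt2 good morning\n' > "$dir/text"
printf 'utt1 /corpus/utt1.wav\nutt2 /corpus/utt2.wav\n' > "$dir/wav.scp"
sort -c -k1,1 "$dir/text"      # exits non-zero if not sorted
sort -c -k1,1 "$dir/wav.scp"
# The first column (utterance id) must be identical across the two files
diff <(cut -d' ' -f1 "$dir/text") <(cut -d' ' -f1 "$dir/wav.scp") \
  && echo "utterance ids match"
```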
I had the new line issue. I'm debugging now; I wonder if the `text` must match the audio file order in `wav.scp`.
Also, I found that `words.txt` has only irrelevant words... So I'm thinking of filtering the data from Librispeech to match what the politicians say somehow and enriching it with my data, as you suggested
Yes, the orders must match according to that page. In theory if each file has the same selection of utterance IDs, sorting by this field should give the same sorted order.
I just checked and they are sorted. The new line was definitely a big issue... I'll train again
Did that fix it?
Hi, sorry for the delay. I repaired the bug with the new line and tried again with one speaker (small set)... It didn't fix the error. Some metrics I got from my data make me think that it might be an error with the length of some sentences (chunk transcriptions). Here are the metrics (of my entire dataset):
I think I should do:
I also need to check that things are OK in Android, so right now I'm going to try to run the mini_librispeech example and check if I can get it working in Android at all. I'm using `kaldi-android5.2` with the vosk android demo... There are other things you suggested I also need to check. Thanks for the tip with `nohup ./run.sh &`, I will use it next time.
So, basically, I'm gonna try to run the model on android with open-data and if it works it means that my data needs to be cleaned.
RESULTS: the names of my datasets are not matching! :O Maybe that's the error, I will train again
It looks like you have a bit of a messy dataset, some sentences have 1 word, some have 1505 words?!? None of my utterances were as long as 14 minutes.
As for training your own language model, that's beyond my knowledge. Maybe the Kaldi Help forum could help.
I would suggest checking you've got the fundamentals right. Is your dataset standardised into a single format, is it consistent, is it properly labelled, and is Kaldi training without any glaring errors?
As for the RESULTS problem, make sure your training and testing dataset names are correctly set in `run_tdnn_1j.sh`, `local/nnet3/run_ivector_common.sh`, and `run.sh`.
Yep, I was assuming too much while preparing the data. But I found a nice discussion on how I should proceed with preparing the data.
For now, some utterance/sentence metrics that could be useful (from the test set `dev-clean-2` and the training set `train-clean-5` included in `mini_librispeech`):
Duration (seconds):
- Mean: 10.141
- Min: 1.505
- Max: 31.7
Size:
- Mean: 158K
- Min: 23K
- Max: 495K
Words:
- Mean: 28
- Min: 1
- Max: 88
Vocabulary size: 9138 words.
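Duration stats like the ones above can be pulled straight out of Kaldi's `utt2dur` file (which `utils/data/get_utt2dur.sh` generates) with a little awk. The three-line `utt2dur` below is a made-up stand-in:

```shell
set -e
# Made-up utt2dur stand-in: "<utt-id> <duration-in-seconds>" per line
f=$(mktemp)
printf 'utt1 10.5\nutt2 1.5\nutt3 31.7\n' > "$f"
# Accumulate the sum and track the min/max over column 2
awk '{s+=$2; if(min==""||$2<min)min=$2; if($2>max)max=$2}
     END{printf "mean=%.3f min=%.3f max=%.3f\n", s/NR, min, max}' "$f"
```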
Hi, I cleaned my data using create_uniform_segments and the Android App is still not showing any results
Metrics about my data:
- 1 speaker
- Speaker says 413 sentences
- 413 sentences/utterances in 413 chunk WAV files
- 24703 words. Per sentence words: Min: 1, Max: 102, AVG: 59.814
- Chunk Duration: Total 3.427 hours. Per sentence duration: Min: 0.242 minutes, Max: 0.5 minutes, AVG: 0.498 minutes
My model is 236.1 MB in size (I had to add `org.gradle.jvmargs=-Xmx4096m` to `gradle.properties`).
My RESULTS script output:
```
%WER 55.29 [ 3195 / 5779, 178 ins, 1445 del, 1572 sub ] exp/tri3b/decode_tglarge_test/wer_13_1.0
%WER 60.61 [ 660 / 1089, 45 ins, 279 del, 336 sub ] [PARTIAL] exp/tri3b/decode_tgmed_test/wer_11_0.5
%WER 58.35 [ 3372 / 5779, 165 ins, 1559 del, 1648 sub ] exp/tri3b/decode_tgsmall_test/wer_16_0.0
%WER 68.97 [ 3986 / 5779, 174 ins, 1884 del, 1928 sub ] exp/tri3b/decode_tgsmall_test.si/wer_11_0.0
%WER 36.37 [ 2102 / 5779, 271 ins, 261 del, 1570 sub ] exp/chain/tdnn1j_sp/decode_tglarge_test/wer_8_1.0
%WER 39.99 [ 2311 / 5779, 275 ins, 301 del, 1735 sub ] exp/chain/tdnn1j_sp/decode_tgsmall_test/wer_10_0.0
%WER 37.00 [ 2138 / 5779, 253 ins, 296 del, 1589 sub ] exp/chain/tdnn1j_sp_online/decode_tglarge_test/wer_9_1.0
%WER 41.17 [ 2379 / 5779, 206 ins, 423 del, 1750 sub ] exp/chain/tdnn1j_sp_online/decode_tgsmall_test/wer_10_0.5
```
I'm running out of ideas :( Did you get anything showing in the Android App @OscarVanL?
Things I need to try:
It seems like I get some results, but nothing shows in the app with the files proposed by @nshmyrev that you show above
It seems like your model has failed to converge, or overfitted the data.
I don't see that there would be any benefit to downsampling to 8kHz; this script was built for use with 16kHz. Are your source files only available at 8kHz? Maybe the parameters within this model are tuned for 16kHz.
For the Gr.fst and words.txt, I recommend using a pre-trained 3-gram language model and vocabulary. The mini_librispeech script should download and prepare this for you and uses this language model. I did not even touch the language model, and I suggest you leave it alone too as my gut says your problems lie elsewhere.
Your approach is at odds with the advice @nshmyrev gave me at the beginning of this thread, to use 200-300 hours of speech data to train. (Note: You need to turn down the number of epochs in run_tdnn_1j if you use this much data, or it will take forever to train).
It seems your dataset is two orders of magnitude too small, if that one speaker is the only data you're training on.
Hello, I trained the models again with only filtered mini-libri-speech data. The app still doesn't show anything, not a single word...
Here are my RESULTS:
```
%WER 12.93 [ 1540 / 11910, 188 ins, 181 del, 1171 sub ] exp/tri3b/decode_tglarge_dev_clean_2/wer_17_0.0
%WER 15.40 [ 1834 / 11910, 180 ins, 266 del, 1388 sub ] exp/tri3b/decode_tgmed_dev_clean_2/wer_17_0.0
%WER 16.94 [ 2018 / 11910, 183 ins, 316 del, 1519 sub ] exp/tri3b/decode_tgsmall_dev_clean_2/wer_17_0.0
%WER 23.36 [ 2782 / 11910, 250 ins, 432 del, 2100 sub ] exp/tri3b/decode_tgsmall_dev_clean_2.si/wer_17_0.0
%WER 7.52 [ 896 / 11910, 110 ins, 92 del, 694 sub ] exp/chain/tdnn1j_sp/decode_tglarge_dev_clean_2/wer_11_0.5
%WER 10.96 [ 1305 / 11910, 136 ins, 157 del, 1012 sub ] exp/chain/tdnn1j_sp/decode_tgsmall_dev_clean_2/wer_10_0.0
%WER 7.55 [ 899 / 11910, 107 ins, 87 del, 705 sub ] exp/chain/tdnn1j_sp_online/decode_tglarge_dev_clean_2/wer_10_0.5
%WER 10.98 [ 1308 / 11910, 135 ins, 158 del, 1015 sub ] exp/chain/tdnn1j_sp_online/decode_tgsmall_dev_clean_2/wer_10_0.0
```
I'm quite lost at this point. @OscarVanL could it be possible that you share your model just to make sure that my app works at all? I would really appreciate it
Have you tried just cloning this repo, changing nothing and using it with the built-in pre-trained model? Have you considered using a different Android phone?
Yes, I did that before starting the models, and now again. It's better in the sense that it transcribes nearly in real time, but the contents make little sense, e.g. the word "and" gets transcribed as "aws", so it's not really legible. I would estimate a WER of <50% ... The model also looks quite different from the one generated here... The new model only outputs the word "now" or so after listening to 2 mins of audio... Other than training my own language model, IDK what else I could do. I'm afraid to invest more time in training the language model and not get any useful results afterwards.
PS: I haven't checked with another phone. CPU HiSilicon Kirin 659, 4GB memory
I found the built-in example made loads of mistakes for me too with my British accent, my own trained model performed much better.
I think the models you train are very sensitive to accents, LibriTTS/LibriSpeech predominantly consist of American speakers. I created a subset of LibriTTS containing only British speakers for this reason.
When training with a 50/50 split of British and American accents it then recognised my British accent very well (I would have trained entirely on British accents, but there is not enough data).
I don't know what your accent is, but are you testing it with your own speech, or speech similar to that which you are training on?
Hi, firstly, thank you for this demo, it works very well.
I hope to create an ASR model for a type of non-typical speech: British-accented patients who speak using an electrolarynx.
My plan was to do the following:
I have been exploring the mini_librispeech Kaldi example, which you say is the proper way to train a compatible model for Vosk, but I am not sure how to tune an existing model.
Do you have any recommendations or scripts for tuning an existing Kaldi model so that it is compatible with Vosk?
Thank you!