Open nemtiax opened 1 year ago
@nemtiax Hi! Did you manage to fix the performance? I have exactly the same issue. With the default config it improves WER but makes KaldiAG unusable for some reason. The utterances get truncated at the end as well, just like you described:
```
Ref: precision thirty five sixty six
Hyp: precision
```
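For anyone comparing their own output: a truncated hypothesis like this shows up purely as deletions in the WER breakdown. A minimal word-level edit-distance sketch (illustrative only, not the scorer Kaldi or test_model.py actually uses) makes that concrete:

```python
def wer_counts(ref, hyp):
    """Word-level Levenshtein alignment.

    Returns (errors, substitutions, deletions, insertions) for
    hypothesis `hyp` scored against reference `ref`.
    """
    r, h = ref.split(), hyp.split()
    # dp[i][j] = (errors, S, D, I) aligning r[:i] against h[:j]
    dp = [[None] * (len(h) + 1) for _ in range(len(r) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(r) + 1):
        dp[i][0] = (i, 0, i, 0)          # reference words left over -> deletions
    for j in range(1, len(h) + 1):
        dp[0][j] = (j, 0, 0, j)          # hypothesis words left over -> insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            if r[i - 1] == h[j - 1]:
                diag = dp[i - 1][j - 1]  # match, no cost
            else:
                e, s, d, ins = dp[i - 1][j - 1]
                diag = (e + 1, s + 1, d, ins)  # substitution
            e, s, d, ins = dp[i - 1][j]
            dele = (e + 1, s, d + 1, ins)      # deletion
            e, s, d, ins = dp[i][j - 1]
            insr = (e + 1, s, d, ins + 1)      # insertion
            dp[i][j] = min(diag, dele, insr)
    return dp[-1][-1]

errs, S, D, I = wer_counts("precision thirty five sixty six", "precision")
print(errs, S, D, I)  # 4 0 4 0 -> WER = 4/5 = 80%, all deletions
```

So in this example the scorer charges four deletions and no substitutions or insertions, which matches the deletion-heavy pattern described below.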
I'm interested in using Kaldi to recognize aircraft tailsigns. I used your speech-training-recorder utility to record 600 samples of myself speaking a tailsign, and then used those to run fine-tuning starting from kaldi_model_daanzu_20200905_1ep-mediumlm-base. Each sample is 2-5 seconds long, and contains 4-8 words.
Here is the performance of the base model on the training set before finetuning (as measured by test_model.py):
```
Overall -> 28.16 % +/- 1.65 % N=2841 C=2120 S=543 D=178 I=79
```
And after finetuning:
```
Overall -> 23.69 % +/- 1.56 % N=2841 C=2190 S=180 D=471 I=22
```
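For readers decoding these lines: N is reference words, C correct, S substitutions, D deletions, I insertions. Assuming test_model.py uses the standard definition WER = (S + D + I) / N, both reported figures check out:

```python
def wer(n, s, d, i):
    """Standard word error rate, as a percentage."""
    return 100.0 * (s + d + i) / n

print(round(wer(2841, 543, 178, 79), 2))  # 28.16 (base model)
print(round(wer(2841, 180, 471, 22), 2))  # 23.69 (fine-tuned)
```

Note how the error mix flips between the two runs: substitutions drop from 543 to 180, but deletions triple from 178 to 471.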
Note again that these statistics are computed on the training set, not a held-out test set, so I hoped to see significant improvement. While the top-line WER is improved, the nature of the errors has changed significantly: the original model makes many substitution errors, which are largely homophones (two -> to, etc.), while the new model makes mostly deletions. For example, here are some transcriptions by the new model:
I'm curious whether you have any suggestions for what might be going wrong. I saw in the fine-tuning script this note:
I left it as is, with the 50 included, but I'm not sure whether my dataset counts as "short" or if that refers to something with only one or two words per utterance.
I also noticed this note in the instructions:
I did not use this, and as far as I know, did not encounter an error. Is this a parameter I should try tuning even if I'm not getting an error?
Finally, I'm curious to understand what part of any performance changes might be attributable to the updated acoustic model vs. what would be attributable to changes in the language model. I saw that `compile_agf_dictation_graph` seems to do some work to build a new Dictation.fst - does this incorporate statistics about my training corpus? Is it possible to use the original Dictation.fst and just drop in my new acoustic model to test where the errors might be coming from, or is that going to cause issues of its own?

Thanks!