Open nemtiax opened 1 year ago
@nemtiax Hi! Did you manage to fix the performance? I have exactly the same issue. With the default config it improves WER but makes KaldiAG unusable for some reason. The utterances get truncated at the end as well, just like you described:
```
Ref: precision thirty five sixty six
Hyp: precision
```
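For anyone comparing their own output: a truncated hypothesis like this shows up purely as deletions in the WER breakdown. A minimal word-level edit-distance sketch (illustrative only, not the scorer Kaldi or test_model.py actually uses) makes that concrete:

```python
def wer_counts(ref, hyp):
    """Word-level Levenshtein alignment.

    Returns (errors, substitutions, deletions, insertions) for
    hypothesis `hyp` scored against reference `ref`.
    """
    r, h = ref.split(), hyp.split()
    # dp[i][j] = (errors, S, D, I) aligning r[:i] against h[:j]
    dp = [[None] * (len(h) + 1) for _ in range(len(r) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(r) + 1):
        dp[i][0] = (i, 0, i, 0)          # reference words left over -> deletions
    for j in range(1, len(h) + 1):
        dp[0][j] = (j, 0, 0, j)          # hypothesis words left over -> insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            if r[i - 1] == h[j - 1]:
                diag = dp[i - 1][j - 1]  # match, no cost
            else:
                e, s, d, ins = dp[i - 1][j - 1]
                diag = (e + 1, s + 1, d, ins)  # substitution
            e, s, d, ins = dp[i - 1][j]
            dele = (e + 1, s, d + 1, ins)      # deletion
            e, s, d, ins = dp[i][j - 1]
            insr = (e + 1, s, d, ins + 1)      # insertion
            dp[i][j] = min(diag, dele, insr)
    return dp[-1][-1]

errs, S, D, I = wer_counts("precision thirty five sixty six", "precision")
print(errs, S, D, I)  # 4 0 4 0 -> WER = 4/5 = 80%, all deletions
```

So in this example the scorer charges four deletions and no substitutions or insertions, which matches the deletion-heavy pattern described below.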
I'm interested in using Kaldi to recognize aircraft tailsigns. I used your speech-training-recorder utility to record 600 samples of myself speaking a tailsign, and then used those to run fine-tuning starting from kaldi_model_daanzu_20200905_1ep-mediumlm-base. Each sample is 2-5 seconds long, and contains 4-8 words.
Here is the performance of the base model on the training set before finetuning (as measured by test_model.py):
```
Overall -> 28.16 % +/- 1.65 % N=2841 C=2120 S=543 D=178 I=79
```
And after finetuning:
```
Overall -> 23.69 % +/- 1.56 % N=2841 C=2190 S=180 D=471 I=22
```
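For readers decoding these lines: N is reference words, C correct, S substitutions, D deletions, I insertions. Assuming test_model.py uses the standard definition WER = (S + D + I) / N, both reported figures check out:

```python
def wer(n, s, d, i):
    """Standard word error rate, as a percentage."""
    return 100.0 * (s + d + i) / n

print(round(wer(2841, 543, 178, 79), 2))  # 28.16 (base model)
print(round(wer(2841, 180, 471, 22), 2))  # 23.69 (fine-tuned)
```

Note how the error mix flips between the two runs: substitutions drop from 543 to 180, but deletions triple from 178 to 471.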
Note again that these statistics are computed on the training set, not a held-out test set, so I hoped to see significant improvement. While the top-line WER is improved, the nature of the errors has changed significantly: the original model makes many substitution errors, which are largely homophones (two -> to, etc.), while the new model makes mostly deletions. For example, here are some transcriptions by the new model:
I'm curious whether you have any suggestions for what might be going wrong. I saw in the fine-tuning script this note:
I left it as is, with the 50 included, but I'm not sure whether my dataset counts as "short" or if that refers to something with only one or two words per utterance.
I also noticed this note in the instructions:
I did not use this, and as far as I know, did not encounter an error. Is this a parameter I should try tuning even if I'm not getting an error?
Finally, I'm curious to understand what part of any performance changes might be attributable to the updated acoustic model vs. what would be attributable to changes in the language model. I saw that `compile_agf_dictation_graph` seems to do some work to build a new Dictation.fst - does this incorporate statistics about my training corpus? Is it possible to use the original Dictation.fst and just drop in my new acoustic model to test where the errors might be coming from, or is that going to cause issues of its own?

Thanks!