daanzu / kaldi_ag_training

Docker image and scripts for training finetuned or completely personal Kaldi speech models. Particularly for use with kaldi-active-grammar.
GNU Affero General Public License v3.0

Training Issues #2

Closed bluecamel closed 3 years ago

bluecamel commented 3 years ago

Hi, @daanzu. I've managed to get past a few hurdles and made it to stage 9. I figured one issue would be easier than a few, but I'm also happy to break this up.

convert_tsv_to_scp.py

The notes say that passing the lexicon file will filter out the utterances that include words not in the lexicon. It does seem to do that, but it filters everything out, so I'm not sure what I've done wrong. Here's the output of adding a debug statement when skipping: lexicon_filter.txt

For now, I've moved on without passing the lexicon file to filter.
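My rough guess at what the filter is doing (a hypothetical sketch, not the actual script), which would also explain everything vanishing if the transcript words don't exactly match the lexicon entries:

# Hypothetical sketch of the lexicon filter, for illustration only.
lexicon_words = set()
with open("kaldi_model_daanzu_20200905_1ep-mediumlm-base/dict/lexicon.txt") as f:
    for line in f:
        lexicon_words.add(line.split()[0])  # first column is the word

def keep(transcript):
    # An utterance survives only if every word has a lexicon entry;
    # "Hello," != "hello", so unsanitized transcripts would all get dropped.
    return all(word in lexicon_words for word in transcript.split())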

Path to audio data

I have kaldi_ag_training and speech-training-recorder checked out in the same directory. When following the recorder instructions, the audio data ends up in a directory named audio_data in that same directory. The script was looking for it in /mnt/input/audio_data, which wasn't there.

The symptom is that the script will fail on step 1 because dataset/text is empty, which in turn is because the audio data wasn't found. Anyhow, the workaround was easy by simply adding another mount to the docker command:

-v "$(pwd)/audio_data:/mnt/audio_data"

I'm happy to make a PR with that added to the docs, but wanted to check if maybe I was just missing something else first.

Stage 4 - files missing from extractor directory

This was looking for final.ie, final.dubm, and final.mat in /mnt/input/extractor, but they weren't there. I poked around to see if a previous stage was supposed to create them, but found files with those names in the base model. So, I copied them from the base model to the extractor directory before stage 4 runs:

cp /mnt/input/kaldi_model_daanzu_20200905_1ep-mediumlm-base/ivector_extractor/final.ie ${extractor_dir}/
cp /mnt/input/kaldi_model_daanzu_20200905_1ep-mediumlm-base/ivector_extractor/final.dubm ${extractor_dir}/
cp /mnt/input/kaldi_model_daanzu_20200905_1ep-mediumlm-base/ivector_extractor/final.mat ${extractor_dir}/

I wonder if maybe extractor_dir should instead be set to /mnt/input/kaldi_model_daanzu_20200905_1ep-mediumlm-base/ivector_extractor?

Stage 9 - steps/nnet3/chain/get_egs.sh: Number of utterances is very small. Please check your data.

This is where I've stopped for the night, so I haven't dug into it yet, but will probably continue tomorrow. Output with the last stacktrace included, fwiw: output.txt

daanzu commented 3 years ago

@bluecamel Thanks!

convert_tsv_to_scp.py

Oops, I had expected the transcript text to already be lower cased and free of any punctuation. I pushed a fix to sanitize the text during conversion.
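The sanitization amounts to something like this (a sketch, not the exact code in the commit; digits would need their own handling):

import re

def sanitize(text):
    # Lowercase and strip punctuation so transcript words can match
    # lexicon entries, which are lowercase with no punctuation.
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)  # keep letters, apostrophes, spaces
    return " ".join(text.split())           # collapse runs of whitespace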

Path to audio data

Yes, that is actually how I have it set up for myself, but I was trying to simplify things for folks by keeping everything operating out of one directory. My setup is quite byzantine, and I don't wish it on anyone. I missed changing that spot. My hope was to avoid complications like rebasing the wav file paths, but I'm not sure of the best option. What is your opinion? Either way, the conversion script should probably check for the wav files, to make the error easy and obvious.

Stage 4

Oops, I had accidentally left a line commented out in the front end script from some testing I was doing. I pushed a fix. (Helpfully, some of this is uncovering a few things I needed to fix anyway!)

Stage 9

Yeah, unfortunately Kaldi does expect a certain number of utterances for training. I haven't done much to explore the floor in the number. However, I did push a fix that may help by reverting a change in the default parameters. Give that a try, and let me know.

bluecamel commented 3 years ago

@bluecamel Thanks!

All thanks to you for kaldi :pray:

convert_tsv_to_scp.py

Oops, I had expected the transcript text to already be lower cased and free of any punctuation. I pushed a fix to sanitize the text during conversion.

Ah, I didn't even think of that, but makes perfect sense. Thanks for the fix!

Path to audio data

Yes, that is actually how I have it set up for myself, but I was trying to simplify things for folks by keeping everything operating out of one directory. My setup is quite byzantine, and I don't wish it on anyone. I missed changing that spot. My hope was to avoid complications like rebasing the wav file paths, but I'm not sure of the best option. What is your opinion?

My first intuition is to separate everything, but a good chunk of that is not knowing if all steps are idempotent and if/when I might need to clean up before running again. Even if that was guaranteed to be handled, I'd still probably want to separate things, though I tend towards crippling fastidiousness at times. :laughing:

I actually started doing this last night, but then stepped aside when I wasn't sure if the issues I was hitting were due to my changes or something else. Now that I know that they weren't related to my changes, I should be able to revive/finish that and share if you're interested.

The basic idea was to have a directory structure like this:

kaldi_ag_custom
kaldi_ag_custom/kaldi_ag_training
kaldi_ag_custom/kaldi_ag_training/scripts/{convert_tsv_to_scp.py,export_trained_model.py,fix_data_dir.sh,run.finetune.sh,...}
kaldi_ag_custom/speech-training-recorder
kaldi_ag_custom/audio_data
kaldi_ag_custom/audio_data_scp
kaldi_ag_custom/kaldi_model_daanzu_20200905_1ep-mediumlm-base
kaldi_ag_custom/train
kaldi_ag_custom/train/{conf,data,exp,extractor,steps,utils}

kaldi_ag_custom/kaldi_ag_training/scripts were copied to the container on build.

My thought was that everything except for kaldi_ag_custom/train (and maybe something like kaldi_ag_custom/model for the final output, though I haven't made it to that point yet, so not sure what that looks like) would be treated as read-only input.

My tendency with things like this is to make all of these paths variable but with sane defaults, so that someone could come along, make recordings, and run the script without thinking about it. It seems super close to that already.

That way, you could easily start over by simply creating kaldi_ag_custom/train_b and passing that path to run.*.sh. Of course, for all I know, all of those concerns are silly and nothing needs to change. :)

Either way, the conversion script should probably check for the wav files, to make the error easy and obvious.

:+1:

Stage 4

Oops, I had accidentally left a line commented out in the front end script from some testing I was doing. I pushed a fix. (Helpfully, some of this is uncovering a few things I needed to fix anyway!)

Thanks! I hope that it's more helpful than annoying :smile: I really appreciate all of this and hope that I can help.

Stage 9

Yeah, unfortunately Kaldi does expect a certain number of utterances for training. I haven't done much to explore the floor in the number. However, I did push a fix that may help by reverting a change in the default parameters. Give that a try, and let me know.

lol, I'm feeling really stupid right now. I got a new microphone and decided to record everything again. When I changed the recorder to order the prompts, my intention was to record every single prompt. The only problem was that I forgot to set the prompt count, so I only recorded 300 total. I remember thinking that I finished quickly, but I fooled myself in my excitement to get training going. Haha, I'm going to sit down and record all of the rest now before I get back to training again, but I'll let you know how it goes when I get there.

Thanks so much again for all of this!

daanzu commented 3 years ago

Path to audio data

My original setup was somewhat similar, with a separate work directory that I could swap in and out for different tests. But it was fairly ad hoc and not particularly designed.

All the training stages should be idempotent. However, having separate work directories for personal versus fine tuned training would probably be a good idea, because the directory structure is somewhat different and completely incompatible between them. And it would also make training different data sets easier without having to delete your previous training stages if you don't want to yet.

I would say feel free to try making a directory structure, especially one that makes sense to more of an end user than me πŸ˜‰ . Thanks for the help!

Stage 4

Definitely helpful, especially for finding things I overlooked!

Stage 9

I think the training will want 300 utterances to use as a validation set, and everything after that will be used for training proper. But I don't remember for sure. And maybe adjusting that number lower would be ok; I've never tried that.

bluecamel commented 3 years ago

Path to audio data

My original setup was somewhat similar, with a separate work directory that I could swap in and out for different tests. But it was fairly ad hoc and not particularly designed.

All the training stages should be idempotent. However, having separate work directories for personal versus fine tuned training would probably be a good idea, because the directory structure is somewhat different and completely incompatible between them. And it would also make training different data sets easier without having to delete your previous training stages if you don't want to yet.

I would say feel free to try making a directory structure, especially one that makes sense to more of an end user than me. Thanks for the help!

Sounds good! I'll share what I do when I get there and see what you think. It may be a couple of days before I get recordings done and get back to it.

Stage 4

Definitely helpful, especially for finding things I overlooked!

:smile:

Stage 9

I think the training will want 300 utterances to use as a validation set, and everything after that will be used for training proper. But I don't remember for sure. And maybe adjusting that number lower would be ok; I've never tried that.

Gotcha. I may go ahead and kick off something tomorrow with what I've got and see what it does. Would you think that common rules of thumb from other NNs make sense here as well (such as 80/20 or 70/30 for training/validation)? Though, if I use all of the prompts included with the speech recorder, I guess that's closer to 90/10. :thinking: Heh, I'll play around!

kendonB commented 3 years ago

I still haven't been able to get past stage 1 despite adding a second mount command to the docker call and even duplicating the dataset and audio_data folders.

I'm getting this error

# Stage 1
# Wed Aug 18 23:04:59 UTC 2021

fix_data_dir.sh: no utterances remained: not proceeding further.

Could it be because my training data only has about 300 utterances? What's the minimum?

bluecamel commented 3 years ago

I still haven't been able to get past stage 1 despite adding a second mount command to the docker call and even duplicating the dataset and audio_data folders.

I'm getting this error

# Stage 1
# Wed Aug 18 23:04:59 UTC 2021

fix_data_dir.sh: no utterances remained: not proceeding further.

Could it be because my training data only has about 300 utterances? What's the minimum?

My memory could be fuzzy, but I'm pretty sure that this is what I saw when it couldn't find the audio data. Is dataset/text empty?

If I'm not mistaken, it's going to depend on two things:

What are the paths like in audio_data/recorder.tsv? If they look like ../audio_data/recorder_2021-08-14_21-48-07_873960.wav, then your audio data will need to be mounted in the same parent directory as the input directory (e.g. -v "$(pwd)/audio_data:/mnt/audio_data" -v $(pwd):/mnt/input)
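A quick way to check (a sketch, assuming the wav path is the first tab-separated field of recorder.tsv; adjust if yours differs):

import csv, os, sys

# Sanity check: do the wav paths in recorder.tsv resolve from here?
with open("audio_data/recorder.tsv", newline="") as f:
    for row in csv.reader(f, delimiter="\t"):
        if row and not os.path.exists(row[0]):
            sys.exit("missing wav (check mounts/cwd): " + row[0])
print("all wav paths resolve")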

kendonB commented 3 years ago
-rw-rw-r--  1 kendonb kendonb    0 Aug 19 10:48 text
-rw-rw-r--  1 kendonb kendonb    0 Aug 19 10:48 utt2spk
-rw-rw-r--  1 kendonb kendonb    0 Aug 19 10:48 wav.scp

Indeed, all three files here are empty! So there needs to be a check for that in convert_tsv_to_scp.py @daanzu
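Something like this at the end of the script would do it (a sketch; kept/skipped stand in for counters the script would track):

import sys

def check_nonempty(kept, skipped):
    # Fail loudly instead of writing empty text/utt2spk/wav.scp files,
    # which otherwise only surface later as the cryptic stage-1 error.
    if kept == 0:
        sys.exit("ERROR: 0 utterances kept (%d skipped); "
                 "check wav paths and lexicon filtering." % skipped)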

kendonB commented 3 years ago

Still struggling to fiddle around with this to get it to work. @bluecamel you wouldn't be able to post a complete list of commands needed to make this work? Even starting from the git clones?

e.g. like this but fixed

# Record training data
git clone https://github.com/daanzu/speech-training-recorder
cd speech-training-recorder
mkdir ../audio_data
pip install -r requirements.txt
python3 recorder.py -p prompts/timit.txt
# Record X utterances

# Set up training repo
cd ..
git clone https://github.com/daanzu/kaldi_ag_training
cd kaldi_ag_training
wget https://github.com/daanzu/kaldi_ag_training/releases/download/v0.1.0/kaldi_model_daanzu_20200905_1ep-mediumlm-base.zip
unzip kaldi_model_daanzu_20200905_1ep-mediumlm-base
cp -r ../audio_data audio_data

# may be required to get nvidia docker working (run all 3 with sudo)
# apt install -y nvidia-docker2
# systemctl daemon-reload
# systemctl restart docker

# train

###*** THIS FAILS FOR ME ***### results in an empty dataset
python3 convert_tsv_to_scp.py -l kaldi_model_daanzu_20200905_1ep-mediumlm-base/dict/lexicon.txt audio_data/recorder.tsv dataset 

# This runs but stops at stage 1
docker run -it --rm -v "$(pwd)/audio_data:/mnt/audio_data" -v $(pwd):/mnt/input -w /mnt/input --user "$(id -u):$(id -g)" --runtime=nvidia daanzu/kaldi_ag_training_gpu bash run.personal.sh kaldi_model_daanzu_20200905_1ep-mediumlm-base dataset

bluecamel commented 3 years ago

Hey, @kendonB. I have something like that in progress, with the goal of making it easier. It's more notes for me than instructions, but I think it should get you going for now.

It's very light on details yet excessive with variables, but if you look closely at all of the paths, it hopefully will show you where things need to be.

A couple of notes:

kendonB commented 3 years ago

So your instructions are different to those in the repo here:

python3 convert_tsv_to_scp.py "${KAGT_AUDIO_DATA}/recorder.tsv" "${KAGT_TRAINING_DATASET}"
WARNING: No lexicon file specified.
Wrote training dataset to /home/kendonb/kaldi_ag_custom/kaldi_ag_training/dataset

By running it as per the repo README:

python3 convert_tsv_to_scp.py -l kaldi_model_daanzu_20200905_1ep-mediumlm-base/dict/lexicon.txt "${KAGT_AUDIO_DATA}/recorder.tsv" "${KAGT_TRAINING_DATASET}"

we end up with empty files:

wc -l "${KAGT_TRAINING_DATASET}/text"
0 /home/kendonb/kaldi_ag_custom/kaldi_ag_training/dataset/text

By omitting the lexicon file we end up with a nonempty dataset. cc @daanzu

bluecamel commented 3 years ago

So your instructions are different to those in the repo here:

Yeah, that's one of the issues that I mentioned in the first post here. @daanzu said that he pushed a fix, but it didn't work for me and I didn't dig in yet. I'm working off of the assumption that the included prompts wouldn't include any words that aren't in the lexicon. If that's not true, I'd suggest that those are just removed from the prompts files.

kendonB commented 3 years ago

When I run:

docker run -it --rm -v "${KAGT_TRAINING}:/mnt/input" -v "${KAGT_AUDIO_DATA}:/mnt/audio_data" -v "${KAGT_HOME}/${KAGT_MODEL_BASE}:/mnt/kaldi_model_base" -w /mnt/input --user "$(id -u):$(id -g)" --runtime=nvidia daanzu/kaldi_ag_training_gpu bash run.finetune.sh /mnt/kaldi_model_base dataset

I get:

+ nice_cmd='nice ionice -c idle'
+ [[ 2 -ge 2 ]]
+ model=/mnt/input//mnt/kaldi_model_base
+ shift
+ dataset=/mnt/input/dataset
+ shift
+ [[ -d /mnt/input//mnt/kaldi_model_base ]]
+ exit 1

So it looks like the script prepends /mnt/input/ to whatever model path you pass (note model=/mnt/input//mnt/kaldi_model_base above), so an absolute path fails the directory check and it exits.

kendonB commented 3 years ago

I got it to start by just omitting the last two mount options, and got this:

# Stage 1
# Fri Aug 20 01:48:00 UTC 2021

fix_data_dir.sh: kept all 319 utterances.
fix_data_dir.sh: old files are kept in data/finetune/.backup
steps/make_mfcc.sh --cmd utils/run.pl --nj 12 --mfcc-config conf/mfcc.conf data/finetune exp/make_mfcc_chain/finetune.log exp/make_mfcc_chain
utils/validate_data_dir.sh: Successfully validated data-directory data/finetune
steps/make_mfcc.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
run.pl: 12 / 12 failed, log is in exp/make_mfcc_chain/finetune.log/make_mfcc_finetune.*.log
Failed to open file ../audio_data/recorder_2021-08-19_10-41-21_793828.wav

The log contains this ^ so it should be sorted once I figure out the right mounting command.

bluecamel commented 3 years ago

Oops, I thought I'd updated that. I'm afk for a bit, but I updated the command and will double check when I get back.

bluecamel commented 3 years ago

Okay, just confirmed that it runs like that. I'm really curious what you get. So far I've got a model that doesn't detect anything, but I also only have ~1600 recordings right now.

kendonB commented 3 years ago

When you say "like that", what are you referring to? What's the exact command that works for you?

bluecamel commented 3 years ago

I'm referring to the updated command in the instructions linked earlier.

daanzu commented 3 years ago

@kendonB Sorry for the trouble. I should have a chance to test a bit more tonight. I think I will add a bit of logic to just search and find the wave files and fix up the paths to eliminate this as an issue.
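Roughly what I have in mind (just a sketch):

import os

def find_wav(path, search_root="/mnt"):
    # Resolve a possibly-stale relative path from recorder.tsv by
    # searching for the wav's basename anywhere under search_root.
    target = os.path.basename(path)
    for dirpath, _, files in os.walk(search_root):
        if target in files:
            return os.path.join(dirpath, target)
    return None  # caller can then skip or report the utterance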

@bluecamel

Would you think that common rules of thumb from other NNs make sense here as well (such as 80/20 or 70/30 for training/validation)? Though, if I use all of the prompts included with the speech recorder, I guess that's closer to 90/10. πŸ€” Heh, I'll play around!

The recommended Kaldi scripts seem to generally stick to just the absolute value of 300 utterances for validation. But you shouldn't have to worry about splitting up the training into sets. Kaldi nnet can be a bit different than usual, especially in some of the terminology. For example, the "epochs" parameter can be deceptive: the rule-of-thumb value of 5 epochs actually ends up being effectively more like 50 iterations over the data.

bluecamel commented 3 years ago

The recommended Kaldi scripts seem to generally stick to just the absolute value of 300 utterances for validation. But you shouldn't have to worry about splitting up the training into sets. Kaldi nnet can be a bit different than usual, especially in some of the terminology. For example, the "epochs" parameter can be deceptive: the rule-of-thumb value of 5 epochs actually ends up being effectively more like 50 iterations over the data.

Thanks for the info. Does it make sense that I would get a model that doesn't detect anything with 1600 recorded prompts? I assume that that's quite low, and am working on recording more, but just making sure I'm on track otherwise.

Also, can you confirm which G.fst file is to be used for the compile_agf_dictation_graph step? I've tried both the one from the base model and the new exported model, but can't tell any difference (though, again, I'm assuming that I just need more data).

kendonB commented 3 years ago

OK! I have got to the small number of utterances error which is good news. Thanks so much @bluecamel and @daanzu for the guidance.

I would have thought that the fine-tuning option would still work with a small number of additional prompts; in particular, I would have hoped it couldn't get worse at interpreting your voice. So I'm guessing a config issue is causing the problem of your model not detecting anything.

daanzu commented 3 years ago

@bluecamel Bad model: Hmm, that sounds odd. I can't remember the minimum amount of training data I've tried with. I will run a test when I get a chance.

G.fst: They should be the same. The generated model directory is mostly just stuff copied from the source directory, including G.fst.

bluecamel commented 3 years ago

@daanzu Thanks for confirming. So, I may have actually got a good model, but didn't know it!

I got curious if maybe my audio data was just bad in some way. I wrote a little script that reads 'recorder.tsv', feeds the audio to kaldi-active-grammar plain dictation, and compares the result to the prompt. I figured that if it was detecting some percentage of words correctly, then my data was at least not terrible.
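It was roughly this (simplified from memory; assumes kaldi-active-grammar's PlainDictationRecognizer, that the wav path and prompt are the first and last tab-separated fields, and canonical 44-byte wav headers):

from kaldi_active_grammar import PlainDictationRecognizer

recognizer = PlainDictationRecognizer(model_dir="kaldi_model")

matches = total = 0
with open("audio_data/recorder.tsv") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        wav_path, prompt = fields[0], fields[-1]  # assumes path first, prompt last
        with open(wav_path, "rb") as wav:
            wav.read(44)       # crude: skip RIFF header, decoder wants raw PCM
            data = wav.read()
        output_str, info = recognizer.decode_utterance(data)
        total += 1
        matches += output_str.strip().lower() == prompt.strip().lower()
print("%d/%d exact matches" % (matches, total))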

I ran it against different models, including the "medium" model that was in my kaldi environment when I ran training. Exact, word for word matches on almost everything. Wait, did the last step actually update that model and not the exported model? I unzipped a fresh copy of the original "medium" model and ran the test script. Decent results, but not exact.

When I trained, I would switch the kaldi model to the new exported model directory, which wasn't detecting anything. I thought that that would be the new model.

Does that sound right? It seems that I've proven it, but I'm having trouble believing that the new model performs this well and that I'm not just missing some fluke.

Ignore me. I did indeed fool myself.

kendonB commented 3 years ago

@bluecamel did you end up getting this to work?

bluecamel commented 3 years ago

@kendonB Sorry, but no. I tried for a few nights, but I don't know enough about KAG to figure out what's wrong. I needed a solution in the short term, so moved to DNS, but I'd like to come back to it eventually. I'd love to use it and ditch the VM.

bluecamel commented 3 years ago

I guess I'll close this since I'm not going to have time to dive deeper for a while. 😞