Training program/mode for the Pocket Sphinx engine

drmfinlay commented 6 years ago

A method for easily training the acoustic model used for the Pocket Sphinx engine would make it much more useful than it currently is. The default US English model is trained for (essentially) dictation/prose use, rather than for sequences of commands. This is (partly) why the accuracy is less than ideal.

The idea I have in mind is to train the active model using data from the engine, rather than by recording your voice manually. So as the engine is used, recorded audio and speech hypothesises could be stored in files in some configurable folder, perhaps a folder under MacroSystem.

The user could at some point say "start training" (or similar) to start a GUI training program that would have:

[ ] an ordered list of recognised phrases - phrases to train, loaded from the folder mentioned above
[ ] a button for playing the .wav file associated with the selected phrase (using PyAudio)
[ ] (crucially) a way to edit the phrases (by double clicking or using an Edit button) if Pocket Sphinx got it wrong
[ ] a button for deleting the selected phrase and associated .wav file
[ ] a status bar showing whether phrases were successfully saved after editing or deleted after the delete button was pressed
[ ] a button to run the .wav files and phrases through the training process using SphinxTrainingHelper
[ ] a log window for the training process
[ ] keyboard shortcuts for each button

Some notes

Tkinter could be used to make a cross-platform GUI.
SphinxTrainingHelper requires some programs from sphinxtrain which can be checked for when the training program starts.
- The programs required are bw, map_adapt, mllr_solve, mk_s2sendump and mllr_transform. All of them are in /usr/lib/sphinxtrain for me on Debian 9.
- os.walk could be used to find the required programs in common locations on Linux/Unix systems.
- It might be a good idea to eventually redistribute the binaries for Windows at least, as compiling both sphinxtrain and its sphinxbase dependency from source is a bit of a pain on Windows systems.
Sphinxtrain should allow you to specify noise in training phrases by using [NOISE].
The example for using PyAudio to record into a .wav file will be useful in making the engine record spoken phrases into files.
Phrases spoken when the training program is running shouldn't be added to the list of training phrases because it would be confusing.
The training program should run as a separate process from the engine/loader.
Training and audio files present when the training program starts would be moved to another directory for training, deleted if the training process is successful or moved back if unsuccessful.
The engine's Pocket Sphinx decoder would need to be reinitialised with the adapted model after successful training (engine restart).
Additional documentation would need to be added for common issues with training, engine configuration, etc.

This is pretty ambitious. I'm definitely open to ideas, feedback and help on this, especially on the GUI part.

LexiconCode commented 6 years ago

an ordered list of recognised phrases - phrases to train, loaded from the folder mentioned above

Could this be read from the grammars then generated into a list?

drmfinlay commented 6 years ago

Yes, the phrases can be generated from grammars. Did you mean as an alternative to using .wav files and recognised phrases from the engine?

It might be useful to have a training mode where recognition processing does not occur. For example, I could say 'start training session', say a bunch of commands in sequences that I want to train, then finish by saying 'end training session', after which the training program could be opened. Contexts would still work, but no action would be executed.

LexiconCode commented 6 years ago

That clarified for me recognized phrases which are loaded from grammars. I like your good idea how to implement the commands for training while maintaining their context.

drmfinlay commented 6 years ago

The recording speech and hypothesises part is done now. The engine's default behaviour is to create a training folder in the same folder as the module loader, write audio to trainingXXXX.wav files and hypothesises to training.transcriptions. The training.fileids file is also created. The files should work with the SphinxTrainingHelper script.

drmfinlay commented 6 years ago

The training session mode has been added to develop now. You can say "start training session" to start it or call SphinxEngine.start_training_session(). To end a session, you can say "end training session" or call SphinxEngine.end_training_session(). "end training session" doesn't start the training program GUI because it does not exist yet.

As I said in an above comment, the only difference between this mode and normal use is that no action or rule processing takes place. Keyphrases are exempt from this. Grammar contexts should still be taken into account.

A good use case for this mode is training commands that take a long time to execute their actions or are dangerous. Perhaps such commands keep getting falsely recognised and they need more training.

The phrases, threshold values and training data directory are configurable in the engine config module. If you set TRAINING_DATA_DIR to None in the engine config, no .wav files or transcript files will be created or written to. A warning will be displayed if a training session is started and this is the case.

I'll label this with Documentation because this feature will need to be properly documented at some point.

drmfinlay commented 4 years ago

I don't think this is really needed. Even if it was, I cannot commit to working on this. I hope my ideas above and in Dragonfly's source code are helpful for those attempting the task in the future.

As it is a considerably easier task than this, I still plan on unifying the training data output by the Pocket Sphinx engine to be similar to the output from other engines (see #196).

dictation-toolbox / dragonfly

Training program/mode for the Pocket Sphinx engine #22

Some notes