Also, is there a shorthand way for using var names that match named keyword arguments? E.g.

```python
deal_with_punctuation(text=word,
                      punctuation_to_collapse_by=punctuation_to_collapse_by,
                      punctuation_to_explode_by=punctuation_to_explode_by)
```
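For what it's worth, Python has no built-in `name=name` keyword shorthand; the closest common pattern is dict unpacking. A minimal sketch, with a hypothetical dict name:

```python
# Collect the settings whose variable names match the keyword arguments...
punctuation_settings = {
    "punctuation_to_collapse_by": punctuation_to_collapse_by,
    "punctuation_to_explode_by": punctuation_to_explode_by,
}
# ...then unpack them at each call site.
deal_with_punctuation(text=word, **punctuation_settings)
```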
This work is in response to #38
Note that the logging/file-clearing suggestions have only been suggested, not yet implemented, in the `punct2` branch.
This was updated and merged in #82
corpus enhancements
Previously, only a single file, named `corpus.txt`, would be used as additional corpus data for training. The limitation was that only this explicitly named file was copied from the uploaded file dir (`original/corpus.txt`) into `model/ID/text-corpora/` in the `model.py` `build_kaldi_structure` step (this was added in PR #80). Although there was code in `json_to_kaldi.py` to compile multiple files from this `text-corpora` dir into a single one, it was redundant, because only one file would ever be copied into the `text-corpora` dir.

My response has been to move the corpus compilation processing from the model `build_kaldi_structure()` preparation step back to the dataset `process()` step. I like the thought of doing file processing in the dataset step rather than thinking of it as part of the model processing.

The flow now goes:
1. `add_fp()` puts uploaded files with "corpus" in the filename into the `dataset/ID/text_corpora/` dir, and everything else into the `dataset/ID/original/` dir (sketched below).
2. `process()` now iterates the `text_corpora` dir and compiles the text files into a `dataset/ID/cleaned/corpus.txt` file.
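A minimal sketch of the routing in step 1, assuming `add_fp()` receives a filename and the raw file contents (the signature and variable names here are hypothetical; only the "corpus"-in-filename rule and the two destination dirs come from this PR):

```python
from pathlib import Path

def add_fp(dataset_dir: Path, filename: str, data: bytes) -> None:
    # Files with "corpus" in the filename are treated as additional corpus
    # text; everything else is treated as transcription data.
    subdir = "text_corpora" if "corpus" in filename else "original"
    destination = dataset_dir / subdir
    destination.mkdir(parents=True, exist_ok=True)
    (destination / filename).write_bytes(data)
```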
One day, it would be good to add a GUI element specifically for accepting corpus files, which would post files to a new endpoint, allowing the current `dataset/files` endpoint to be dedicated to transcription files. This would remove the limitation of requiring the word "corpus" in the filenames, because we could use a `destination` arg to direct files to either the `text_corpora` or `original` dirs. In preparation for this, I've added the `destination` arg to the files endpoint, with a default value of "original".
punctuation

Moving the corpus cleaning from model to dataset also allowed the punctuation cleaning to happen solely in the dataset prep stage, rather than in both dataset and model.

Punctuation cleaning is done in two stages: first, one list of punctuation marks is replaced by spaces; second, another list of marks is stripped.
The two lists (`punctuation_to_explode_by` and `punctuation_to_collapse_by`, respectively) are specified as config values in the dataset `init()`. There is a new GUI element to update the explode list, by setting the current `dataset.config['punctuation_to_explode_by']` object via the `dataset/settings` endpoint.
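For example, the GUI could update the explode list with a request along these lines (the URL and payload shape are my assumptions, not confirmed from the Elpis code):

```python
import requests

# Hypothetical client call: replace the current explode list with these marks.
requests.post(
    "http://localhost:5000/dataset/settings",
    data={"punctuation_to_explode_by": ":;!?,"},
)
```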
The cleaning is handled by a new function, `deal_with_punctuation()`, in `clean_json.py`. Probably should rename that file now, cause it's actually cleaning more than JSON. 🤷‍♂️
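A minimal sketch of the two-stage cleaning described above (my reading of the behaviour, not the exact implementation):

```python
import re

def deal_with_punctuation(text: str,
                          punctuation_to_explode_by: str,
                          punctuation_to_collapse_by: str) -> str:
    # Stage 1: "explode" marks are replaced by spaces.
    if punctuation_to_explode_by:
        text = re.sub(f"[{re.escape(punctuation_to_explode_by)}]", " ", text)
    # Stage 2: "collapse" marks are stripped out entirely.
    if punctuation_to_collapse_by:
        text = re.sub(f"[{re.escape(punctuation_to_collapse_by)}]", "", text)
    # Tidy any doubled-up whitespace left by stage 1.
    return re.sub(r"\s+", " ", text).strip()
```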
The `dataset.process()` step cleans each utterance in the transcription JSON data via `clean_json_data()`, which in turn calls the new `deal_with_punctuation()` function.

`dataset.process()` also cleans the corpus text files when it compiles the additional corpus files in `extract_additional_corpora()`. Punctuation cleaning happens for each line as the additional corpus files are read, so we no longer need the `clean_corpus_txt()` function, which was previously used to clean the `corpus.txt` file.
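A sketch of that compilation step, reusing the `deal_with_punctuation()` sketch above (the signature and file layout are assumptions based on the paths mentioned in this PR):

```python
from pathlib import Path

def extract_additional_corpora(dataset_dir: Path,
                               punctuation_to_explode_by: str,
                               punctuation_to_collapse_by: str) -> None:
    # Compile every file in text_corpora/ into a single cleaned corpus.txt,
    # cleaning punctuation line by line as the files are read.
    cleaned_dir = dataset_dir / "cleaned"
    cleaned_dir.mkdir(parents=True, exist_ok=True)
    with (cleaned_dir / "corpus.txt").open("w") as out:
        for corpus_file in sorted((dataset_dir / "text_corpora").iterdir()):
            if not corpus_file.is_file():
                continue
            for line in corpus_file.open():
                cleaned = deal_with_punctuation(line.strip(),
                                                punctuation_to_explode_by,
                                                punctuation_to_collapse_by)
                if cleaned:
                    out.write(cleaned + "\n")
```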