CoEDL / elpis

🙊 software for creating speech recognition models.
https://elpis.readthedocs.io/en/latest/
Apache License 2.0

compile corpora in dataset, explode/collapse punc handling #81

Closed benfoley closed 4 years ago

benfoley commented 4 years ago

corpus enhancements

Previously, only a single file named corpus.txt would be used as additional corpus data for training. The limitation was that only this explicitly named file was copied from the uploaded file dir original/corpus.txt into model/ID/text-corpora/ in the model.py build_kaldi_structure step (this was added in PR #80). Although there was code in json_to_kaldi.py to compile multiple files from this text-corpora dir into a single one, it was redundant because only one file would ever be copied into the text-corpora dir.

My response to this has been to move the corpus compilation processing from the model build_kaldi_structure() preparation step back to the dataset process() step. I prefer doing file processing in the dataset step rather than treating it as part of the model processing.

The flow now goes:

One day, it would be good to add a GUI element specifically to accept corpus files, which would post files to a new endpoint, allowing the current dataset/files endpoint to be dedicated to transcription files. This would let us remove the requirement that corpus filenames include the word "corpus", because we could use a destination arg to direct files to either the text_corpora or original dirs. In preparation for this I've added the destination arg to the files endpoint with default value "original".
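As a rough sketch of what that routing could look like (the function name and signature here are hypothetical, not the actual elpis endpoint code), the destination arg would just pick the subdirectory an upload lands in:

```python
import os
import shutil

def route_uploaded_file(filename, upload_dir, dataset_dir, destination="original"):
    """Hypothetical helper: copy an uploaded file into the dataset,
    routed by a destination arg ("original" or "text_corpora") instead
    of relying on the filename containing the word "corpus"."""
    target_dir = os.path.join(dataset_dir, destination)
    os.makedirs(target_dir, exist_ok=True)
    dst = os.path.join(target_dir, filename)
    shutil.copy(os.path.join(upload_dir, filename), dst)
    return dst
```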

punctuation

Moving the corpus cleaning from model to dataset also allowed the punctuation cleaning to happen solely in the dataset prep stage rather than being in both dataset and model.

Punctuation cleaning is done in two stages. First, one list of punctuation marks is replaced by spaces. Second, another list of marks is stripped out entirely.

The two lists (punctuation_to_explode_by and punctuation_to_collapse_by respectively) are specified as config values in dataset init(). There is a new GUI element to update the explode list, by setting the current dataset.config['punctuation_to_explode_by'] object via the dataset/settings endpoint.

The cleaning is handled by a new function deal_with_punctuation() in clean_json.py. Probably should rename that file now because it's actually cleaning more than JSON. 🤷‍♂️
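The two-stage clean could look something like this (a minimal sketch, assuming a regex character-class approach; the actual deal_with_punctuation() in the branch may differ):

```python
import re

def deal_with_punctuation(text, punctuation_to_explode_by, punctuation_to_collapse_by):
    # Stage 1: "explode" — replace these marks with spaces,
    # so e.g. hyphenated forms split into separate words.
    if punctuation_to_explode_by:
        text = re.sub(f"[{re.escape(punctuation_to_explode_by)}]", " ", text)
    # Stage 2: "collapse" — strip these marks out entirely.
    if punctuation_to_collapse_by:
        text = re.sub(f"[{re.escape(punctuation_to_collapse_by)}]", "", text)
    # Normalise any whitespace introduced by stage 1.
    return re.sub(r"\s+", " ", text).strip()
```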

The dataset.process() step cleans each utterance in the transcription json data, via clean_json_data() which in turn calls the new deal_with_punctuation() function.

dataset.process() also cleans the corpus.txt files when it compiles the additional corpus files in extract_additional_corpora(). Punctuation cleaning happens for each line as the additional corpus files are read, so we no longer need the clean_corpus_txt() function, which previously was used to clean the corpus.txt file.
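The compile-and-clean step might look roughly like this (a hypothetical sketch, not the branch's actual extract_additional_corpora(); the inline _clean_line() stands in for deal_with_punctuation()):

```python
import re

def _clean_line(line, explode, collapse):
    # Minimal stand-in for deal_with_punctuation(): explode, then collapse.
    if explode:
        line = re.sub(f"[{re.escape(explode)}]", " ", line)
    if collapse:
        line = re.sub(f"[{re.escape(collapse)}]", "", line)
    return re.sub(r"\s+", " ", line).strip()

def extract_additional_corpora(corpus_paths, output_path, explode, collapse):
    """Compile all additional corpus files into one, cleaning
    punctuation line-by-line as each file is read."""
    with open(output_path, "w", encoding="utf-8") as out:
        for path in corpus_paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    cleaned = _clean_line(line, explode, collapse)
                    if cleaned:
                        out.write(cleaned + "\n")
```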

benfoley commented 4 years ago

Also, is there a shorthand way for using var names that match named keyword arguments? E.g.

deal_with_punctuation(text=word,
    punctuation_to_collapse_by=punctuation_to_collapse_by,
    punctuation_to_explode_by=punctuation_to_explode_by)
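(For reference: Python has no keyword-argument "punning" shorthand; the closest common pattern is collecting such settings in a dict and unpacking it with **. A small illustration, with a stand-in function body:)

```python
# Settings collected once, e.g. from dataset.config, then unpacked at
# each call site instead of repeating name=name pairs.
punctuation_settings = {
    "punctuation_to_collapse_by": ",.",
    "punctuation_to_explode_by": "-",
}

def deal_with_punctuation(text, punctuation_to_collapse_by, punctuation_to_explode_by):
    return text  # stand-in body for illustration only

result = deal_with_punctuation(text="word", **punctuation_settings)
```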
benfoley commented 4 years ago

This work is in response to #38

nicklambourne commented 4 years ago

Note that the logging/file clearing suggestions have not been implemented in the branch punct2, only suggested.

benfoley commented 4 years ago

This was updated and merged in #82