CoEDL / elpis

🙊 software for creating speech recognition models.
https://elpis.readthedocs.io/en/latest/
Apache License 2.0

Not deleting utterances that have words with digits in them. #86

Closed: oadams closed this 4 years ago

oadams commented 4 years ago

clean_utterance() in clean_json.py removes any utterance that contains a token with a digit in it. This behaviour will break training for languages where digits are used as part of the orthography. For instance, the romanization of Chatino uses digits to indicate tone. As a result, all utterances in such a corpus are removed.
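To make the failure concrete, here is a minimal sketch of the kind of filter described above. It is not the actual code from clean_json.py: the function names, the `transcript` key, and the tone-digit tokens are all invented for illustration.

```python
import re
from typing import Optional

DIGIT = re.compile(r"\d")


def drop_if_digit_token(utterance: dict) -> Optional[dict]:
    """Hypothetical old rule: reject the utterance if any token contains a digit."""
    tokens = utterance["transcript"].split()
    if any(DIGIT.search(token) for token in tokens):
        return None
    return utterance


def keep_utterance(utterance: dict) -> Optional[dict]:
    """Hypothetical new rule: digits are treated as ordinary orthography."""
    return utterance


# Invented tokens in a tone-digit style; not real Chatino data.
corpus = [
    {"transcript": "ka4 na32"},
    {"transcript": "ti1 ko24 ka4"},
]

old_kept = [u for u in corpus if drop_if_digit_token(u) is not None]
new_kept = [u for u in corpus if keep_utterance(u) is not None]

print(len(old_kept), len(new_kept))
```

With the digit rule in place, nothing in a corpus like this survives cleaning, so there is nothing left to train on.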

Why was the behaviour there in the first place? Perhaps the idea is that some transcriptions will contain digits whose pronunciation in the language is unknown? If that's the case, I think the check should be done somewhere else, such as where the G2P rules or the pronunciation lexicon are used, since it's possible that pronunciations would be supplied for digits.
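A rough sketch of what a later check might look like, assuming the pronunciation lexicon is available as a plain dict from token to phone list. The function name, the data shapes, and the example tokens are hypothetical and not taken from the elpis codebase.

```python
import re
from typing import Dict, List, Set

DIGIT = re.compile(r"\d")


def digit_tokens_missing_pronunciation(
    transcripts: List[str], lexicon: Dict[str, List[str]]
) -> Set[str]:
    """Collect digit-bearing tokens that have no entry in the lexicon."""
    missing: Set[str] = set()
    for transcript in transcripts:
        for token in transcript.split():
            if DIGIT.search(token) and token not in lexicon:
                missing.add(token)
    return missing


# Invented lexicon entries and transcripts, for illustration only.
lexicon = {"ka4": ["k", "a"], "na32": ["n", "a"]}
transcripts = ["ka4 na32", "ka4 2019"]

missing = digit_tokens_missing_pronunciation(transcripts, lexicon)
print(missing)
```

Tokens such as tone-marked words pass because they have pronunciations, while a bare numeral with no lexicon entry gets flagged. What to do with flagged tokens (warn, skip the utterance, fall back to G2P) is left open here.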

This PR is a work in progress, I suppose, because if there was a reason the digit check was done in the first place, then this merge would break something else.

benfoley commented 4 years ago

This dates back to some of the earliest data-cleaning scripts we had, and is very much language-specific. Removing it won't break anything. It's preferable to have better support for languages where digits are significant. As you suggest, the check could come later, in the pronunciation lexicon stage. I'll make a note to include that in future work.