Closed · oadams closed this 4 years ago
This dates back to some of the earliest data-cleaning scripts we had, and very much language-specific. Removing it won't break anything. Preferable to have better support for languages with digit-significance. As you suggest, the check could come later in the pron lexicon stage. I'll make a note to include that in future work.
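Deferring the check to the pronunciation lexicon stage could look roughly like the sketch below. This is a hypothetical illustration, not code from the repo: `has_pronunciation`, `keep_utterance`, and the dict-based `g2p_rules` are all made-up names for the idea of dropping an utterance only when a token genuinely has no pronunciation, rather than whenever it contains a digit.

```python
def has_pronunciation(token, lexicon, g2p_rules):
    # Hypothetical helper: a token is usable if the lexicon lists it,
    # or if the G2P rules cover every character in it.
    return token in lexicon or all(ch in g2p_rules for ch in token)

def keep_utterance(tokens, lexicon, g2p_rules):
    # Drop the utterance only when some token has no known pronunciation,
    # not merely because a token contains a digit.
    return all(has_pronunciation(t, lexicon, g2p_rules) for t in tokens)

# Chatino-style tokens with tone digits survive, provided the G2P rules
# map the digits to tones:
g2p = {c: c for c in "abcdefghijklmnopqrstuvwxyz0123456789"}
print(keep_utterance(["kcha1", "ntyka32"], lexicon=set(), g2p_rules=g2p))  # → True
```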
`clean_utterance()` in `clean_json.py` removes any utterance that contains a token with a digit in it. This behaviour will break training for languages where digits are part of the orthography. For instance, the romanization of Chatino uses digits to indicate tone, so every utterance in the corpus gets removed.

Why was this behaviour there in the first place? Perhaps the idea is that some transcriptions contain digits whose pronunciation in the language is unknown. If that's the case, I think the check should be done somewhere else, such as where the G2P rules or the pronunciation lexicon are applied, since pronunciations could well be supplied for digits.
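For reference, the behaviour being described is roughly the following. This is a sketch of the effect, not the actual `clean_json.py` implementation, and the function signature is assumed:

```python
import re

def clean_utterance(utterance):
    # Sketch of the current behaviour: discard the whole utterance
    # if any token in it contains a digit.
    tokens = utterance.split()
    if any(re.search(r"\d", tok) for tok in tokens):
        return None  # utterance discarded
    return tokens

# Chatino romanization marks tone with digits, so every utterance
# in that corpus is thrown away:
print(clean_utterance("kcha1 ntyka32"))   # → None
print(clean_utterance("hello world"))     # → ['hello', 'world']
```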
This PR is a work in progress, I suppose, because if there was a reason the digit check was added in the first place, this merge might break something else.