Closed yanirmr closed 1 year ago
I'm going to mark this as won't fix for MFA.
I'm hesitant to include any automated normalization as part of MFA for the following reasons:
I have a number of preprocessing scripts spread across two repos, https://github.com/mmcauliffe/corpus-creation-scripts and https://github.com/MontrealCorpusTools/MFA-reorganization-scripts, which I would recommend looking at for examples of doing your own tokenization.
Is your feature request related to a problem? Please describe.
The model marks words that are not in its vocabulary as out-of-vocabulary (OOV), and we discovered that many of these words are numbers: years, percentages, numerators, ordinal numbers, whole numbers, and decimal numbers. To align these numbers with the audio, we think it would be better to convert them into words.
Describe the solution you'd like
There are several ways to convert numbers to words. With spaCy, for example, it is possible to identify the role of a word and then convert it accordingly. Other methods could be used as well.
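As a rough illustration of the kind of conversion requested here, the following is a minimal sketch of a preprocessing step that could run on transcripts before alignment. It is not part of MFA and does not use spaCy: it is plain Python with a regular expression, and the names `int_to_words`, `number_to_words`, and `normalize` are hypothetical. It assumes English, covers only whole numbers below one million, decimals, and percentages, and does not handle years, ordinals, or thousands separators such as `1,000`.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Matches whole numbers and decimals, optionally followed by a percent sign.
NUMBER = re.compile(r"\d+(?:\.\d+)?%?")


def int_to_words(n: int) -> str:
    """Spell out a non-negative integer below one million.

    Words are separated by spaces (no hyphens) so that each one can be
    looked up as a separate entry in a pronunciation dictionary.
    """
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + (" " + ONES[rest] if rest else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = " " + int_to_words(rest) if rest else ""
        return ONES[hundreds] + " hundred" + tail
    if n < 1_000_000:
        thousands, rest = divmod(n, 1000)
        tail = " " + int_to_words(rest) if rest else ""
        return int_to_words(thousands) + " thousand" + tail
    raise ValueError("number too large for this sketch")


def number_to_words(token: str) -> str:
    """Convert a token such as '42', '3.5', or '12%' to words."""
    is_percent = token.endswith("%")
    whole, _, fraction = token.rstrip("%").partition(".")
    words = [int_to_words(int(whole))]
    if fraction:
        # Decimal digits are read out one at a time.
        words.append("point")
        words.extend(ONES[int(digit)] for digit in fraction)
    if is_percent:
        words.append("percent")
    return " ".join(words)


def normalize(text: str) -> str:
    """Replace every number in the text with its spelled-out form."""
    def replace(match: re.Match) -> str:
        try:
            return number_to_words(match.group())
        except ValueError:
            # Leave numbers this sketch cannot handle untouched.
            return match.group()
    return NUMBER.sub(replace, text)


print(normalize("It rose 3.5% in 42 days"))
```

Telling a year from a whole number (1984 as "nineteen eighty four" rather than "one thousand nine hundred eighty four") needs context that a regular expression alone does not have, which is where a tagger such as spaCy might come in.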