Number normalization - Githubissues

I'm going to mark this as won't fix for MFA.
I'm hesitant to include any automated normalization as part of MFA for the following reasons:

- In my experience, the risk of unknowingly introducing noisy/garbage data is very high.
- Any tokenizer is going to be limited in its coverage, and there are many language-specific packages out there with varying degrees of support for OS/CUDA etc.
- I wouldn't be able to really test and support a decent range of languages, so I think it would do more harm than the introduction of OOVs (which won't impact training/alignment as much as completely mismatched phone strings).
- It's so dataset-specific that I really think it should be done as a preprocessing step before training/aligning, and it should involve a fair bit of spot checking to ensure the tokenization system has done what it's supposed to do.
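
To illustrate the kind of preprocessing step described above, here is a minimal sketch in Python. It is not part of MFA and not taken from the linked repos; `number_to_words` and `normalize_line` are hypothetical helpers, English-only, and they only cover standalone integers from 0 to 999. Anything outside that coverage (decimals, years, leading zeros, mixed tokens like `1990s`) is deliberately left untouched so it shows up as an OOV rather than being silently mis-expanded. Spelling numbers as space-separated words (`forty two`) is also an assumption; adjust it to match your pronunciation dictionary.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer 0 <= n <= 999 in English (hypothetical helper)."""
    if not 0 <= n <= 999:
        raise ValueError("only 0..999 is covered by this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    hundreds, rest = divmod(n, 100)
    tail = "" if rest == 0 else " " + number_to_words(rest)
    return ONES[hundreds] + " hundred" + tail


# A standalone run of 1-3 digits: preceded by whitespace or start of line,
# followed by optional punctuation and then whitespace or end of line.
STANDALONE_INT = re.compile(r"(?<!\S)\d{1,3}(?=[.,;:!?]?(?:\s|$))")


def normalize_line(text: str):
    """Return (normalized_text, changes) so every replacement can be reviewed."""
    changes = []

    def replace(match):
        token = match.group(0)
        # Leading zeros ("007") are ambiguous, so leave them for manual review.
        if len(token) > 1 and token.startswith("0"):
            return token
        words = number_to_words(int(token))
        changes.append((token, words))
        return words

    return STANDALONE_INT.sub(replace, text), changes


if __name__ == "__main__":
    line = "I bought 42 apples and 3.5 kg in 2019."
    normalized, changes = normalize_line(line)
    print(normalized)
    for original, replacement in changes:
        print(f"  {original!r} -> {replacement!r}")
```

Returning the list of changes alongside the text is what makes spot checking practical: you can dump the replacements for a sample of transcripts and read through them before running training or alignment.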

I have a number of preprocessing scripts spread across a couple of repos, namely https://github.com/mmcauliffe/corpus-creation-scripts and https://github.com/MontrealCorpusTools/MFA-reorganization-scripts, which I would recommend looking at for examples of doing your own tokenization.
