MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

Number normalization #509

Closed yanirmr closed 1 year ago

yanirmr commented 2 years ago

Is your feature request related to a problem? Please describe.

The model marks words that are not in its vocabulary as out-of-vocabulary (OOV), and we discovered that many of these words are numbers: years, percentages, numerators, ordinals, whole numbers, and decimals. To match these numbers to the audio, we thought it would be better to convert them into words.

Describe the solution you'd like

There are several ways to convert numbers to words. With spaCy, for example, it is possible to identify the role of a word in its sentence and then convert it accordingly; other methods could be used as well.
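To make the request concrete, here is a minimal sketch of the simplest case: spelling out a non-negative integer below one million in English. The function name `int_to_words` and the word tables are hypothetical and are not part of MFA or spaCy. The output is unhyphenated ("forty two") on the assumption that the pronunciation dictionary lists each word separately.

```python
# Hypothetical helper: spell out a cardinal number in English.
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def int_to_words(n: int) -> str:
    """Return the cardinal reading of n, for 0 <= n < 1,000,000."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only integers in [0, 1000000) are supported")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        head = f"{ONES[hundreds]} hundred"
        return head if rest == 0 else f"{head} {int_to_words(rest)}"
    thousands, rest = divmod(n, 1000)
    head = f"{int_to_words(thousands)} thousand"
    return head if rest == 0 else f"{head} {int_to_words(rest)}"


print(int_to_words(42))    # forty two
print(int_to_words(1999))  # one thousand nine hundred ninety nine
```

The second call shows why the role of the token matters: as a year, 1999 would usually be read "nineteen ninety nine", so a purely cardinal conversion like this one is not enough on its own.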

mmcauliffe commented 1 year ago

I'm going to mark this as won't fix for MFA.
I'm hesitant to include any automated normalization as part of MFA for the following reasons:

  1. In my experience, the risk of unknowingly introducing noisy/garbage data is very high.
  2. Any tokenizer is going to be limited in its coverage, and there are so many language-specific packages out there with varying degrees of support for OS/CUDA etc.
  3. I wouldn't be able to really test and support a decent range of languages, so I think it would do more harm than the introduction of OOVs (which won't impact training/alignment as much as completely mismatched phone strings).
  4. It's so dataset-specific that I really think it should be done as a preprocessing step before training/aligning, and should involve a fair bit of spot checking to ensure the tokenization system has done what it's supposed to have done.

I have a number of preprocessing scripts spread across a couple of repos, namely https://github.com/mmcauliffe/corpus-creation-scripts and https://github.com/MontrealCorpusTools/MFA-reorganization-scripts, that I would recommend looking at for examples of doing your own tokenization.
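As a rough illustration of that kind of preprocessing step (a sketch under stated assumptions, not code from the linked repos), the snippet below rewrites digit tokens in a transcript line and keeps a log of every replacement so the results can be spot checked before training or aligning. The names `normalize_line` and `toy_convert` are hypothetical, and the conversion function is deliberately a tiny lookup table: anything it does not know is left untouched rather than guessed.

```python
import re
from typing import Callable, List, Tuple

# Matches runs of digits only. Decimals, percentages, and ordinals are not
# handled specially: "3.5" is seen as two separate tokens, "3" and "5".
DIGITS = re.compile(r"\b\d+\b")


def normalize_line(
    line: str, convert: Callable[[str], str]
) -> Tuple[str, List[Tuple[str, str]]]:
    """Replace digit tokens using `convert` and log (original, replacement)."""
    changes: List[Tuple[str, str]] = []

    def _substitute(match: "re.Match[str]") -> str:
        original = match.group(0)
        spoken = convert(original)
        changes.append((original, spoken))
        return spoken

    return DIGITS.sub(_substitute, line), changes


def toy_convert(token: str) -> str:
    """Hypothetical converter: a dataset-specific lookup, nothing more."""
    table = {"2": "two", "10": "ten"}
    return table.get(token, token)  # unknown numbers are left as-is


text, log = normalize_line("she bought 2 apples and 10 pears in 1999", toy_convert)
print(text)
for original, spoken in log:
    # Unchanged entries (original == spoken) are the ones still to review.
    print(f"{original!r} -> {spoken!r}")
```

Entries in the log whose original and replacement are identical are the tokens that would still surface as OOVs, which makes them a natural starting point for manual review.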