MozillaItalia / DeepSpeech-Italian-Model

Tooling for producing Italian model (public release available) for DeepSpeech and text corpus
GNU General Public License v3.0

MITADS - convert numbers to their literal expression #112

Open nefastosaturo opened 3 years ago

nefastosaturo commented 3 years ago

We need to convert:

10 -> dieci
2012 -> duemiladodici
3,14 -> tre virgola quattordici

and so on

https://discourse.mozilla.org/t/converting-numbers-in-textual-form-to-numerical-values-in-stt-output/26321

Then we can also consider #100.

eziolotta commented 3 years ago

We could use the N2W-IT library.

I'm not sure where it is more convenient to implement this conversion: on the MITADS dataset (language model), on the MITADS-Speech transcriptions (acoustic model), or both. What do you think?

In MITADS, sentences that contain numbers are currently excluded (Roman numerals are present instead); the exclusion happens in merge_txt.sh. The MITADS-Speech importers and the CommonVoice importer also exclude sentences of this type.

eziolotta commented 3 years ago

Looking more closely, I realized that the speech datasets we import do not contain cardinal or ordinal numbers, even though the importers (e.g. the CV importer) include processing for them. So the right place to implement the conversion is the MITADS text preprocessing.
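
To make the idea concrete, here is a minimal sketch of what such a preprocessing step could look like. It is not the project's code and it does not use N2W-IT (whose API is not shown in this thread); it is a small self-contained converter written only for illustration. The names `int_to_words` and `normalize_numbers` are hypothetical. Assumptions: only integers from 0 to 999999 are handled, a comma is the decimal separator, the fractional part is read as a whole number (as in the "3,14" example above), and leading zeros after the comma are spelled one by one.

```python
import re

UNITS = ["zero", "uno", "due", "tre", "quattro", "cinque", "sei", "sette",
         "otto", "nove", "dieci", "undici", "dodici", "tredici",
         "quattordici", "quindici", "sedici", "diciassette", "diciotto",
         "diciannove"]
TENS = {2: "venti", 3: "trenta", 4: "quaranta", 5: "cinquanta",
        6: "sessanta", 7: "settanta", 8: "ottanta", 9: "novanta"}


def _below_thousand(n):
    """Spell 1..999, without the final accent on 'tre'."""
    if n < 20:
        return UNITS[n]
    if n < 100:
        tens, unit = divmod(n, 10)
        word = TENS[tens]
        if unit == 0:
            return word
        if unit in (1, 8):
            word = word[:-1]  # drop the final vowel: venti + uno -> ventuno
        return word + UNITS[unit]
    hundreds, rest = divmod(n, 100)
    prefix = "cento" if hundreds == 1 else UNITS[hundreds] + "cento"
    if rest == 0:
        return prefix
    if rest // 10 == 8:
        prefix = prefix[:-1]  # assumed elision: cento + ottanta -> centottanta
    return prefix + _below_thousand(rest)


def int_to_words(n):
    """Spell an integer in Italian (sketch: 0..999999 only)."""
    if not 0 <= n <= 999_999:
        raise ValueError("only 0..999999 is supported in this sketch")
    if n == 0:
        return "zero"
    thousands, rest = divmod(n, 1000)
    if thousands == 0:
        word = ""
    elif thousands == 1:
        word = "mille"
    else:
        word = _below_thousand(thousands) + "mila"
    if rest:
        word += _below_thousand(rest)
    # A compound ending in "tre" takes an accent: ventitré, centotré.
    if n > 3 and n % 10 == 3 and n % 100 != 13:
        word = word[:-1] + "é"
    return word


# Digits, optionally followed by a comma and more digits (decimal number).
NUMBER_RE = re.compile(r"\d+(?:,\d+)?")


def _spell(match):
    whole, _, frac = match.group(0).partition(",")
    if len(whole) > 6 or len(frac) > 6:
        return match.group(0)  # out of range for this sketch: leave as-is
    words = int_to_words(int(whole))
    if frac:
        stripped = frac.lstrip("0")
        zeros = ["zero"] * (len(frac) - len(stripped))
        tail = [int_to_words(int(stripped))] if stripped else []
        words += " virgola " + " ".join(zeros + tail)
    return words


def normalize_numbers(text):
    """Replace every number in `text` with its Italian spelling."""
    return NUMBER_RE.sub(_spell, text)


print(normalize_numbers("10"))    # dieci
print(normalize_numbers("2012"))  # duemiladodici
print(normalize_numbers("3,14"))  # tre virgola quattordici
```

Ordinals, Roman numerals, thousands separators such as "1.000", and dates are deliberately left out; a real implementation would probably delegate the spelling to a maintained library and keep only the regex-driven substitution.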