MozillaItalia / DeepSpeech-Italian-Model

Tooling for producing the Italian model for DeepSpeech (public release available) and the text corpus
GNU General Public License v3.0

MITADS - Transcript roman numbers #100

Open Mte90 opened 4 years ago

Mte90 commented 4 years ago

The text corpus includes Roman numerals. We need to convert them to ordinary numbers, and we also need to spot false positives.

We need a way to detect Roman numerals without matching other text that merely contains the same letters.
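As a starting point for discussion, here is a minimal sketch of strict detection and conversion. It is not the code in roman_numbers.py; the names `TOKEN_RE`, `STRICT_RE`, `is_roman`, `roman_to_int` and `replace_roman_numerals` are hypothetical. It assumes numerals are written in uppercase, are in canonical form (1 to 3999), and should be replaced by digits.

```python
import re

# Candidate tokens: whole words made only of uppercase Roman letters.
TOKEN_RE = re.compile(r"\b[IVXLCDM]+\b")

# Canonical Roman numerals from 1 to 3999 (thousands, hundreds, tens, ones).
STRICT_RE = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def is_roman(token: str) -> bool:
    """True if the whole token is a canonical Roman numeral."""
    return bool(token) and STRICT_RE.fullmatch(token) is not None


def roman_to_int(numeral: str) -> int:
    """Convert a canonical Roman numeral to an integer."""
    total = 0
    prev = 0
    # Read right to left: a smaller value before a larger one is subtracted.
    for ch in reversed(numeral):
        value = ROMAN_VALUES[ch]
        total += -value if value < prev else value
        prev = value
    return total


def replace_roman_numerals(text: str) -> str:
    """Replace every canonical Roman numeral token with its digits."""
    def _sub(match: re.Match) -> str:
        token = match.group(0)
        return str(roman_to_int(token)) if is_roman(token) else token

    return TOKEN_RE.sub(_sub, text)


print(replace_roman_numerals("Luigi XIV e il capitolo IX"))
```

Strict validation rejects tokens such as `CIVIL` or `IIII`, but it cannot tell a numeral from an uppercase Italian word that happens to be a valid numeral (for example `VI` or `DI`), so it does not solve the false positive problem on its own.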

ilyasmg commented 4 years ago

I see that there's a roman_numbers.py script. What's the problem? Is it not accurate enough?

Mte90 commented 4 years ago

It isn't perfect; we had various false positives with it.

eziolotta commented 3 years ago

Which importers have the most sentences with Roman numerals? For the TED importer there is an issue concerning Roman numerals: in the function _maybenormalize (ted_importer.py), the parameter _romannormalization is False, so the function _do_romannormalization is never executed (see utils.roman_numbers).

Mte90 commented 3 years ago

We removed that normalization from the TED importer because it produced a lot of false positives.
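One possible way to cut down on such false positives, sketched here under stated assumptions rather than taken from the repository: treat tokens that are also common words or single letters as ambiguous, and accept them only when a neighbouring word suggests a numeral. The `AMBIGUOUS` and `CUE_WORDS` lists and the function names are hypothetical and deliberately incomplete; real lists would have to be tuned on the corpus.

```python
import re

TOKEN_RE = re.compile(r"\b[IVXLCDM]+\b")
STRICT_RE = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")

# Hypothetical stoplist: valid numerals that are also Italian words
# or single letters, so they are ambiguous in uppercase text.
AMBIGUOUS = {"I", "V", "X", "L", "C", "D", "M", "DI", "MI", "VI", "CI", "LI"}

# Hypothetical cue words that often appear next to a real numeral.
CUE_WORDS = {"secolo", "capitolo", "papa", "re", "volume", "atto"}


def _has_cue(text: str, start: int, end: int) -> bool:
    """True if the word just before or just after the span is a cue word."""
    before = re.findall(r"\w+", text[:start])[-1:]
    after = re.findall(r"\w+", text[end:])[:1]
    return any(word.lower() in CUE_WORDS for word in before + after)


def find_roman_numerals(text: str) -> list:
    """Return (start, end, token) for tokens accepted as Roman numerals."""
    hits = []
    for match in TOKEN_RE.finditer(text):
        token = match.group(0)
        if not STRICT_RE.fullmatch(token):
            continue  # not a canonical numeral, e.g. "CIVIL" or "IIII"
        if token in AMBIGUOUS and not _has_cue(text, match.start(), match.end()):
            continue  # ambiguous token without supporting context
        hits.append((match.start(), match.end(), token))
    return hits


print(find_roman_numerals("Il capitolo VI parla di Luigi XIV"))
print(find_roman_numerals("VI ringrazio"))
```

The trade-off is recall: an ambiguous numeral with no cue word nearby is left untouched, which seems preferable for a training corpus to converting ordinary words into numbers.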