Building a custom external scorer (extending the Italian text corpus)

solyarisoftware commented 3 years ago

Hi all,

This topic probably belongs in the dataset / help wanted categories.

The problem: I want to build a custom scorer to try to improve DeepSpeech transcription accuracy in specific closed-domain contexts, extending the pre-trained Italian models available in this repo.

The two Italian pre-trained models are:

Here is the official documentation on how to build a .scorer file:

https://deepspeech.readthedocs.io/en/v0.9.3/Scorer.html#reproducing-our-external-scorer

Some more points in the forum:

As far as I understand, there are two options for building the .scorer file:

Building it from scratch, using just a custom corpus of sentences. In this case I know how to proceed: I create my custom text file, one sentence per line, and I generate the final scorer by following the steps detailed in the doc mentioned above ("reproducing our external scorer") [3].
Extending the corpus used for the pretrained models by adding custom sentences. This is my preferred way to go, because I want to enhance the original corpus with my closed-domain custom sentences and verify whether, at the end of the day, I get better overall accuracy in the custom context.
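
The second option boils down to concatenating the original corpus with the custom sentences before running the scorer build steps from [3]. A minimal sketch of that merge step follows. It is an illustration only, not the procedure used for the pretrained models: the function names are hypothetical, and the lowercasing, whitespace cleanup and deduplication are assumptions about what a reasonable cleanup might look like.

```python
import gzip
from pathlib import Path
from typing import Iterable, Iterator


def read_sentences(path: Path) -> Iterator[str]:
    """Yield cleaned, non-empty lines from a plain-text or .gz corpus file."""
    # Assumption: files ending in .gz are gzip-compressed UTF-8 text.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            # Assumption: collapse runs of whitespace and lowercase every sentence.
            sentence = " ".join(line.split()).lower()
            if sentence:
                yield sentence


def merge_corpora(sources: Iterable[Path], output: Path) -> int:
    """Write the deduplicated union of all sources, one sentence per line.

    Returns the number of sentences written. Deduplication is a choice made
    for this sketch: it changes n-gram counts, so skip it if repeated
    sentences should carry more weight in the language model.
    """
    seen = set()  # holds every sentence in memory; fine for modest corpora
    with open(output, "w", encoding="utf-8") as out:
        for source in sources:
            for sentence in read_sentences(source):
                if sentence not in seen:
                    seen.add(sentence)
                    out.write(sentence + "\n")
    return len(seen)
```

Calling something like `merge_corpora([Path("base_corpus.txt.gz"), Path("custom_sentences.txt")], Path("merged_corpus.txt"))` (all three file names are placeholders) would produce a single text file that can then be used as the input corpus for the steps in [3].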

Now, for the English model, the document [3] states that the original sentence corpus to be used/integrated is the LibriSpeech corpus: http://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz

Question: Where can I find the corresponding Italian corpus, used to build both Italian models [1] and [2]?

I can't find the corpus in this repo, but I found this file: https://github.com/MozillaItalia/voice-web/blob/master/server/data/it/frasi.txt

However, it seems very short. Is it perhaps only a part of the corpus? Which file was used to generate the scorer of the pretrained models [1] and [2]?
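
To put a number on "very short", a quick size check of a candidate corpus file can help. The sketch below is a hypothetical helper (the name `corpus_stats` and the whitespace tokenization are assumptions made for illustration) that counts sentences, tokens and distinct words in a one-sentence-per-line text file.

```python
from collections import Counter
from pathlib import Path


def corpus_stats(path: Path, top_n: int = 5) -> dict:
    """Count sentences, tokens and distinct words in a one-sentence-per-line file."""
    words = Counter()
    sentences = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Assumption: lowercase and split on whitespace; no punctuation handling.
            tokens = line.lower().split()
            if tokens:  # blank lines are not counted as sentences
                sentences += 1
                words.update(tokens)
    return {
        "sentences": sentences,
        "tokens": sum(words.values()),
        "vocabulary": len(words),
        "most_common": words.most_common(top_n),
    }
```

Running it on a downloaded copy of the file and on a larger corpus gives a rough idea of how far apart the two are in size and vocabulary.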

Thanks, Giorgio

Mte90 commented 3 years ago

CV tiny is just Common Voice with a few files, for testing the tool.

The other one is just CV and M-AILABS (you can check the code in this repo).

As for the text corpus at that link, it is an old project we compiled manually; the millions of new Italian sentences in CV are instead extracted randomly from Wikipedia.

nefastosaturo commented 3 years ago

Hi, this is the corpus used to generate the scorer file:

https://github.com/MozillaItalia/DeepSpeech-Italian-Model/releases/tag/Mitads-1.0.0-alpha2

You can also build the same scorer on the fly using the Colab notebook here; just load the deepspeech_lm notebook file:

https://github.com/MozillaItalia/DeepSpeech-Italian-Model/tree/master/notebooks

Mte90 commented 3 years ago

It is written on the release page: https://github.com/MozillaItalia/DeepSpeech-Italian-Model/releases/tag/2020.08.07

And again, [2] is Common Voice with just a few files; it isn't a model, just a mini dataset for testing the tool.

solyarisoftware commented 3 years ago

Thanks

MozillaItalia / DeepSpeech-Italian-Model

Building a custom external scorer (extending the Italian text corpus) #126