MozillaItalia / DeepSpeech-Italian-Model

Tooling for producing Italian model (public release available) for DeepSpeech and text corpus
GNU General Public License v3.0

Building a custom external scorer (extending the Italian text corpus) #126

Closed by solyarisoftware 3 years ago

solyarisoftware commented 3 years ago

Hi all,

This topic probably belongs in the dataset / help wanted categories.

The problem: I want to build a custom scorer to improve DeepSpeech transcription accuracy in specific closed-domain contexts, extending the pre-trained Italian models available in this repo.

The two Italian pre-trained models are:

  1. https://github.com/MozillaItalia/DeepSpeech-Italian-Model/releases/download/2020.08.07/transfer_model_tensorflow_it.tar.xz
  2. https://github.com/MozillaItalia/DeepSpeech-Italian-Model/files/4610711/cv-it_tiny.tar.gz

Here is the official documentation on how to build a .scorer file:

  3. https://deepspeech.readthedocs.io/en/v0.9.3/Scorer.html#reproducing-our-external-scorer

Some more points from the forum:

  4. https://discourse.mozilla.org/t/help-how-to-generate-a-custom-scorer/75045
  5. https://discourse.mozilla.org/t/deepspeech-for-narrow-domain-bot-creation/74361

As far as I understand, you have two options to build the .scorer file:

Now, for the English model, the documentation [3] states that the original sentence corpus to be used/integrated is the LibriSpeech corpus: http://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz
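To make the "integrate" idea concrete, here is a minimal sketch of extending a base sentence corpus with closed-domain sentences before building the language model. The function names, the character set kept by `normalize`, and the `repeat_domain` trick (repeating domain sentences to give them more weight in the n-gram counts) are all assumptions for illustration, not something the DeepSpeech documentation prescribes. The normalization should mirror whatever was applied to the base corpus.

```python
import gzip
import re

# Assumption: keep lowercase letters, common Italian accented vowels,
# the apostrophe and the space. Adjust to match the model's alphabet.txt.
_DROP = re.compile(r"[^a-zàèéìíîòóùú' ]")


def normalize(sentence: str) -> str:
    """Lowercase, replace unsupported characters with spaces, squeeze spaces."""
    text = _DROP.sub(" ", sentence.lower())
    return " ".join(text.split())


def merge_corpora(base_lines, domain_lines, repeat_domain=1):
    """Yield normalized base sentences, then normalized domain sentences.

    repeat_domain > 1 emits each domain sentence several times, a crude
    (hypothetical) way to weight the domain text more heavily.
    """
    for line in base_lines:
        norm = normalize(line)
        if norm:
            yield norm
    for line in domain_lines:
        norm = normalize(line)
        if norm:
            for _ in range(repeat_domain):
                yield norm


def write_corpus(lines, out_path):
    """Write one sentence per line to a gzip-compressed text file."""
    count = 0
    with gzip.open(out_path, "wt", encoding="utf-8") as fh:
        for line in lines:
            fh.write(line + "\n")
            count += 1
    return count
```

For example, `write_corpus(merge_corpora(open("base.txt", encoding="utf-8"), open("domain.txt", encoding="utf-8"), repeat_domain=3), "merged.txt.gz")` would produce a single file to feed to the language model tooling; both input file names are placeholders.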

Question: Where can I find the corresponding Italian corpus, used to build both Italian models [1] and [2]?

I can't find the corpus in this repo, but I found this file: https://github.com/MozillaItalia/voice-web/blob/master/server/data/it/frasi.txt

However, it seems very short. Is it maybe just a part? Which file was used to generate the scorer for the pre-trained models [1] and [2]?

Thanks,
giorgio

Mte90 commented 3 years ago

CV tiny is just Common Voice with a few files, for testing the tool.

The other one is just CV and M-AILABS (you can check the code in this repo).

As for the text corpus at that link, it comes from an old project we did manually; the millions of new Italian sentences in CV are instead extracted randomly from Wikipedia.

nefastosaturo commented 3 years ago

Hi, this is the corpus used to generate the scorer file:

https://github.com/MozillaItalia/DeepSpeech-Italian-Model/releases/tag/Mitads-1.0.0-alpha2

You can also build the same scorer on the fly using the Colab notebook here; just load the deepspeech_lm notebook file:

https://github.com/MozillaItalia/DeepSpeech-Italian-Model/tree/master/notebooks
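For reference, outside the notebook the procedure in the DeepSpeech v0.9.3 scorer documentation is two steps: `generate_lm.py` builds the KenLM files from the corpus, then `generate_scorer_package` bundles them into a .scorer. The sketch below only assembles the two command lines so they can be inspected before running. The flag names follow that documentation as best recalled and should be checked against it; the helper name, all paths, and the `top_k`, `alpha` and `beta` values in the usage note are placeholders, and whether the notebook uses the same settings is not confirmed here.

```python
import shlex


def scorer_commands(corpus, kenlm_bins, alphabet, top_k, alpha, beta, out_dir="."):
    """Return the (generate_lm, generate_scorer_package) command lines as lists.

    Hypothetical helper: it does not run anything. alpha and beta must be
    supplied by the caller; good values depend on the model and language.
    """
    generate_lm = [
        "python3", "generate_lm.py",
        "--input_txt", corpus,
        "--output_dir", out_dir,
        "--top_k", str(top_k),
        "--kenlm_bins", kenlm_bins,
        "--arpa_order", "5",
        "--max_arpa_memory", "85%",
        "--arpa_prune", "0|0|1",
        "--binary_a_bits", "255",
        "--binary_q_bits", "8",
        "--binary_type", "trie",
    ]
    package = [
        "./generate_scorer_package",
        "--alphabet", alphabet,
        "--lm", f"{out_dir}/lm.binary",
        "--vocab", f"{out_dir}/vocab-{top_k}.txt",
        "--package", f"{out_dir}/kenlm.scorer",
        "--default_alpha", str(alpha),
        "--default_beta", str(beta),
    ]
    return generate_lm, package


def as_shell(command):
    """Render a command list as a copy-pasteable, safely quoted shell line."""
    return shlex.join(command)
```

Calling `scorer_commands("merged.txt.gz", "kenlm/build/bin/", "alphabet.txt", top_k=50000, alpha=0.9, beta=1.2)` and printing each result with `as_shell` gives two lines to review and run by hand.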

Mte90 commented 3 years ago

It is written on the release page: https://github.com/MozillaItalia/DeepSpeech-Italian-Model/releases/tag/2020.08.07

And again, [2] is Common Voice with just a few files; it isn't a model, just a mini dataset to test the tool.

solyarisoftware commented 3 years ago

Thanks