Closed solyarisoftware closed 3 years ago
CV tiny is just Common Voice with a few files, for testing the tool.
The other is just CV and M-AILABS (you can check the code in this repo).
As for the text corpus at that link, it is an old project we did manually; the millions of new Italian sentences in CV are instead extracted randomly from Wikipedia.
Hi, this is the corpus used to generate the scorer file:
https://github.com/MozillaItalia/DeepSpeech-Italian-Model/releases/tag/Mitads-1.0.0-alpha2
You can also build the same scorer on the fly using the Colab notebooks here; just load the deepspeech_lm notebook file:
https://github.com/MozillaItalia/DeepSpeech-Italian-Model/tree/master/notebooks
This is stated on the release page: https://github.com/MozillaItalia/DeepSpeech-Italian-Model/releases/tag/2020.08.07
And again, [2] is Common Voice with just a few files; it isn't a model, just a mini dataset to test the tool.
Thanks
Hi all,
The topic maybe belongs in the dataset / help wanted categories.

The problem: I want to build a custom scorer to improve DeepSpeech transcription accuracy in specific closed-domain contexts, extending the pre-trained Italian models available in this repo.
The two Italian pre-trained models are:
Here is the official documentation about how to build a .scorer file:
Some more points in the forum:
As far as I understand, there are two options for building the .scorer file:
Building it from scratch, using just a custom corpus of sentences. In this case, I know how to proceed: I create my custom text file, one sentence per line, and I generate the final scorer by following the steps detailed in the above-mentioned doc ("reproducing our external scorer") [3].
Extending the corpus of the pretrained models by adding custom sentences. This is my preferred way to go, because I want to enhance the original corpus (of the pretrained models) with my closed-domain custom sentences and verify, at the end of the day, whether I get better overall accuracy in the custom context.
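The second option comes down to producing one text file, one sentence per line, that contains the original corpus plus the custom sentences, and then feeding that file to the scorer-building steps from [3]. A minimal sketch of that merge step follows. The file names, the lowercase normalization, and the `custom_weight` repetition factor are all assumptions made for illustration. They do not come from the DeepSpeech documentation or from this repo, and the normalization should be adapted to the alphabet the model was trained with.

```python
import gzip
import re
from pathlib import Path


def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace (assumption: the model alphabet is lowercase)."""
    return re.sub(r"\s+", " ", sentence.strip().lower())


def read_sentences(path: Path):
    """Yield normalized, non-empty sentences from a plain-text or gzipped file."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            sentence = normalize(line)
            if sentence:
                yield sentence


def merge_corpora(base: Path, custom: Path, output: Path, custom_weight: int = 1) -> int:
    """Write the base corpus followed by the custom sentences; return the lines written.

    Each distinct custom sentence is written `custom_weight` times. Repeating
    sentences is only a heuristic to raise their n-gram counts relative to the
    much larger base corpus; it is a hypothetical knob, not a documented one.
    """
    written = 0
    with open(output, "w", encoding="utf-8") as out:
        # Base corpus is copied through unchanged apart from normalization.
        for sentence in read_sentences(base):
            out.write(sentence + "\n")
            written += 1
        # Custom sentences are de-duplicated among themselves, then repeated.
        seen = set()
        for sentence in read_sentences(custom):
            if sentence in seen:
                continue
            seen.add(sentence)
            for _ in range(custom_weight):
                out.write(sentence + "\n")
                written += 1
    return written
```

Calling, for example, `merge_corpora(Path("base_corpus.txt"), Path("my_domain.txt"), Path("merged.txt"), custom_weight=5)` with hypothetical file names would produce the merged file to use as the input corpus. Whether the repetition actually helps accuracy in the closed domain is something to measure on a held-out test set, not to assume.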
Now, for the English model, the document [3] states that the original sentence corpus to be used/integrated is the LibriSpeech corpus: http://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz
Question: Where can I find the corresponding Italian corpus, used to build both Italian models [1] and [2]?
I can't find the corpus in this repo, but I found this file: https://github.com/MozillaItalia/voice-web/blob/master/server/data/it/frasi.txt
However, it seems very short. Is it maybe just a part? Which file was used to generate the scorer of the pretrained models [1] and [2]?
Thanks,
Giorgio