Hi,
I'm afraid they are indeed no longer available. My university canceled its contract with Google and we lost our storage space on GDrive. I'm still looking for an alternative, but unfortunately we don't have a solid plan or a deadline.
For now I believe some of the open corpora can still be downloaded from GitLab: https://gitlab.com/fb-audio-corpora
Thanks a lot for the response. I admire your work.
In your kaldi-br scripts, you assume the folders are in a datasets directory with exact names like the items of List 1. If I download and use a corpus from fb-audio-corpora, how do I match them? Does the folder and file structure (tree) inside a dataset matter? For instance, which item of List 2 is cetuc?
Let's say I trained a model with your recipe and now want to compare my result with yours (your Vosk model will be the baseline for my work). How do I make sure that we used the same data? I mean, how do I make sure that I succeeded in reproducing your result?
Hi,
Just made CETUC public on GitLab. Don't use LapsMail or dectalk11k: the transcriptions of the former were automatically generated with a poor ASR model, while the latter is a synthetic dataset in English. Spoltech and Westpoint are private, so we can't share them. I'm not sure about the LapsStory license, so I won't be sharing it either.
VoxForge, Common Voice, etc. can all be downloaded externally from the web.
If you download and extract all datasets into a single root folder named datasets, the scripts should work fine.
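A quick way to confirm the layout before launching a recipe is to check that each expected corpus folder exists under the root. The sketch below is only an illustration and is not part of kaldi-br: the folder names in EXPECTED are placeholders, and the exact names the scripts look for have to be taken from the recipe itself.

```python
from pathlib import Path

# Placeholder folder names (assumption): replace them with the exact
# names the recipe scripts expect under the datasets root.
EXPECTED = ["cetuc", "lapsbm", "coraa"]


def missing_datasets(root, expected=EXPECTED):
    """Return the expected folder names that are not directories under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_datasets("datasets")
    if missing:
        print("Missing dataset folders:", ", ".join(missing))
    else:
        print("All expected dataset folders are present.")
```

This only checks that the top-level folders exist. It says nothing about the file tree inside each corpus.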
To reproduce, you'd have to follow our fb-falabrasil recipe as is and get the same WER values as in https://github.com/falabrasil/kaldi-br/tree/master/fb-falabrasil. But I'm afraid that without all the datasets the numbers won't match exactly. It's worth a try, though. I think CORAA is the largest corpus of all, so you won't be missing much by not using the private data.
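On the question of whether two runs used the same data: matching WER is indirect evidence at best. A more direct check, assuming both sides are willing to exchange a file listing, is to hash every file under the datasets root and compare the resulting manifests. The sketch below is a hypothetical helper, not something the kaldi-br recipes provide, and no reference manifest is published in this thread.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large audio files are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_manifests(mine, theirs):
    """Return (missing, extra, changed) lists of relative paths."""
    missing = sorted(set(theirs) - set(mine))
    extra = sorted(set(mine) - set(theirs))
    changed = sorted(k for k in set(mine) & set(theirs) if mine[k] != theirs[k])
    return missing, extra, changed
```

If all three lists come back empty, the two dataset trees are byte-identical. Remaining WER differences would then have to come from the recipe, the toolchain versions, or randomness in training.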
If you're taking your first steps with Kaldi, I'd recommend the fb-lapsbm recipe: it trains a model using only LapsBM data (1h of audio).
After going through your guide and running dvc pull, I get the following error:
ERROR: unexpected error - : <HttpError 404 when requesting https://www.googleapis.com/drive/v2/files/1ijAenbwQnMhwoQj5x35f9-kg1jKd-nUt?fields=driveId&supportsAllDrives=true&alt=json returned "File not found: 1ijAenbwQnMhwoQj5x35f9-kg1jKd-nUt". Details: "[{'message': 'File not found: 1ijAenbwQnMhwoQj5x35f9-kg1jKd-nUt', 'domain': 'global', 'reason': 'notFound', 'location': 'file', 'locationType': 'other'}]">
Does this mean the files are no longer available, or am I having authentication issues?
Thanks in advance.