falabrasil / speech-datasets

🗣️🇧🇷 Transcribed audio datasets in Brazilian Portuguese

Are dvc files still available? #4

Closed bekirbakar closed 2 years ago

bekirbakar commented 2 years ago

After going through your guide and running dvc pull, I get the following error:

ERROR: unexpected error - : <HttpError 404 when requesting https://www.googleapis.com/drive/v2/files/1ijAenbwQnMhwoQj5x35f9-kg1jKd-nUt?fields=driveId&supportsAllDrives=true&alt=json returned "File not found: 1ijAenbwQnMhwoQj5x35f9-kg1jKd-nUt". Details: "[{'message': 'File not found: 1ijAenbwQnMhwoQj5x35f9-kg1jKd-nUt', 'domain': 'global', 'reason': 'notFound', 'location': 'file', 'locationType': 'other'}]">

Does this mean the files are no longer available, or am I having authentication issues?

Thanks in advance.

cassiotbatista commented 2 years ago

Hi,

I'm afraid they're not available indeed. My university has canceled the contract with Google and we lost storage space in GDrive. I'm still looking for an alternative but unfortunately we don't have a solid plan or a deadline.

For now I believe some of the open corpora can still be downloaded from GitLab: https://gitlab.com/fb-audio-corpora
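The GitLab group hosts each corpus as its own repository, so they can be fetched with plain git. A minimal sketch, printed as a dry run; the repository names used here (lapsbm16k, cetuc) are assumptions for illustration, so browse the group page for the actual list:

```shell
# Dry run: print the clone commands for the corpora we want.
# NOTE: 'lapsbm16k' and 'cetuc' are assumed repo names -- check
# https://gitlab.com/fb-audio-corpora for what actually exists.
GROUP_URL="https://gitlab.com/fb-audio-corpora"
for repo in lapsbm16k cetuc; do
  echo "git clone $GROUP_URL/$repo.git datasets/$repo"
done
```

Dropping the echo would run the clones for real, placing each corpus under a datasets/ root.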

bekirbakar commented 2 years ago

Thanks a lot for the response. I admire your work.

In your kaldi-br scripts, you assume the folders are in a datasets directory with exact names like the items of List 1. If I download and use corpora from fb-audio-corpora, how do I match them? Does the folder/file structure (tree) inside each dataset matter? For instance, which item of List 2 is cetuc?

Let's say I trained a model with your recipe and now I want to compare my results with yours (your Vosk model will be a baseline for my work). How do I make sure that we used the same data? I mean, how do I make sure that I succeeded in reproducing your setup?

List 1: List of Files on Google Drive

List 2: Files on Gitlab (fb-audio-corpora)

cassiotbatista commented 2 years ago

Hi,

Just made CETUC public on GitLab. Don't use LapsMail nor dectalk11k: the transcriptions of the former were automatically generated with a poor ASR model, while the latter is a synthetic dataset in English. Spoltech and Westpoint are private, so we can't share them. I'm not sure about the LapsStory license, so I won't be sharing that either.

VoxForge, Common Voice, etc can all be externally downloaded from the web.

If you download and extract all the datasets into a single root folder named datasets, the scripts should work fine.
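A minimal sketch of that layout check: the subdirectory names below are assumptions for illustration, not confirmed names; the kaldi-br scripts are the authoritative source for what they expect under datasets/.

```shell
# Sketch of the expected layout: one subfolder per corpus under a
# single datasets/ root. The names 'cetuc', 'lapsbm16k' and 'voxforge'
# are assumed examples -- verify against the kaldi-br scripts.
mkdir -p datasets/cetuc datasets/lapsbm16k datasets/voxforge
for d in cetuc lapsbm16k voxforge; do
  [ -d "datasets/$d" ] && echo "found datasets/$d"
done
```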

To reproduce, you'd have to follow our fb-falabrasil recipe as-is and get the same WER values as in https://github.com/falabrasil/kaldi-br/tree/master/fb-falabrasil. But I'm afraid that without all the datasets the numbers won't match exactly. It's worth a try, though. I think CORAA is the largest corpus among all, so you won't be missing much by not using the private data.

If you're baby stepping into Kaldi, I'd recommend fb-lapsbm recipe: it trains a model using only LapsBM data (1h of audio).