facebookresearch / libri-light

Dataset for lightly supervised training using the LibriVox audiobook recordings. https://librivox.org/
MIT License

Making Large Supervised Datasets for English / German / Spanish #35

Open snakers4 opened 4 years ago

snakers4 commented 4 years ago

Hi,

I have not found any contact information in the press release or in the paper (please correct me if I am wrong), so I decided to open an issue here to reach out.

My name is Alexander; I am the main author of Open STT and of these recent articles from The Gradient:

TL;DR: we have collected 30k hours of annotated speech in Russian with close to zero investment in manual annotation, and we are now doing the same in English / German / Spanish. My personal goal is to collect 10-20k hours in English and 10k hours in German and Spanish. We chose these languages (apart from English, of course) because they are popular, we speak them (at least I can read them), and their phonetics are fairly simple and similar to Russian.

On the Russian data we have built production-grade models and even deployed some high-load services into production (if you speak Russian, please follow these links: http://silero.ai/, https://mobile-demo.silero.ai/, https://habr.com/ru/post/494006/).

I wonder if FAIR (please correct me if FAIR and facebookresearch are not the same entity) would be interested in any win-win collaboration, or in sponsoring our efforts to fully open-source our models and datasets.

Libri-Light offers 60k+ hours of unlabelled speech, a small training set for limited supervision (10 h, 1 h, or 10 minutes of labelled speech), and a common set of metrics to evaluate three settings:

You can build almost fully supervised datasets from Librivox (granted, there will be some noise in the data, of course). I wonder why you did not do / share this. This is such a low-hanging fruit!
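For instance, a chapter's audio can be force-aligned to its public-domain source text to get (noisy) labels. Here is a minimal sketch assuming the `aeneas` forced aligner is installed; the file paths are hypothetical:

```python
# Align one LibriVox chapter to its source text with the aeneas
# forced aligner (https://github.com/readbeyond/aeneas).
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# Plain-text alignment task: one text fragment per line, JSON output.
config = u"task_language=eng|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config)
task.audio_file_path_absolute = "/data/librivox/chapter_01.mp3"    # hypothetical path
task.text_file_path_absolute = "/data/gutenberg/chapter_01.txt"    # hypothetical path
task.sync_map_file_path_absolute = "/data/aligned/chapter_01.json"

# Run the aligner and write the sync map (text fragment -> begin/end times),
# which can then be cut into utterance-level (audio, transcript) pairs.
ExecuteTask(task).execute()
task.output_sync_map_file()
```

Low-confidence alignments would still need to be filtered out, which is where most of the noise comes from.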

Best, Alexander

snakers4 commented 4 years ago

Also, I wonder why you use FLAC rather than a modern speech-oriented codec like Opus? FLAC is a lossless format designed for music, and it takes much more space than Opus.
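For example, transcoding a single file makes the size difference obvious. This is a rough sketch assuming ffmpeg built with libopus is available; the 24 kb/s bitrate is just a common choice for speech, and the file names are hypothetical:

```python
# Compare on-disk size of a FLAC file against an Opus transcode.
import os
import subprocess

src = "sample.flac"   # hypothetical input file
dst = "sample.opus"

# Transcode with ffmpeg's libopus encoder at a speech-oriented bitrate.
subprocess.run(
    ["ffmpeg", "-y", "-i", src, "-c:a", "libopus", "-b:a", "24k", dst],
    check=True,
)

flac_mb = os.path.getsize(src) / 1e6
opus_mb = os.path.getsize(dst) / 1e6
print(f"FLAC: {flac_mb:.1f} MB, Opus: {opus_mb:.1f} MB "
      f"({flac_mb / opus_mb:.0f}x smaller)")
```

The trade-off, of course, is that Opus is lossy, so it depends on whether bit-exact audio matters for the intended experiments.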