Hi,

I have not found any contacts in the press release or in the paper (please correct me if I am wrong), so I decided to open an issue here to reach out.

My name is Alexander; I am the main author of Open STT and of these recent articles from The Gradient:

TLDR - we have collected 30k hours of annotated speech in Russian with close to zero investment in manual annotation, and we are doing the same for English / German / Spanish. My personal goal is to collect 10-20k hours in English and 10k each in German and Spanish. We chose these languages (apart from English, of course) because they are popular, we speak them (at least I can read them), and their phonetics are relatively simple and similar to Russian.

On Russian data we have built production-grade models and have even deployed some high-load services into production (if you speak Russian, please follow these links: http://silero.ai/, https://mobile-demo.silero.ai/, https://habr.com/ru/post/494006/).

I wonder whether FAIR (please correct me if FAIR and facebookresearch are not the same entity) would be interested in a win-win collaboration, or in sponsoring our efforts to fully open-source our models and datasets.

Regarding the dataset description:

> Libri-Light offers 60k+ hours of unlabelled speech, a small training set for limited supervision (10h, 1h or 10 minutes of labelled speech), and a common set of metrics to evaluate three settings:

You can build almost fully supervised datasets from Librivox (granted, there will be some noise in the data, of course). I wonder why you did not do / share this - it is such a low-hanging fruit! A rough sketch of what I mean is at the bottom of this message.

Also, I wonder why you use FLAC rather than a modern speech-oriented codec like Opus? FLAC is a lossless format made for music, and it takes much more space than Opus.
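To illustrate the kind of size difference I have in mind, here is a minimal sketch (Python driving ffmpeg; it assumes an ffmpeg build with libopus on the PATH, and `sample.flac` is just a placeholder file name):

```python
# Minimal sketch: transcode one FLAC file to Opus and compare file sizes.
# Assumes ffmpeg with libopus is installed; 32 kbit/s is only an illustrative
# bitrate that tends to be more than enough for ASR-style 16 kHz speech.
import os
import subprocess

def flac_to_opus(flac_path: str, bitrate: str = "32k") -> str:
    """Transcode a FLAC file to Opus and print the size difference."""
    opus_path = os.path.splitext(flac_path)[0] + ".opus"
    subprocess.run(
        ["ffmpeg", "-y", "-i", flac_path, "-c:a", "libopus", "-b:a", bitrate, opus_path],
        check=True,
    )
    flac_mb = os.path.getsize(flac_path) / 2**20
    opus_mb = os.path.getsize(opus_path) / 2**20
    print(f"{flac_path}: {flac_mb:.1f} MiB -> {opus_path}: {opus_mb:.1f} MiB")
    return opus_path

if __name__ == "__main__":
    flac_to_opus("sample.flac")  # placeholder input file
```

The bitrate is of course a knob, but the general point stands: a lossy, speech-oriented codec usually cuts storage severalfold compared to lossless FLAC.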
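And on the Librivox point above, this is the rough sketch I referred to: enumerate books that link both the audio and their source text, which is the starting point for pairing audio with text to obtain (noisy) labels. It assumes the public LibriVox catalogue API at librivox.org/api/feed/audiobooks and its `url_text_source` / `url_zip_file` fields; the exact parameters and schema may need double-checking.

```python
# Rough sketch: list LibriVox books that come with both audio and a text source.
# Field names (url_text_source, url_zip_file, language) follow the public
# LibriVox catalogue API as I understand it and may need adjusting.
import requests

API = "https://librivox.org/api/feed/audiobooks"

def list_books_with_text(language="English", limit=50, offset=0):
    """Return books exposing both an audio archive and a source-text link."""
    params = {"format": "json", "limit": limit, "offset": offset}
    books = requests.get(API, params=params, timeout=30).json().get("books", [])
    return [
        {
            "title": b.get("title"),
            "audio_zip": b.get("url_zip_file"),       # archive of chapter audio
            "text_source": b.get("url_text_source"),  # usually a Project Gutenberg page
        }
        for b in books
        if b.get("language") == language and b.get("url_text_source")
    ]

if __name__ == "__main__":
    for book in list_books_with_text()[:5]:
        print(book["title"], "->", book["text_source"])
```

From there it is text scraping plus some form of alignment between audio and text, which is where the noise comes in - but that is the low-hanging fruit I mean.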
Best,
Alexander