clarification on the data

Idlak / Living-Audio-Dataset

A "Crowd-Built" continuously growing speech dataset with transcripts. The dataset contains multiple languages and is intended for anyone to be able to add to it.

Apache License 2.0


clarification on the data #26

Closed: naarkhoo closed this 3 years ago

naarkhoo commented 5 years ago

I am very happy to have landed on this git repository. However, I was expecting to see something like the following: for each language, one directory of audio files and a CSV file with the corresponding text.

Compared with that assumption, it looks like you are building something more sophisticated. I am not an expert in the field and am doing this as a hobby, so I would appreciate it if you could clarify what my "directory structure" is missing!

Thanks.

dabraude commented 5 years ago

Hi,

The structure is as follows: each language has accent subdirectories, and each accent has speaker subdirectories. We use 2-letter language codes, 2-letter accent codes, and 3-letter speaker codes, i.e. ln/ac/spk. For example, the English, Received Pronunciation speaker whose code is rbu would be at en/rp/rbu
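To make the ln/ac/spk layout concrete, here is a minimal Python sketch that builds the relative path for a speaker. The function name `speaker_dir` is made up for illustration and is not part of the repository; the only rules it encodes are the code lengths described above.

```python
from pathlib import PurePosixPath


def speaker_dir(language: str, accent: str, speaker: str) -> PurePosixPath:
    """Build the relative ln/ac/spk path for one speaker (hypothetical helper)."""
    # Code lengths follow the description above: 2, 2 and 3 letters.
    for name, code, length in (
        ("language", language, 2),
        ("accent", accent, 2),
        ("speaker", speaker, 3),
    ):
        if len(code) != length or not code.isalpha():
            raise ValueError(f"{name} code must be {length} letters, got {code!r}")
    return PurePosixPath(language, accent, speaker)


print(speaker_dir("en", "rp", "rbu"))  # en/rp/rbu
```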

In that directory are:

text.xml: an XML document, which should be interpretable as an SSML document. It contains the transcript for each file.
audiourl: the direct download link for the audio on the Internet Archive. You can download it by pasting that link into your browser, or with a program like wget. It comes down as a .tar.gz, which you can extract with tar on Linux or Mac and with 7-Zip on Windows.
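If you want the transcripts as a simple id-to-text table (close to the CSV you had in mind), something like the following Python sketch could work. The element and attribute names are assumptions: the sample below uses a made-up `utt` tag with an `id` attribute, so check the real text.xml and pass whatever names it actually uses.

```python
import xml.etree.ElementTree as ET


def read_transcripts(xml_text: str, tag: str, id_attr: str) -> dict[str, str]:
    """Map each element's id attribute to its text content."""
    root = ET.fromstring(xml_text)
    transcripts = {}
    for element in root.iter(tag):
        file_id = element.get(id_attr)
        if file_id is None:
            continue  # skip elements that carry no id
        # itertext() also collects text inside nested markup such as SSML tags.
        transcripts[file_id] = "".join(element.itertext()).strip()
    return transcripts


# Hypothetical sample; the tag and attribute names are NOT taken from the repo.
SAMPLE = """<script>
  <utt id="spk_0001">Hello there.</utt>
  <utt id="spk_0002">How are <emphasis>you</emphasis>?</utt>
</script>"""

print(read_transcripts(SAMPLE, tag="utt", id_attr="id"))
```

From the resulting dictionary it is a short step to writing a CSV with the standard `csv` module.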

The idea was that you could clone the repo without worrying about downloading several gigs of audio per speaker, and instead only download the audio you want.

There is a tool written in Python 3 that downloads the audio for a given speaker, if you want to use that.
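If you would rather script the manual steps yourself, a minimal sketch using only the Python standard library might look like this. It is not the repository's tool. It assumes that audiourl is a plain text file holding a single URL, and the function names and the `audio.tar.gz` file name are made up for illustration.

```python
import tarfile
import urllib.request
from pathlib import Path


def read_audio_url(speaker_dir: Path) -> str:
    """Return the URL stored in the speaker's audiourl file."""
    # Assumption: the file holds one URL, possibly followed by whitespace.
    return (speaker_dir / "audiourl").read_text(encoding="utf-8").strip()


def fetch_audio(speaker_dir: Path, out_dir: Path) -> Path:
    """Download the speaker's .tar.gz archive and extract it into out_dir."""
    url = read_audio_url(speaker_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    archive = out_dir / "audio.tar.gz"
    urllib.request.urlretrieve(url, archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(out_dir)  # only extract archives you trust
    return out_dir
```

Calling `fetch_audio(Path("en/rp/rbu"), Path("audio/rbu"))` from a clone of the repository would then download and unpack the audio for that one speaker only.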

Hope that clarifies everything

naarkhoo commented 5 years ago

Thanks. Is there any script to transcribe a long audio file (1 hour)? I can split the audio based on volume, but how about splitting the text so that it matches? (This is not an alignment question.)
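For reference, the volume-based split mentioned here can be sketched in a few lines of Python. This is only one simple way to do it (mean absolute amplitude per frame against a fixed threshold); the function name and all parameter values are illustrative assumptions, and it does nothing about splitting the matching text.

```python
def split_on_silence(samples, frame_len, threshold, min_silence_frames):
    """Return (start, end) sample ranges of the non-silent stretches."""
    # Mean absolute amplitude per frame is a crude volume measure.
    # Samples left over after the last full frame are ignored.
    n_frames = len(samples) // frame_len
    loud = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        loud.append(sum(abs(s) for s in frame) / frame_len >= threshold)

    segments = []
    start = None      # first frame of the current segment, if any
    silent_run = 0    # consecutive quiet frames seen inside the segment
    for i, is_loud in enumerate(loud):
        if is_loud:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                end = i - silent_run + 1  # index of the first quiet frame
                segments.append((start * frame_len, end * frame_len))
                start = None
                silent_run = 0
    if start is not None:
        # Close the last segment, dropping any trailing quiet frames.
        end = n_frames - silent_run
        segments.append((start * frame_len, end * frame_len))
    return segments


samples = [0] * 4 + [10] * 4 + [0] * 4 + [10] * 4 + [0] * 2
print(split_on_silence(samples, frame_len=2, threshold=5, min_silence_frames=2))
# [(4, 8), (12, 16)]
```

Short pauses (fewer than `min_silence_frames` quiet frames) stay inside a segment, which keeps a sentence from being cut at every breath.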

dabraude commented 5 years ago

No script yet, but if you have both the audio and the transcript, we could probably figure something out.

naarkhoo commented 5 years ago

Well, I am working on Persian, and one of the challenges is transcription! The Google API does not perform well, so I am looking around for alternatives. One could keep only the audio that the Google API scores highly, but even those transcripts are not perfect and need trimming, etc. So I think I need to set up a GUI to do that part, unless you have a better idea for low-resource languages.
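The "keep only the high-scoring audio" idea could be sketched like this in Python. The `Hypothesis` record, the 0 to 1 confidence range, and the 0.9 cut-off are all assumptions for illustration, not the actual output format of any particular API.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    audio_file: str
    transcript: str
    confidence: float  # assumed to lie between 0 and 1


def split_by_confidence(hypotheses, min_confidence=0.9):
    """Separate transcripts to keep from those that need manual review."""
    keep = [h for h in hypotheses if h.confidence >= min_confidence]
    review = [h for h in hypotheses if h.confidence < min_confidence]
    return keep, review


results = [
    Hypothesis("clip_001.wav", "first transcript", 0.95),
    Hypothesis("clip_002.wav", "second transcript", 0.62),
]
keep, review = split_by_confidence(results)
print([h.audio_file for h in keep])    # ['clip_001.wav']
print([h.audio_file for h in review])  # ['clip_002.wav']
```

The `review` list is what would go to a manual correction tool, so the cut-off trades dataset size against how much hand-editing is needed.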

dabraude commented 5 years ago

That is an open area of research for us. We have been trying a few things, and the main issue is getting engagement. With this project we at least have a way of uploading the data we do get. In terms of existing tools, the best I can recommend are Audacity, Praat, and, if you have the money for it, Pro Tools. WaveSurfer is creaking at the seams, but it still has some useful functionality.

If you have any ideas about how to get people to contribute to low-resource languages, I would be happy to collaborate. The other thing to bear in mind is that getting transcriptions done is not actually all that expensive.

bpotard commented 5 years ago

If you are using Windows (or have access to a Windows machine), it may also be worth having a look at Nova: https://github.com/hcmlab/nova

It is aimed more at transcribing and annotating videos, but it can also be used with audio only, and it is fairly easy to use for creating and correcting ASR transcriptions.


-- Dr. Blaise Potard Post-doctoral researcher / ARIA-VALUSPA project CereProc LTD "Thinking Technology - Talking Business" CereProc Ltd produces synthesised speech systems that offer individual voices with character and personality Postal Address: Argyle House - CodeBase Floor D; 3 Lady Lawson Street; Edinburgh EH3 9DR

naarkhoo commented 5 years ago

Thanks @bpotard, that is very similar to what I am looking for. I really like https://voice.mozilla.org, but of course it's not for editing/adjusting. So ideally I need a platform like Mozilla's (gamified, playful) where people can instead adjust/trim voice and text. They could upload their video/audio, the system would split it up, and they could invite their team and start building/curating data.

Thanks again for the hint about Nova.