SpeechColab / GigaSpeech

Large, modern dataset for speech recognition
Apache License 2.0

[QUESTION] A few data-related questions #29

Closed GNroy closed 3 years ago

GNroy commented 3 years ago

Thanks for providing a new dataset!

Reading the README left me with a few questions, though:

  1. Which language(s) are these data in?
  2. Are the training sets annotated? If yes, what type of annotation was used?
  3. Is the Apache-2.0 license valid for the YouTube subset?
  4. Is the Audiobook subset derived from the LibriVox project? If yes, are there any overlaps with LibriSpeech or LibriLight?

-- Thanks, Aleksandr Laptev

chenguoguo commented 3 years ago

Hey thanks a lot for the questions! We are still finalizing the dataset so things may change... Trying to answer as much as I can:

  1. Language for now is English. But we plan to add support for other languages in the future. We will have to sort out the priorities.
  2. The training set was not manually annotated; we used forced alignment, which allows us to work on datasets at a large scale. We did try our best to control the accuracy. The eval set is being manually annotated at the moment.
  3. The Apache2 license only covers the scripts in this repo. As for the dataset, we haven't finalized the license yet, but we would love it to be as open as possible. The constraint/fact is that we don't own the original audio; the original publisher owns it. We only own the derived work, e.g., the segments. Let us know if you have any suggestions on this.
  4. Yes, the Audiobook subset was derived from the LibriVox project. We made sure that it didn't include the LibriSpeech test data, but we didn't go the extra mile to exclude the LibriSpeech or LibriLight training data. So there might be overlaps.

JRMeyer commented 3 years ago

If possible, it would be great to make sure that the original YouTube data was released by its creators under a Creative Commons license, and not the YouTube license.

dophist commented 3 years ago

We can't claim rights to the audio; it will be more like the approach of ImageNet.

JRMeyer commented 3 years ago

What's the approach of ImageNet? I'm not familiar with their distribution approach.

chenguoguo commented 3 years ago

We updated the README and it's ready for download now. You will have to agree to the Terms of Access.

dophist commented 3 years ago

Since the GigaSpeech dataset has been officially released, the questions listed above are answered in our paper and the repo's README. I'm closing this.