SpeechColab / GigaSpeech

Large, modern dataset for speech recognition
Apache License 2.0

duplicates and some youtube video links are wrong. Observations [not issue] #129

Open npovey opened 1 year ago

npovey commented 1 year ago

There are:

YouTube opus files = 21472
unique YouTube video links = 21127
21472 - 21127 = 345

For these 345 files I noticed:

  1. In some cases, opus files are given the wrong YouTube link in the JSON file.
  2. I found one opus file that has a duplicate opus file. There are possibly more duplicates. Example of duplicates:
    ├── id_ZZZ7k8cMA-4
    │   ├── YOU0000010703.opus
    │   └── YOU0000012437.opus 
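
The count above can be reproduced with a small script. This is only a rough sketch: it assumes each audio entry in the JSON metadata is a dict with an `aid` key (the opus file ID) and a `url` key (the YouTube link), which should be checked against the actual file. The function names and the third sample entry are made up for illustration.

```python
from collections import defaultdict


def shared_links(audios):
    """Return {url: [aid, ...]} for every link used by more than one opus file."""
    groups = defaultdict(list)
    for audio in audios:
        # "url" and "aid" are assumed key names, not confirmed from the metadata
        groups[audio["url"]].append(audio["aid"])
    return {url: sorted(aids) for url, aids in groups.items() if len(aids) > 1}


def surplus_count(audios):
    """Number of opus files minus number of unique links (345 in the counts above)."""
    audios = list(audios)
    return len(audios) - len({audio["url"] for audio in audios})


# Two real IDs from the example above, plus one placeholder entry.
# The URL format is an assumption.
sample = [
    {"aid": "YOU0000010703", "url": "https://www.youtube.com/watch?v=ZZZ7k8cMA-4"},
    {"aid": "YOU0000012437", "url": "https://www.youtube.com/watch?v=ZZZ7k8cMA-4"},
    {"aid": "PLACEHOLDER_AID", "url": "https://www.youtube.com/watch?v=PLACEHOLDER"},
]

for url, aids in shared_links(sample).items():
    print(url, aids)
print("surplus files:", surplus_count(sample))
```

A shared link by itself does not tell whether the audio is a true duplicate or one of the files simply has the wrong link, so each group still needs to be checked by listening or by comparing the audio.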

All opus files have correct captions, so the observations above are not a big deal. Just wanted to mention my findings. Thanks.