Wikidepia / indonesian_datasets

NLP Datasets for Indonesian

How is unsupervised speech compiled? #4

Closed. xiaobobo-bilibili closed this issue 2 years ago

xiaobobo-bilibili commented 2 years ago

By "unsupervised" I mean this one. Can you tell me where you found these audio files and the corresponding transcripts?

Wikidepia commented 2 years ago

Most of it comes from anchor.fm and Indonesian TV YouTube channels. I use ASR to create all of the transcripts.

xiaobobo-bilibili commented 2 years ago

anchor.fm

May I also ask which ASR engine you used? Was it a third-party online service (Azure, Google Cloud, AWS) or an engine you built yourself? (I'm just curious and trying to gauge how reliable your amazing data is; I'm not interested in the legal side of things.)

xiaobobo-bilibili commented 2 years ago

Also, how was the segmentation done? Did you use a custom VAD module, or did you segment the audio using the timestamps in the subtitles?