SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for MEDISCO #591

Closed by SamuelCahyawijaya 1 month ago

SamuelCahyawijaya commented 3 months ago

Dataloader name: medisco/medisco.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?medisco

Dataset: medisco
Description: MEDISCO is a Medical Indonesian Speech Corpus. The medical text corpus is collected from five Indonesian online medical consultation websites. From the text corpus, we created a speech corpus that consists of 360 sentences read by 13 speakers. In total, our speech corpus contains 731 medical terms and consists of 4,680 utterances with a total duration of 10 hours.
Subsets: Train, Test
Languages: ind
Tasks: Automatic Speech Recognition
License: GNU General Public License v3.0 (gpl-3.0)
Homepage: https://huggingface.co/datasets/mrqorib/MEDISCO
HF URL: https://huggingface.co/datasets/mrqorib/MEDISCO
Paper URL: https://ieeexplore.ieee.org/abstract/document/8629259

akhdanfadh commented 3 months ago

self-assign

mrqorib commented 3 months ago

self-assign

@akhdanfadh Sorry, would you mind giving this one to me? This is my dataset 😆

akhdanfadh commented 3 months ago

Sure! @mrqorib

mrqorib commented 3 months ago

@akhdanfadh Thanks! 😊