SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for AIFORTHAI - LotusCorpus #449

Closed: SamuelCahyawijaya closed this 5 months ago

SamuelCahyawijaya commented 7 months ago

Dataloader name: tha_lotus/tha_lotus.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?tha_lotus

Dataset: tha_lotus
Description: The Large vOcabulary Thai continUous Speech recognition (LOTUS) corpus was designed for developing large vocabulary continuous speech recognition (LVCSR), spoken dialogue systems, speech dictation, and broadcast news transcribers. It contains two datasets: one for training an acoustic model and another for training a language model.
Subsets: -
Languages: tha
Tasks: Automatic Speech Recognition
License: Creative Commons Attribution Non Commercial Share Alike 3.0 (cc-by-nc-sa-3.0)
Homepage: https://github.com/korakot/corpus/releases/download/v1.0/AIFORTHAI-LotusCorpus.zip
HF URL: -
Paper URL: https://doi.org/10.1109/ICSDA.2009.5278377
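
For anyone picking this up, the card fields above could be captured in code roughly as follows. This is only a sketch: `DatasetCard`, `THA_LOTUS`, and `loader_path` are hypothetical names made up for illustration and are not part of the seacrowd-datahub API. The values are copied from the card as filed; a `-` in the card is represented as `None` or an empty list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetCard:
    """Hypothetical container for the card fields (not the SEACrowd API)."""

    name: str
    languages: List[str]
    tasks: List[str]
    license: str
    homepage: str
    paper_url: Optional[str] = None
    hf_url: Optional[str] = None  # "-" in the card: no Hugging Face URL
    subsets: List[str] = field(default_factory=list)  # "-" in the card: no subsets

    def loader_path(self) -> str:
        # Mirrors the "Dataloader name" pattern <name>/<name>.py used in this issue.
        return f"{self.name}/{self.name}.py"


# Values copied from the card as filed in this issue.
THA_LOTUS = DatasetCard(
    name="tha_lotus",
    languages=["tha"],
    tasks=["Automatic Speech Recognition"],
    license="cc-by-nc-sa-3.0",
    homepage="https://github.com/korakot/corpus/releases/download/v1.0/AIFORTHAI-LotusCorpus.zip",
    paper_url="https://doi.org/10.1109/ICSDA.2009.5278377",
)

print(THA_LOTUS.loader_path())  # prints "tha_lotus/tha_lotus.py"
```

The actual dataloader would follow the repository's own template; this only shows how the metadata fits together.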

djanibekov commented 7 months ago

self-assign

bp-high commented 7 months ago

self-assign

sabilmakbar commented 6 months ago

self-assign

sabilmakbar commented 6 months ago

Hi @holylovenia, can the following link be used as the dataset's homepage?

https://github.com/korakot/corpus/tree/main/LOTUS

Or shall we add the homepage of AI for Thai instead?

holylovenia commented 6 months ago

https://github.com/korakot/corpus/tree/main/LOTUS

@sabilmakbar I think this should be fine as our focus is on the dataset rather than the organization.

sabilmakbar commented 5 months ago

Hi, sorry, I'm still implementing this. I need a bit more time, as this dataset is ~70% done (I also found some additional complexity in the dataloader).