SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for SEAME #517

Open SamuelCahyawijaya opened 5 months ago

SamuelCahyawijaya commented 5 months ago

Dataloader name: seame/seame.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?seame

Dataset: seame
Description: In Singapore and Malaysia, people often speak a mixture of Mandarin and English within a single sentence. We call such sentences intra-sentential code-switch sentences. SEAME is a Mandarin-English codeswitching spontaneous speech corpus.
Subsets: train, dev_sge, dev_man
Languages: cmn, eng
Tasks: Speech-to-Text Translation, Automatic Speech Recognition, Language Identification
License: Unknown (unknown)
Homepage: https://github.com/zengzp0912/SEAME-dev-set
HF URL: -
Paper URL: https://www.researchgate.net/profile/Tien-Ping-Tan/publication/221481268_Mandarin-English_code-switching_speech_corpus_in_South-East_Asia_SEAME/links/54cb12f80cf2517b7560ffbd/Mandarin-English-code-switching-speech-corpus-in-South-East-Asia-SEAME.pdf

akhdanfadh commented 5 months ago

A heads-up: the dataset is here, but I think you need to pay for it(?), cmiiw. The GitHub repo only contains the list of audio files.
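
If the audio does have to be obtained separately, the dataloader would probably need to read from a user-supplied local directory rather than download anything. Below is a minimal sketch of that idea, not the actual SEACrowd loader API: `SeameLocalConfig`, `resolve_subset_dir`, and the `<data_dir>/<subset>` layout are all hypothetical names and assumptions made for illustration. Only the dataset name, subsets, and language codes come from the table above.

```python
from dataclasses import dataclass
from pathlib import Path

# Values copied from the issue's metadata table.
DATASET_NAME = "seame"
SUBSETS = ("train", "dev_sge", "dev_man")
LANGUAGES = ("cmn", "eng")


@dataclass
class SeameLocalConfig:
    """Hypothetical config pointing at a locally obtained copy of SEAME."""

    data_dir: Path
    subset: str

    def __post_init__(self):
        # Accept plain strings for convenience.
        self.data_dir = Path(self.data_dir)
        if self.subset not in SUBSETS:
            raise ValueError(
                f"unknown subset {self.subset!r}; expected one of {SUBSETS}"
            )


def resolve_subset_dir(config: SeameLocalConfig) -> Path:
    """Return the directory for the requested subset, failing early if absent.

    The <data_dir>/<subset> layout is an assumption, not the corpus's real layout.
    """
    subset_dir = config.data_dir / config.subset
    if not subset_dir.is_dir():
        raise FileNotFoundError(
            f"{subset_dir} not found; SEAME may need to be obtained separately "
            "and passed in as a local directory"
        )
    return subset_dir
```

Failing early with a clear message seems preferable here, since users without access to the corpus would otherwise hit an opaque error deeper in the loader.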