SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for thai-romanization #620

SamuelCahyawijaya closed this issue 4 months ago

SamuelCahyawijaya commented 5 months ago

Dataloader name: thai_romanization/thai_romanization.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?thai_romanization

Dataset: thai_romanization
Description: The Thai Romanization dataset contains 648,241 Thai words transliterated into English, making Thai pronunciation easier for non-native speakers. It is a valuable resource for Thai language learners and for researchers working on Thai language processing tasks. Each word is paired with its English phonetic representation, which gives pronunciation guidance to readers who are unfamiliar with the Thai script. The dataset improves the accessibility and usability of Thai language resources and supports applications such as speech recognition, text-to-speech synthesis, and machine translation, as well as tools for Thai learners, tourists, and anyone interested in Thai culture and language.
Subsets: -
Languages: tha, eng
Tasks: Word lists
License: Creative Commons Attribution Share Alike 3.0 (cc-by-sa-3.0)
Homepage: https://www.kaggle.com/datasets/wannaphong/thai-romanization/data
HF URL: -
Paper URL: -
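
For whoever picks this up, here is a rough sketch of how the word pairs might be read into examples. It is not the SEACrowd loader template, and it makes several assumptions that need checking against the actual Kaggle download: that the data is a CSV file, that the columns are named `word` and `romanization` (hypothetical names), and that rows with an empty field should be skipped. The two sample rows are illustrative only and are not taken from the dataset.

```python
import csv
import io
from typing import Dict, Iterator, TextIO, Tuple

# Hypothetical column names; the real file may use different headers.
THAI_COL = "word"
ROMAN_COL = "romanization"


def generate_examples(handle: TextIO) -> Iterator[Tuple[int, Dict[str, str]]]:
    """Yield (index, example) pairs from a CSV of Thai words and romanizations."""
    reader = csv.DictReader(handle)
    idx = 0
    for row in reader:
        thai = (row.get(THAI_COL) or "").strip()
        roman = (row.get(ROMAN_COL) or "").strip()
        # Assumption: incomplete rows are dropped rather than raising an error.
        if not thai or not roman:
            continue
        yield idx, {"id": str(idx), "tha": thai, "eng": roman}
        idx += 1


# Illustrative rows only (not from the dataset); the last row is incomplete.
SAMPLE = "word,romanization\nสวัสดี,sawatdi\nขอบคุณ,khopkhun\n,missing\n"
examples = list(generate_examples(io.StringIO(SAMPLE)))
```

The `tha` and `eng` keys mirror the language codes listed above. The real loader would read from the downloaded file instead of an in-memory string.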
muhammadravi251001 commented 5 months ago

self-assign