SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for LEXiTRON #614

Status: Closed (SamuelCahyawijaya closed 1 month ago)

SamuelCahyawijaya commented 3 months ago

Dataloader name: lexitron/lexitron.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?lexitron

Dataset: lexitron
Description: A corpus-based dictionary of Thai and English. The dataset contains frequently used words drawn from trusted publications such as novels, academic documents, and newspapers. The dataset link provides both Thai-English and English-Thai lexicons. Thai-English entries consist of the word, its part of speech, translation, synonyms, and sample sentences. The Thai -> English list has 53,000 words and the English -> Thai list has 83,000 words. See more details at http://lexitron.nectec.or.th.
Subsets: version 2.0
Languages: tha, eng
Tasks: Word-level Translation, Machine Translation
License: Custom NECTEC license
Homepage: https://opend-portal.nectec.or.th/dataset/lexitron-2-0
HF URL: -
Paper URL: -
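
For anyone picking this up, here is a rough sketch of how one lexicon entry could be held in memory and turned into a word-level translation pair. It is only an illustration built from the fields named in the description (word, part of speech, translation, synonyms, sample sentences). The class name, field names, and output keys are all assumptions, not the actual LEXiTRON file format or the final loader schema, and the sample word is not taken from the dataset.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LexiconEntry:
    """Hypothetical in-memory form of one entry; field names are assumptions."""
    word: str
    pos: str
    translation: str
    synonyms: List[str] = field(default_factory=list)
    samples: List[str] = field(default_factory=list)


def to_translation_pair(entry: LexiconEntry, idx: int,
                        src_lang: str, tgt_lang: str) -> Dict[str, str]:
    """Map an entry to a flat source/target record (key names are assumptions)."""
    return {
        "id": str(idx),
        "text_1": entry.word,
        "text_2": entry.translation,
        "text_1_name": src_lang,
        "text_2_name": tgt_lang,
    }


# Illustrative entry only, not copied from the dataset.
example = LexiconEntry(word="cat", pos="N", translation="แมว")
pair = to_translation_pair(example, 0, "eng", "tha")
```

The same function could serve both directions by swapping `src_lang` and `tgt_lang`, which matches the two lexicons (Thai -> English and English -> Thai) mentioned above.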
muhammadravi251001 commented 2 months ago

self-assign