SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for BEST #521

Open SamuelCahyawijaya opened 5 months ago

SamuelCahyawijaya commented 5 months ago

Dataloader name: best/best.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?best

Dataset: best
Description: The Benchmark for Enhancing the Standard of Thai language processing (BEST) is a Thai language dataset for word and sentence segmentation. The task is to determine where running text should be split into separate words, which is non-trivial in Thai because the script does not put spaces between words.
Subsets: BEST 2010
Languages: tha
Tasks: Named Entity Recognition, Statement Tagging
License: Creative Commons Attribution Non Commercial Share Alike 3.0 (cc-by-nc-sa-3.0)
Homepage: https://github.com/korakot/corpus/tree/main/BEST
HF URL: -
Paper URL: http://pioneer.chula.ac.th/~awirote/ling/snlp2007-wirote.pdf
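
For whoever picks this up, here is a rough sketch of how segmented text could be turned into character-level boundary labels for the segmentation task described above. It assumes word boundaries in the source files are marked with a `|` delimiter; please verify that against the actual files on the homepage. The function name `boundary_labels` is hypothetical and not part of any existing SEACrowd code.

```python
from typing import List, Tuple


def boundary_labels(segmented: str, delimiter: str = "|") -> Tuple[str, List[int]]:
    """Convert delimiter-segmented text into raw text plus per-character labels.

    A label of 1 means a word starts at that character, 0 means it continues.
    Assumption: `delimiter` marks word boundaries in the source text.
    """
    chars: List[str] = []
    labels: List[int] = []
    for word in segmented.split(delimiter):
        if not word:
            # Skip empty pieces from leading, trailing, or doubled delimiters.
            continue
        for i, ch in enumerate(word):
            chars.append(ch)
            labels.append(1 if i == 0 else 0)
    return "".join(chars), labels


# Example: three Thai words separated by the assumed delimiter.
text, labels = boundary_labels("ฉัน|กิน|ข้าว|")
print(text)
print(labels)
```

Labels are assigned per Unicode code point, so Thai vowel and tone marks get their own 0 label. A real loader may prefer a different granularity or tagging scheme.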