SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for InterBEST-2009 #618

Open SamuelCahyawijaya opened 5 months ago

SamuelCahyawijaya commented 5 months ago

Dataloader name: interbest_2009/interbest_2009.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?interbest_2009

Dataset: interbest_2009
Description: InterBEST-2009 is a publicly available corpus for Thai word segmentation. It contains about five million words from four domains: novels, articles, news, and encyclopedia. The dataset was created by NECTEC for a software contest.
Subsets: -
Languages: tha
Tasks: Word lists
License: Creative Commons Attribution Share Alike 3.0 (cc-by-sa-3.0)
Homepage: http://thailang.nectec.or.th/downloadcenter/indexae01.html?option=com_docman&task=cat_view&gid=40&Itemid=61
HF URL: -
Paper URL: -
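
As a possible starting point for the loader, here is a minimal sketch of the parsing step in plain Python. It rests on assumptions not stated in this issue: that the raw files store one pre-segmented passage per line with words separated by `|`, and that each example should carry its domain (novels, articles, news, or encyclopedia). The names `parse_segmented_line`, `generate_examples`, and the output keys are hypothetical and are not the SEACrowd schema; check the downloaded files and the datahub templates before adapting it.

```python
from typing import Dict, Iterable, Iterator, List

# Assumption: "|" marks word boundaries in the raw corpus files.
# Verify against the actual download before relying on this.
WORD_DELIMITER = "|"


def parse_segmented_line(line: str, delimiter: str = WORD_DELIMITER) -> List[str]:
    """Split one pre-segmented line into word tokens, dropping empty pieces."""
    return [tok for tok in line.rstrip("\n").split(delimiter) if tok != ""]


def generate_examples(lines: Iterable[str], domain: str) -> Iterator[Dict]:
    """Yield one example per non-empty line (hypothetical output layout)."""
    for idx, line in enumerate(lines):
        words = parse_segmented_line(line)
        if not words:
            # Skip blank lines rather than emitting empty examples.
            continue
        yield {
            "id": f"{domain}_{idx}",
            "domain": domain,
            "text": "".join(words),  # unsegmented surface form
            "words": words,          # gold segmentation
        }
```

In a real dataloader, the body of `generate_examples` would likely sit inside the builder's example generator, with `lines` read from each domain's files.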