SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for thai-tnhc2-books #619

Closed: SamuelCahyawijaya closed this issue 2 months ago

SamuelCahyawijaya commented 3 months ago

Dataloader name: thai_tnhc2_books/thai_tnhc2_books.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?thai_tnhc2_books

Dataset: thai_tnhc2_books
Description: This dataset collects all 353 books from the Thai National Historical Corpus 2 (TNHC2). The text has been cleaned for use in model pretraining and other NLP tasks. TNHC2 is a corpus of old Thai books, all of which are out of copyright under Thai law (copyright expires 50 years after the author's death). More information on the corpus is available at https://www.arts.chula.ac.th/chulaseal/tnhc2/.
Subsets: -
Languages: tha
Tasks: Language Modeling
License: Creative Commons Zero v1.0 Universal (cc0-1.0)
Homepage: https://www.arts.chula.ac.th/chulaseal/tnhc2/
HF URL: https://huggingface.co/datasets/pythainlp/thai-tnhc2-books
Paper URL: -
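
For whoever picks this up, here is a minimal sketch of the example-generation step for a language modeling loader. It yields `(key, example)` pairs, which is the shape a Hugging Face `datasets` builder's `_generate_examples` is expected to produce. The function name `to_ssp_examples`, the input column name `text`, and the `id`/`text` output layout are assumptions for illustration only; check the actual columns on the HF dataset page and the project's schema before relying on any of them.

```python
from typing import Dict, Iterable, Iterator, Tuple


def to_ssp_examples(
    rows: Iterable[Dict[str, str]],
    text_field: str = "text",  # assumed column name, not confirmed in this issue
) -> Iterator[Tuple[int, Dict[str, str]]]:
    """Yield (key, example) pairs, one per non-empty book."""
    for idx, row in enumerate(rows):
        text = row.get(text_field) or ""
        if not text.strip():
            # Skip rows with no usable text.
            continue
        # Keep the original row index as both the key and the string id.
        yield idx, {"id": str(idx), "text": text}


# Toy rows standing in for the real corpus (hypothetical content).
sample_rows = [
    {"text": "book one"},
    {"text": "   "},
    {"text": "book three"},
]
examples = list(to_ssp_examples(sample_rows))
```

In a real loader, the `rows` iterable would come from the downloaded data files rather than an in-memory list, and the same generator body would sit inside the builder's `_generate_examples` method.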
patrickamadeus commented 2 months ago

self-assign