This dataset collects all 353 books from the Thai National Historical Corpus 2 (TNHC2). The text has been cleaned for use in model pretraining and NLP tasks. TNHC2 is a corpus of old Thai books, all of which are out of copyright under Thai law (copyright expires 50 years after the author's death). More information on the corpus: https://www.arts.chula.ac.th/chulaseal/tnhc2/
Dataloader name: thai_tnhc2_books/thai_tnhc2_books.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?thai_tnhc2_books