SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for Burapha-TH #105

Status: Closed (SamuelCahyawijaya closed this issue 8 months ago)

SamuelCahyawijaya commented 9 months ago

Dataloader name: burapha_th/burapha_th.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?burapha_th

| Field | Value |
| --- | --- |
| Dataset | burapha_th |
| Description | The dataset has 68 character classes, 10 digit classes, and 320 syllable classes. To construct it, 1,072 native Thai speakers wrote on collection datasheets, which were then digitized with a 300 dpi scanner. De-skewing, box detection, and segmentation algorithms were applied to the raw scans to extract the images. Unlike all other known Thai handwriting datasets, this dataset retains the existing noise, the white background, and all artifacts generated by scanning. |
| Subsets | character, digit, syllable |
| Languages | tha |
| Tasks | Optical Character Recognition |
| License | Unknown (unknown) |
| Homepage | https://services.informatics.buu.ac.th/datasets/Burapha-TH/ |
| HF URL | - |
| Paper URL | https://www.mdpi.com/2076-3417/12/8/4083 |

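
As a rough starting point for whoever picks this up, the three subsets and their class counts from the description could be captured as plain data. This is only a sketch. `SubsetInfo`, `config_names`, `total_classes`, and the `burapha_th_<subset>` naming are hypothetical and are not taken from the SEACrowd codebase; the actual loader should follow the project's own template and naming conventions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SubsetInfo:
    """Hypothetical record describing one Burapha-TH subset."""
    name: str
    num_classes: int


# Subset names and class counts come from the dataset description in this issue.
SUBSETS = (
    SubsetInfo("character", 68),
    SubsetInfo("digit", 10),
    SubsetInfo("syllable", 320),
)


def config_names(dataset_name: str = "burapha_th") -> List[str]:
    """Build illustrative per-subset names (assumed scheme, not SEACrowd's)."""
    return [f"{dataset_name}_{subset.name}" for subset in SUBSETS]


def total_classes() -> int:
    """Sum the class counts across all three subsets."""
    return sum(subset.num_classes for subset in SUBSETS)


if __name__ == "__main__":
    for name in config_names():
        print(name)
    print("total classes:", total_classes())
```

Keeping the subsets in one tuple means a loader could iterate over them to build one configuration per subset instead of repeating the same boilerplate three times.
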
IvanHalimP commented 9 months ago

self-assign