IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0
261 stars 62 forks source link

Create dataset loader for CORD #180

Open SamuelCahyawijaya opened 2 years ago

SamuelCahyawijaya commented 2 years ago

NusaCatalogue: https://indonlp.github.io/nusa-catalogue/card.html?cord_v2

Dataset cord_v2
Description In this paper, we introduce a novel dataset called CORD, which stands for a Consolidated Receipt Dataset for post-OCR parsing. To the best of our knowledge, this is the first publicly available dataset which includes both box-level text and parsing class annotations. The parsing class labels are provided in two-levels. The eight superclasses include store, payment, menu, subtotal, and total. The eight superclasses are subdivided into 54 subclasses e.g., store has nine subclasses including name, address, telephone, and fax.
Furthermore, it also provides line annotations for the serialization task which is a newly emerging problem as a combination of the two tasks.
License CC-BY 4.0