IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0

Create dataset loader for ICON Indonesian Constituency Treebank #368

Open SamuelCahyawijaya opened 1 year ago

SamuelCahyawijaya commented 1 year ago

NusaCatalogue: https://indonlp.github.io/nusa-catalogue/card.html?icon

Dataset: icon
Description: In this work, we publish ICON (Indonesian CONstituency treebank), a manually annotated benchmark Indonesian constituency treebank comprising 10,000 sentences with approximately 124,000 constituents and 182,000 tokens, which can support the training of state-of-the-art transformer-based models. We use 15 phrase-level tags and 24 POS tags. The sentences were taken from Wikipedia (3,000 sentences) and news articles (7,000 sentences).
License: CC-BY-SA 4.0
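
As a starting point for the loader, here is a minimal sketch of reading one constituency tree, assuming the treebank is distributed in Penn Treebank-style bracketed notation (the issue does not state the file format, so this is an assumption). The function names and the example sentence with its tags are hypothetical and are not taken from ICON's actual tag set.

```python
from typing import List, Tuple, Union

# A node is (label, children); each child is either another node or a token string.
Node = Tuple[str, List[Union["Node", str]]]


def parse_bracketed(text: str) -> Node:
    """Parse one bracketed tree such as "(S (NP (NN nasi)))" into nested tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def peek() -> str:
        if pos >= len(tokens):
            raise ValueError("unbalanced brackets: unexpected end of input")
        return tokens[pos]

    def parse_node() -> Node:
        nonlocal pos
        if peek() != "(":
            raise ValueError(f"expected '(' but found {peek()!r}")
        pos += 1
        label = peek()
        pos += 1
        children: List[Union[Node, str]] = []
        while peek() != ")":
            if peek() == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("trailing content after the tree")
    return tree


def leaves(node: Node) -> List[str]:
    """Return the sentence tokens in left-to-right order."""
    out: List[str] = []
    for child in node[1]:
        out.extend([child] if isinstance(child, str) else leaves(child))
    return out


def count_phrases(node: Node) -> int:
    """Count phrase-level nodes, i.e. nodes with at least one subtree as a child."""
    subtrees = [c for c in node[1] if not isinstance(c, str)]
    return (1 if subtrees else 0) + sum(count_phrases(c) for c in subtrees)


# Hypothetical example; the tags are illustrative only.
example = "(S (NP (PRP Saya)) (VP (VB makan) (NP (NN nasi))))"
tree = parse_bracketed(example)
print(leaves(tree))         # ['Saya', 'makan', 'nasi']
print(count_phrases(tree))  # 4
```

A loader could yield the raw bracketed string alongside the token list from `leaves`, leaving any richer tree schema to be decided once the actual file format is confirmed.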