SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for ProSub #683

Open SamuelCahyawijaya opened 1 month ago

SamuelCahyawijaya commented 1 month ago

Dataloader name: prosub/prosub.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?prosub

Dataset: prosub
Description: ProSub is a collection of datasets and corpus annotations dealing with pronoun substitutes and related linguistic categories (personal pronouns, honorific titles, address terms). Pronoun substitutes are non-pronominal expressions (e.g. 'mother', 'aunt', 'teacher') used to refer to the speaker and the addressee, thus functioning like 1st and 2nd person personal pronouns. Pronoun substitutes are very common in the languages of SEA, Japan and Korea, but extremely limited elsewhere. The Common subset is based on a common questionnaire. It records whether a given concept (e.g. 'child') can be used as a 1st person expression, a 2nd person expression, a title, or an address term. Where a use exists, example sentences are also given. The Annotations subset contains annotations of 1st and 2nd person expressions, including both personal pronouns and pronoun substitutes, and of address terms. The corpora used differ from language to language, but the annotation scheme is the same across languages.
Subsets: Common, Annotations
Languages: zsm, ind, jav, tha, vie, mya
Tasks: Word Sense Disambiguation, Word lists, Semantic Role Labeling, Machine Translation
License: Creative Commons Attribution 4.0 (cc-by-4.0)
Homepage: https://github.com/matbahasa/ProSub
HF URL: -
Paper URL: https://www.anlp.jp/proceedings/annual_meeting/2023/pdf_dir/P9-4.pdf
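
As a starting point for whoever picks this up, here is a rough sketch of how one entry of the Common subset, as described in the card, might be modelled before it is mapped to a loader schema. It is only an illustration: the class `CommonEntry`, its field names, the usage keys and the `subset_ids` naming pattern are all hypothetical and are not taken from the ProSub repository or from the seacrowd-datahub API. It also assumes every listed language appears in both subsets, which the card does not state.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Values copied from the dataset card.
DATASET_NAME = "prosub"
SUBSETS = ["Common", "Annotations"]
LANGUAGES = ["zsm", "ind", "jav", "tha", "vie", "mya"]

# The four uses the Common questionnaire asks about (key names are hypothetical).
USAGES = ("first_person", "second_person", "title", "address_term")


@dataclass
class CommonEntry:
    """Hypothetical record: one concept in one language of the Common subset."""

    language: str
    concept: str
    # usage key -> whether the concept can be used that way
    attested: Dict[str, bool] = field(default_factory=dict)
    # usage key -> example sentences, expected only for attested uses
    examples: Dict[str, List[str]] = field(default_factory=dict)

    def validate(self) -> None:
        """Raise ValueError if the entry contradicts the card's description."""
        if self.language not in LANGUAGES:
            raise ValueError(f"unknown language code: {self.language}")
        for usage in list(self.attested) + list(self.examples):
            if usage not in USAGES:
                raise ValueError(f"unknown usage: {usage}")
        for usage, sentences in self.examples.items():
            # The card says examples are given only where the use exists.
            if sentences and not self.attested.get(usage, False):
                raise ValueError(f"examples given for unattested usage: {usage}")


def subset_ids() -> List[str]:
    """Enumerate hypothetical '<dataset>_<subset>_<language>' identifiers."""
    return [
        f"{DATASET_NAME}_{subset.lower()}_{lang}"
        for subset in SUBSETS
        for lang in LANGUAGES
    ]
```

For example, `CommonEntry(language="ind", concept="child", attested={"second_person": True}, examples={"second_person": ["<example sentence>"]}).validate()` passes, while the same entry with `attested={"second_person": False}` raises `ValueError`. The actual column layout should be checked against the files on the homepage before settling on a schema.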