SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for WEATHub #393

Closed by SamuelCahyawijaya 6 months ago

SamuelCahyawijaya commented 8 months ago

Dataloader name: weathub/weathub.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?weathub

Dataset: weathub
Description: WEATHub is a dataset covering 24 languages. It contains words organized into groups of (target1, target2, attribute1, attribute2) to measure the association target1:target2 :: attribute1:attribute2. For example, target1 can be insects and target2 can be flowers, and the test measures whether insects or flowers are associated with pleasant or unpleasant attributes. Word associations are quantified using the WEAT metric from their paper, which calculates an effect size (Cohen's d) and a p-value (to assess the statistical significance of the result). In their paper, they apply these tests to word embeddings from language models to study biased associations across different languages.
Subsets: -
Languages: tha, tgl, vie, cmn, eng
Tasks: Word lists
License: Creative Commons Attribution 4.0 (cc-by-4.0)
Homepage: https://huggingface.co/datasets/iamshnoo/WEATHub
HF URL: https://huggingface.co/datasets/iamshnoo/WEATHub
Paper URL: https://aclanthology.org/2023.emnlp-main.981.pdf
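For readers unfamiliar with WEAT, the effect size and p-value mentioned in the description can be sketched as below. This is a minimal illustration of the standard WEAT formulation (differential cosine association, Cohen's d, and an exact permutation test), not the WEATHub authors' code. The function names and the 2-D toy vectors are assumptions made up for the example; real tests would use word embeddings from a language model.

```python
import math
from itertools import combinations
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def association(w, attr1, attr2):
    """How much more similar w is to attribute1 words than to attribute2 words."""
    return mean(cosine(w, a) for a in attr1) - mean(cosine(w, b) for b in attr2)


def weat_effect_size(target1, target2, attr1, attr2):
    """Cohen's d: difference of mean associations over the pooled sample std."""
    s1 = [association(w, attr1, attr2) for w in target1]
    s2 = [association(w, attr1, attr2) for w in target2]
    return (mean(s1) - mean(s2)) / stdev(s1 + s2)


def weat_p_value(target1, target2, attr1, attr2):
    """One-sided exact permutation test over equal-size splits of the targets.

    Returns the fraction of splits whose test statistic strictly exceeds the
    observed one. Enumerating all splits is only feasible for small word lists.
    """
    scores = [association(w, attr1, attr2) for w in target1 + target2]
    total = sum(scores)
    n = len(target1)

    def statistic(indices):
        inside = sum(scores[i] for i in indices)
        return inside - (total - inside)

    observed = statistic(tuple(range(n)))
    splits = list(combinations(range(len(scores)), n))
    return sum(statistic(idx) > observed for idx in splits) / len(splits)


# Made-up 2-D "embeddings" (assumption): first axis ~ pleasant, second ~ unpleasant.
insects = [(0.1, 1.0), (0.2, 0.9)]
flowers = [(1.0, 0.1), (0.9, 0.2)]
pleasant = [(1.0, 0.0)]
unpleasant = [(0.0, 1.0)]

d = weat_effect_size(insects, flowers, pleasant, unpleasant)
p = weat_p_value(insects, flowers, pleasant, unpleasant)
print(f"effect size d = {d:.3f}, p-value = {p:.3f}")
```

With these toy vectors the effect size is negative, meaning insects (target1) are less associated with pleasant (attribute1) than flowers are. The p-value here is one-sided, so it tests whether target1 is more associated with attribute1 than target2 is.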
khelli07 commented 8 months ago

self-assign

github-actions[bot] commented 7 months ago

Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

khelli07 commented 7 months ago

Working on it now

khelli07 commented 7 months ago

What's the schema for this?

holylovenia commented 7 months ago

Hi @khelli07, source-only for word lists.

khelli07 commented 6 months ago

Working on this now

khelli07 commented 6 months ago

Hi, there is currently no "Word lists" member in the seacrowd.utils.constants.Tasks enum. Should I add one or leave _SUPPORTED_TASKS as []?