SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for XL-Sum #32

Closed SamuelCahyawijaya closed 6 months ago

SamuelCahyawijaya commented 11 months ago

Dataloader name: xl_sum/xl_sum.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?xl_sum

Dataset xl_sum
Description XL-Sum is a comprehensive and diverse dataset of 1 million professionally annotated article-summary pairs from the BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low- to high-resource, including 4 indigenous languages spoken in the Southeast Asia region.
Subsets XL-Sum Burmese, XL-Sum Indonesian, XL-Sum Thai, XL-Sum Vietnamese
Languages mya, ind, tha, vie
Tasks Abstractive Summarization
License Creative Commons Attribution Share Alike 4.0 (cc-by-sa-4.0)
Homepage https://github.com/csebuetnlp/xl-sum
HF URL https://huggingface.co/datasets/csebuetnlp/xlsum
Paper URL https://aclanthology.org/2021.findings-acl.413/
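
For illustration, the four subsets listed above could be wired into a loader roughly as follows. This is a minimal sketch under stated assumptions, not the SEACrowd implementation: the Hugging Face config names (`burmese`, `indonesian`, `thai`, `vietnamese`), the record fields (`id`, `text`, `summary`), the `_source` / `_seacrowd_t2t` config naming, and all helper names are assumptions made for the example.

```python
from typing import Dict, Iterator, Tuple

# ISO 639-3 code -> assumed Hugging Face config name for csebuetnlp/xlsum.
LANG_TO_HF_CONFIG: Dict[str, str] = {
    "mya": "burmese",
    "ind": "indonesian",
    "tha": "thai",
    "vie": "vietnamese",
}


def config_names(dataset_name: str = "xl_sum") -> Dict[str, Tuple[str, str]]:
    """Return hypothetical (source, text-to-text) config names per language."""
    return {
        lang: (f"{dataset_name}_{lang}_source", f"{dataset_name}_{lang}_seacrowd_t2t")
        for lang in LANG_TO_HF_CONFIG
    }


def to_t2t(example: dict) -> dict:
    """Map one XL-Sum record (assumed fields: id, text, summary) to a text-to-text row."""
    return {
        "id": str(example["id"]),
        "text_1": example["text"],      # full article
        "text_2": example["summary"],   # reference summary
        "text_1_name": "text",
        "text_2_name": "summary",
    }


def iter_examples(lang: str, split: str = "train") -> Iterator[dict]:
    """Stream one language subset; needs the `datasets` package and network access."""
    # Imported lazily so the helpers above can be used offline.
    from datasets import load_dataset

    for example in load_dataset("csebuetnlp/xlsum", LANG_TO_HF_CONFIG[lang], split=split):
        yield to_t2t(example)
```

Calling `iter_examples("tha", split="test")` would yield text-to-text rows for the Thai subset, if the assumed config and field names hold. Depending on the installed `datasets` version, `load_dataset` may need extra arguments for this dataset.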

rmahendra commented 11 months ago

self-assign

github-actions[bot] commented 10 months ago

Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

rmahendra commented 10 months ago

Hi, I'm still willing to work on this issue. However, I am quite busy at this moment. I'll try to PR by the end of this month.

holylovenia commented 9 months ago

Okay then, @rmahendra. Feel free to let us know if you need any help!

sabilmakbar commented 8 months ago

Hi @rmahendra, have you got the time to implement this dataloader?

sabilmakbar commented 7 months ago

Btw, we already have xl_sum in SEACrowd, but only for the ID-EN pair. I will extend that script to cover the others.

github-actions[bot] commented 7 months ago

Hi @sabilmakbar, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

holylovenia commented 7 months ago

Adding top-priority and bonus+1 because we need this dataloader for the experiments.

sabilmakbar commented 7 months ago

Hi @holylovenia, this dataset's license is supposed to be CC-BY-NC-SA 4.0 (this datacard is missing the NC part).