SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.

Apache License 2.0

Create dataset loader for XL-Sum #32

Closed SamuelCahyawijaya closed 8 months ago

SamuelCahyawijaya commented 1 year ago

Dataloader name: xl_sum/xl_sum.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?xl_sum

Dataset	xl_sum
Description	XL-Sum is a comprehensive and diverse dataset comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low to high-resource, including 4 indigenous languages spoken in the Southeast Asia region.
Subsets	XL-Sum Burmese, XL-Sum Indonesian, XL-Sum Thai, XL-Sum Vietnamese
Languages	mya, ind, tha, vie
Tasks	Abstractive Summarization
License	Creative Commons Attribution Share Alike 4.0 (cc-by-sa-4.0)
Homepage	https://github.com/csebuetnlp/xl-sum
HF URL	https://huggingface.co/datasets/csebuetnlp/xlsum
Paper URL	https://aclanthology.org/2021.findings-acl.413/
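
For reference, here is a minimal sketch of how the four requested subsets could be pulled from the HF URL above. It is not the SEACrowd dataloader itself. It assumes the Hugging Face dataset exposes one config per lowercase language name (`burmese`, `indonesian`, `thai`, `vietnamese`). The helper names `hf_config_for` and `load_subset` are hypothetical, and `load_subset` needs the `datasets` package and network access.

```python
# Assumption: the HF dataset names its configs after lowercase language names.
# Keys are the ISO 639-3 codes listed in the "Languages" row of this datacard.
LANG_TO_HF_CONFIG = {
    "mya": "burmese",
    "ind": "indonesian",
    "tha": "thai",
    "vie": "vietnamese",
}

HF_DATASET = "csebuetnlp/xlsum"  # taken from the HF URL in the datacard


def hf_config_for(lang_code: str) -> str:
    """Return the assumed HF config name for a SEA language code (hypothetical helper)."""
    try:
        return LANG_TO_HF_CONFIG[lang_code]
    except KeyError:
        raise ValueError(f"Unsupported language code: {lang_code!r}") from None


def load_subset(lang_code: str, split: str = "train"):
    """Load one language subset (hypothetical helper; downloads data when called)."""
    # Imported lazily so the mapping above can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(HF_DATASET, hf_config_for(lang_code), split=split)
```

Calling `load_subset("ind")` would then fetch the Indonesian training split. Depending on the installed `datasets` version, script-based datasets may need extra arguments, so treat the call as illustrative.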

rmahendra commented 1 year ago

self-assign

github-actions[bot] commented 1 year ago

Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

rmahendra commented 12 months ago

Hi, I'm still willing to work on this issue. However, I am quite busy at the moment. I'll try to open a PR by the end of this month.

holylovenia commented 11 months ago

Okay then, @rmahendra. Feel free to let us know if you need any help!

sabilmakbar commented 9 months ago

Hi @rmahendra, have you got the time to implement this dataloader?

sabilmakbar commented 9 months ago

Btw, we already have xl_sum in SEACrowd, but only for the ID-EN pair. I will extend that script to cover the others.

github-actions[bot] commented 8 months ago

Hi @sabilmakbar, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

holylovenia commented 8 months ago

Adding top-priority and bonus+1 because we need this dataloader for the experiments.

sabilmakbar commented 8 months ago

Hi @holylovenia, this dataset's license is supposed to be CC-BY-NC-SA 4.0 (the datacard is missing the NC part).