SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for LR-Sum #359

Status: Closed (SamuelCahyawijaya closed this issue 9 months ago)

SamuelCahyawijaya commented 10 months ago

Dataloader name: lr_sum/lr_sum.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?lr_sum

Dataset: lr_sum
Description: LR-Sum is a news abstractive summarization dataset focused on low-resource languages. It contains human-written summaries for 39 languages and the data is based on the Multilingual Open Text corpus (ultimately derived from the Voice of America website).
Subsets: -
Languages: ind, vie, lao, tha, khm, mya
Tasks: Abstractive Summarization
License: Creative Commons Attribution 4.0 (cc-by-4.0)
Homepage: https://huggingface.co/datasets/bltlab/lr-sum
HF URL: https://huggingface.co/datasets/bltlab/lr-sum
Paper URL: https://aclanthology.org/2023.findings-acl.427/

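
As a rough illustration of what the requested loader might expose, the sketch below enumerates one config name per language and schema from the fields listed in this issue. The helper `config_names`, the schema labels `source` and `seacrowd_t2t`, and the `{dataset}_{lang}_{schema}` naming pattern are assumptions made for this example; they are not confirmed by the issue and may differ from what the merged dataloader uses.

```python
# Hypothetical sketch, not code from seacrowd-datahub.
# Dataset name and language codes are copied from the issue fields above.
DATASET_NAME = "lr_sum"
LANGUAGES = ["ind", "vie", "lao", "tha", "khm", "mya"]

# Assumed schema labels: one raw "source" view and one text-to-text view,
# which would suit an abstractive summarization task.
SCHEMAS = ["source", "seacrowd_t2t"]


def config_names(dataset=DATASET_NAME, languages=LANGUAGES, schemas=SCHEMAS):
    """Return one config name per (language, schema) pair, language-major order."""
    return [
        f"{dataset}_{lang}_{schema}"
        for lang in languages
        for schema in schemas
    ]


if __name__ == "__main__":
    for name in config_names():
        print(name)
```

With six languages and two assumed schemas, this yields twelve names, starting with `lr_sum_ind_source`.
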
elyanah-aco commented 10 months ago

self-assign