SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for VlogQA #621

Closed SamuelCahyawijaya closed 1 month ago

SamuelCahyawijaya commented 3 months ago

Dataloader name: vlogqa/vlogqa.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?vlogqa

Dataset: vlogqa
Description: VlogQA is a Vietnamese spoken language corpus for machine reading comprehension. It consists of 10,076 question-answer pairs based on 1,230 transcript documents sourced from YouTube videos around food and travel.
Subsets: -
Languages: vie
Tasks: Question Answering
License: Other (other)
Homepage: https://github.com/sonlam1102/vlogqa/tree/main
HF URL: -
Paper URL: -

akhdanfadh commented 1 month ago

@holylovenia If I may, I want to work on this dataset. But it requires a dataset user agreement. Can I submit on behalf of the SEACrowd organization? I'm also unsure if I can receive the dataset before the dataloader implementation.

holylovenia commented 1 month ago

> @holylovenia If I may, I want to work on this dataset. But it requires a dataset user agreement. Can I submit on behalf of the SEACrowd organization? I'm also unsure if I can receive the dataset before the dataloader implementation.

Sure @akhdanfadh, you can try submitting the user agreement first. Then we can discuss what to do if you would only receive the dataset after the dataloader implementation.

akhdanfadh commented 1 month ago

I just received the dataset, working on it now.

akhdanfadh commented 1 month ago

self-assign