SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for XM3600 #76

Status: Closed (SamuelCahyawijaya closed 9 months ago)

SamuelCahyawijaya commented 9 months ago

Dataloader name: xm3600/xm3600.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?xm3600

Dataset: xm3600
Description: The Crossmodal-3600 dataset (XM3600 for short) is a geographically diverse set of 3600 images annotated with human-generated reference captions in 36 languages. The images were selected from across the world, covering regions where the languages are spoken, and annotated with captions that are consistent in style across all languages while avoiding annotation artifacts caused by direct translation. The languages covered in the dataset include Filipino, Indonesian, Thai, and Vietnamese.
Subsets: XM3600_fil, XM3600_id, XM3600_th, XM3600_vi
Languages: fil, ind, tha, vie
Tasks: Image-to-Text Generation
License: Creative Commons Attribution 4.0 (cc-by-4.0)
Homepage: https://google.github.io/crossmodal-3600/
HF URL: https://huggingface.co/datasets/dinhanhx/crossmodal-3600
Paper URL: https://aclanthology.org/2022.emnlp-main.45/
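
For whoever picks this up, here is a minimal sketch of how the subset names above might pair with the listed language codes. The pairing is an assumption based on the order in which the two rows are listed, and `LANG_TO_SUBSET` and `subset_for` are hypothetical names for illustration only, not part of the seacrowd-datahub codebase.

```python
# Assumed pairing of the ISO 639-3 codes in the "Languages" row with the
# names in the "Subsets" row, taken in the order both rows are listed.
LANG_TO_SUBSET = {
    "fil": "XM3600_fil",  # Filipino
    "ind": "XM3600_id",   # Indonesian
    "tha": "XM3600_th",   # Thai
    "vie": "XM3600_vi",   # Vietnamese
}


def subset_for(lang_code: str) -> str:
    """Return the subset name for a language code (hypothetical helper)."""
    try:
        return LANG_TO_SUBSET[lang_code]
    except KeyError:
        supported = ", ".join(sorted(LANG_TO_SUBSET))
        raise ValueError(
            f"Unsupported language {lang_code!r}; expected one of: {supported}"
        ) from None
```

Keeping the mapping in one dictionary would let a loader reject the other 32 XM3600 languages, which are outside the SEA subset requested here.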

IvanHalimP commented 9 months ago

self-assign