SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for COCO-35L #78

Closed SamuelCahyawijaya closed 8 months ago

SamuelCahyawijaya commented 9 months ago

Dataloader name: coco_35l/coco_35l.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?coco_35l

| Dataset | coco_35l |
|---|---|
| Description | COCO-35L is a machine-generated image caption dataset, constructed by translating COCO Captions (Chen et al., 2015) to the other 34 languages using Google’s machine translation API. |
| Subsets | fil, ind, tha, vie |
| Languages | fil, ind, tha, vie |
| Tasks | Image-to-Text Generation |
| License | Creative Commons Attribution 4.0 (cc-by-4.0) |
| Homepage | https://google.github.io/crossmodal-3600/ |
| HF URL | - |
| Paper URL | https://aclanthology.org/2022.emnlp-main.45/ |

IvanHalimP commented 9 months ago

self-assign

IvanHalimP commented 9 months ago

152520 image IDs are not found in the COCO 2014 training captions. The validation set is OK. Using the COCO 2014 train and validation sets.
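
For anyone wanting to reproduce a check like the one above, here is a minimal sketch. It is not the code used in this issue. It assumes the standard COCO captions JSON layout (a top-level "images" list whose entries carry an "id" field), and the function names `load_coco_image_ids` and `find_missing_ids` are hypothetical helpers introduced only for illustration. How the image IDs are read out of the COCO-35L files is left to the caller, since that format is not described in this thread.

```python
import json
from typing import Iterable, Set


def load_coco_image_ids(caption_json_path: str) -> Set[int]:
    """Collect image IDs from a COCO-style captions file.

    Assumes the usual COCO layout: a top-level "images" list
    whose entries each have an integer "id".
    """
    with open(caption_json_path, encoding="utf-8") as f:
        data = json.load(f)
    return {int(image["id"]) for image in data["images"]}


def find_missing_ids(dataset_ids: Iterable[int], coco_ids: Set[int]) -> Set[int]:
    """Return IDs referenced by the translated dataset but absent from the COCO split."""
    return set(dataset_ids) - coco_ids
```

Running `find_missing_ids` once against the train captions file and once against the validation captions file would show which split, if any, covers the IDs in question.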