SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for COCO-35L #78

Closed: SamuelCahyawijaya closed this issue 10 months ago

SamuelCahyawijaya commented 12 months ago

Dataloader name: coco_35l/coco_35l.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?coco_35l

| Field | Value |
| --- | --- |
| Dataset | coco_35l |
| Description | COCO-35L is a machine-generated image caption dataset, constructed by translating COCO Captions (Chen et al., 2015) to the other 34 languages using Google’s machine translation API. |
| Subsets | fil, ind, tha, vie |
| Languages | fil, ind, tha, vie |
| Tasks | Image-to-Text Generation |
| License | Creative Commons Attribution 4.0 (cc-by-4.0) |
| Homepage | https://google.github.io/crossmodal-3600/ |
| HF URL | - |
| Paper URL | https://aclanthology.org/2022.emnlp-main.45/ |

IvanHalimP commented 12 months ago

self-assign

IvanHalimP commented 11 months ago

I am using the COCO 2014 train and validation sets. 152520 image IDs are not found in the COCO 2014 training captions. The validation set is OK.
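
A minimal sketch of how a check like this could be reproduced, assuming the annotation file follows the standard COCO captions JSON layout (a top-level "images" list whose entries carry an integer "id"). The helper names and `wanted_ids` are hypothetical; how the COCO-35L image IDs are extracted is not shown here, and `captions_train2014.json` is only assumed to be the file in question.

```python
import json
from typing import Iterable, Set


def load_annotation(path: str) -> dict:
    """Load a COCO-style annotation JSON file from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def coco_image_ids(annotation: dict) -> Set[int]:
    """Collect the image IDs listed under the "images" key."""
    return {int(img["id"]) for img in annotation.get("images", [])}


def find_missing_ids(wanted_ids: Iterable[int], annotation: dict) -> Set[int]:
    """Return the IDs in wanted_ids that the annotation does not contain."""
    return {int(i) for i in wanted_ids} - coco_image_ids(annotation)


# Toy data standing in for a real annotation file (hypothetical values).
toy_annotation = {
    "images": [{"id": 1}, {"id": 2}],
    "annotations": [{"image_id": 1, "id": 10, "caption": "a cat"}],
}
missing = find_missing_ids([1, 2, 3], toy_annotation)
print(sorted(missing))
```

With real data, one would call `load_annotation("annotations/captions_train2014.json")` (path assumed) and pass the COCO-35L image IDs as `wanted_ids`. Running the same check against the validation annotations would show which IDs belong to which split.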