SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for CC3M-35L #77

Closed SamuelCahyawijaya closed 5 months ago

SamuelCahyawijaya commented 9 months ago

Dataloader name: cc3m_35l/cc3m_35l.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?cc3m_35l

Dataset: cc3m_35l
Description: CC3M-35L is created by translating Conceptual Captions 3M (Sharma et al., 2018), originally in English, to the other 34 languages using Google’s machine translation API.
Subsets: fil, ind, tha, vie
Languages: fil, ind, tha, vie
Tasks: Image-to-Text Generation
License: Creative Commons Attribution 4.0 (cc-by-4.0)
Homepage (CC3M-35L): https://google.github.io/crossmodal-3600/
Homepage (CC3M): https://ai.google.com/research/ConceptualCaptions/download
HF URL: -
Paper URL: https://aclanthology.org/2022.emnlp-main.45/

IvanHalimP commented 9 months ago

self-assign

IvanHalimP commented 9 months ago

Uhm, this dataset is huge. Without parallelization, it takes days to download all the images locally and load them. I've been running the loader since December 3rd, and it still had not finished when I submitted this comment. I'm currently trying to speed it up with parallelization, but I'm not sure how much that will help. Are you sure that all of the images need to be downloaded locally before the dataset can be used?
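
For reference, the kind of parallelization mentioned in this comment could look roughly like the sketch below. It is not the actual cc3m_35l loader code. The names `download_images` and `fetch_bytes`, the `.img` file naming, and the default worker count are all hypothetical, and it assumes the image URLs have already been read from the CC3M metadata. Since the work is network-bound, a thread pool is used rather than processes.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional


def fetch_bytes(url: str, timeout: float = 10.0) -> bytes:
    """Download one URL and return its raw bytes (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_images(
    urls: Iterable[str],
    out_dir: str,
    fetch: Callable[[str], bytes] = fetch_bytes,
    max_workers: int = 32,  # assumed value; tune to the available bandwidth
) -> Dict[str, Optional[str]]:
    """Download URLs concurrently; map each URL to a local path, or None on failure."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    def worker(indexed):
        idx, url = indexed
        target = out / f"{idx:08d}.img"
        if target.exists():
            # Reuse files left by an earlier run so a restart does not start over.
            return url, str(target)
        try:
            target.write_bytes(fetch(url))
            return url, str(target)
        except Exception:
            # Record the failure instead of aborting the whole download.
            return url, None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(worker, enumerate(urls)))
```

Calling `download_images(urls, "images/")` returns a dict in which failed downloads map to `None`, so the loader can skip those examples. Because existing files are reused, an interrupted run can be resumed without re-downloading everything.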

SamuelCahyawijaya commented 8 months ago

Hi @IvanHalimP, sorry for the late reply. I am testing the dataloader right now and, as you mentioned, it takes some time to generate the dataset. I will check whether there is a way to speed up the process and push some updates later this week.