SamuelCahyawijaya commented 9 months ago

Dataloader name: cc3m_35l/cc3m_35l.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?cc3m_35l

| Field | Value |
| --- | --- |
| Dataset | cc3m_35l |
| Description | CC3M-35L is created by translating Conceptual Captions 3M (Sharma et al., 2018), originally in English, to the other 34 languages using Google’s machine translation API. |
| Subsets | fil, ind, tha, vie |
| Languages | fil, ind, tha, vie |
| Tasks | Image-to-Text Generation |
| License | Creative Commons Attribution 4.0 (cc-by-4.0) |
| Homepage | CC3M-35L: https://google.github.io/crossmodal-3600/; CC3M: https://ai.google.com/research/ConceptualCaptions/download |
| HF URL | - |
| Paper URL | https://aclanthology.org/2022.emnlp-main.45/ |

IvanHalimP commented 9 months ago

self-assign

IvanHalimP commented 9 months ago

Uhm, this dataset is huge. Without parallelization, it takes days to download all the images locally and load them. I've been running the loader since December 3rd, and it still had not finished by the time I submitted this comment. I'm currently trying to speed it up with parallelization, but I'm not sure how much that will help. Are you sure that all of the images need to be downloaded locally before the dataset can be used?
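
For reference, the kind of parallelization meant here could look roughly like the sketch below. It is not the actual loader code: `fetch_bytes`, `download_all`, the `.img` suffix, and the worker count are all hypothetical names and values chosen for illustration, and it assumes the image URLs are already available as a plain list. Downloads are I/O-bound, so a thread pool is used, and failed URLs are collected instead of aborting the run.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from urllib.request import urlopen
import hashlib


def fetch_bytes(url, timeout=10):
    """Download one URL and return its raw bytes (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_all(urls, out_dir, fetch=fetch_bytes, max_workers=16):
    """Download URLs concurrently; return ({url: path}, {url: error})."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    def job(url):
        # Hash the URL for a stable file name, so a rerun skips finished files.
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".img"
        path = out_dir / name
        if not path.exists():
            # fetch() runs before the file is created, so a failed
            # download does not leave an empty file behind.
            path.write_bytes(fetch(url))
        return path

    done, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(job, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                done[url] = fut.result()
            except Exception as exc:
                # One unreachable image should not stop the whole run.
                failed[url] = exc
    return done, failed
```

Because existing files are skipped, an interrupted run can be restarted without downloading everything again. Whether this helps enough depends on the remote hosts and on bandwidth, so the worker count would need tuning.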

SamuelCahyawijaya commented 8 months ago

Hi @IvanHalimP, sorry for the late reply. I am testing the dataloader right now and, as you mentioned, it takes some time to generate the dataset. I will check whether there is a way to speed up the process, and I'll push some updates on this later this week.

SEACrowd / seacrowd-datahub

Create dataset loader for CC3M-35L #77
