IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0

Create dataset loader for Indo_MultiModal_LAION #308

Open SamuelCahyawijaya opened 2 years ago

SamuelCahyawijaya commented 2 years ago

NusaCatalogue: https://indonlp.github.io/nusa-catalogue/card.html?id_mm_laion

Dataset: id_mm_laion
Description: Indo_MultiModal_LAION is a translated subset of the LAION-400M dataset with 70M image-text pairs, intended for vision-language pre-training in Indonesian. LAION-400M is a dataset of 400M English (image, text) pairs, filtered with OpenAI's CLIP by computing the cosine similarity between the text and image embeddings and dropping pairs with a similarity below 0.3. The 0.3 threshold was determined through human evaluations and seemed to be a good heuristic for estimating whether the image and text content match semantically. The image-text pairs were extracted from the Common Crawl web data dump and come from random web pages crawled between 2014 and 2021. More info on LAION-400M: https://laion.ai/blog/laion-400-open-dataset/.
License: From LAION-400M: We distribute the metadata dataset (the parquet files) under the most open Creative Commons CC-BY 4.0 license, which poses no particular restriction. The images are under their copyright.
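
For anyone unfamiliar with the filtering step in the description, here is a minimal sketch of a cosine-similarity filter with the 0.3 cutoff. It is not the LAION pipeline: the embeddings are tiny hand-made vectors rather than real CLIP outputs, and the names `cosine_similarity`, `filter_pairs`, `image_emb`, `text_emb`, and `toy_pairs` are hypothetical, chosen only for illustration.

```python
import math

# Cutoff quoted in the dataset description: pairs below 0.3 are dropped.
SIMILARITY_THRESHOLD = 0.3


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(pairs, threshold=SIMILARITY_THRESHOLD):
    """Keep pairs whose image/text similarity is at least `threshold`."""
    return [
        pair
        for pair in pairs
        if cosine_similarity(pair["image_emb"], pair["text_emb"]) >= threshold
    ]


# Hand-made 2-d vectors standing in for CLIP embeddings (assumption, not real data).
toy_pairs = [
    {"id": "a", "image_emb": [1.0, 0.0], "text_emb": [1.0, 0.0]},  # identical direction
    {"id": "b", "image_emb": [1.0, 0.0], "text_emb": [0.0, 1.0]},  # orthogonal
    {"id": "c", "image_emb": [1.0, 0.0], "text_emb": [1.0, 2.0]},  # about 0.447
    {"id": "d", "image_emb": [1.0, 0.0], "text_emb": [1.0, 4.0]},  # about 0.243
]

kept_ids = [pair["id"] for pair in filter_pairs(toy_pairs)]
print(kept_ids)
```

With these toy vectors, only pairs "a" and "c" reach the threshold. In a real setting the two vectors would come from CLIP's image and text encoders.
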
acul3 commented 2 years ago

self-assign