ManifoldRG / NEKO

In Progress Implementation of GATO style Generalist Multimodal model capable of image, text, RL and Robotics tasks
https://discord.gg/brsPnzNd8h
GNU General Public License v3.0

Evaluation of datasets used for flamingo #31

Closed BobakBagheri closed 5 months ago

BobakBagheri commented 7 months ago

Issue #10 discusses existing open source datasets (Conceptual Captions, OK-VQA, and VQAv2) and starts the work of estimating the cost of using these datasets for NEKO. Out of that conversation came the following question from Daniel. We want to explore these datasets as well, as part of our investigation into various datasets and their use for NEKO.

Question, what do you think of the datasets used for flamingo, e.g. LAION and Multimodal C4 https://laion.ai/blog/open-flamingo/ ?

snat-s commented 5 months ago

I have looked into Flamingo and I have added the relevant datasets to the survey sheet.

They used three different datasets:

In conclusion, the datasets originally used can be replaced by their open source alternatives (see OpenFlamingo and IDEFICS). I prefer OBELICS over MMC4 because it is more diverse. For ALIGN, the LAION2B dataset could be a good replacement because it is comparable in size and it is the successor of LAION400M.

Both IDEFICS and OpenFlamingo were trained on LAION.

This graph from the OBELICS paper is interesting because it shows the importance of M3W (in this case, its replacement OBELICS):

image

For a more in-depth discussion, check out issue #58.