Evaluation of datasets used for Flamingo

I have looked into Flamingo and I have added the relevant datasets to the survey sheet.

They used three different datasets:

ALIGN: A dataset of 1.8B image-text pairs. The open dataset closest to it in size and scope is LAION2B.
LTIP: A dataset of 300 million images that they collected themselves. Something like COYO-300M comes close: the COYO labels are of good quality, and its collection approach resembles how the LTIP dataset was built.
M3W: An interleaved image and text dataset. The alternatives for this dataset are OBELICS and MMC4. OBELICS is more diverse, has a lot more data, and its documents are longer. MMC4 has a core version that would be good for some test runs.

In conclusion, the datasets originally used can be replaced by their open-source alternatives (see OpenFlamingo and IDEFICS). I prefer OBELICS over MMC4 because it is more diverse. For ALIGN, LAION2B could be a good replacement because it is close in size and is the successor of LAION400M.

Both IDEFICS and OpenFlamingo were trained on LAION.

This graph from the OBELICS paper is interesting because it shows the importance of M3W (in this case its replacement, OBELICS):

For a more in-depth discussion, check out issue #58.

ManifoldRG / NEKO

Evaluation of datasets used for Flamingo #31