LAION-AI / project-menu

Projects at LAION

Web page dataset (flamingo-like) #27

Open rom1504 opened 2 years ago

rom1504 commented 2 years ago

https://arxiv.org/abs/2204.14198

[Screenshot from the Flamingo paper: interleaved web-page dataset statistics]

43M pages, yielding 183M images and 182GB of text. Max 5 images per page (they limit it to that).

Sequences of text and images that are broadly in the same context. I think we would need some filtering beyond "dump the whole page".

It's quite likely that some images are very unrelated to the rest of the text, which isn't useful.

I guess we could use CLIP to tell us what to keep.
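A rough sketch of what that could look like with open_clip (the model choice and the 0.28 threshold are illustrative assumptions, not a tested setting):

```python
# Sketch: keep an image only if CLIP says it matches its surrounding text.
# Assumption: a LAION-style similarity threshold (0.28) that would need tuning.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def keep_image(image: Image.Image, surrounding_text: str, threshold: float = 0.28) -> bool:
    """Return True if the image is similar enough to its neighboring text."""
    with torch.no_grad():
        img = model.encode_image(preprocess(image).unsqueeze(0))
        txt = model.encode_text(tokenizer([surrounding_text]))
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        return (img @ txt.T).item() >= threshold
```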

rom1504 commented 2 years ago

This kind of dataset allows training models that combine text and images in documents.

[Screenshot from the Flamingo paper]

mehdidc commented 2 years ago

Also, a bit more info about the data collection (of Flamingo, in the appendix):

[Screenshot: data collection details from the Flamingo appendix]

rom1504 commented 2 years ago

Recent question about the best format. My answer:

They suggest text + 5 images + a sequence of local ids (tokens + image IDs). I think the only reasonable solution is a recursive format, i.e.:

  1. Pick a container (e.g. webdataset, parquet, tfrecord, ...)
  2. Store each sample inside with a second format. You have a choice here again: tar, tfrecord, parquet, json

I think I would recommend tar inside tar, it's the most straightforward.

So in practice it would look like this:
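A rough sketch of the tar-inside-tar idea (the shard and sample naming below is an assumption, not a fixed convention):

```python
# Sketch: an outer "shard" tar containing one inner tar per web page;
# the inner tar stores the page's interleaved text and images in document order.
import io
import tarfile

def add_bytes(tar: tarfile.TarFile, name: str, data: bytes) -> None:
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

def pack_page(parts: list[tuple[str, bytes]]) -> bytes:
    """Pack one page as an inner tar; `parts` is the interleaved sequence
    of ("txt" | "jpg", payload) in page order."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as inner:
        for i, (ext, payload) in enumerate(parts):
            add_bytes(inner, f"{i:04d}.{ext}", payload)
    return buf.getvalue()

# Outer shard, webdataset-style: one inner tar per sample key.
with tarfile.open("shard-00000.tar", "w") as outer:
    page = pack_page([("txt", b"intro text"), ("txt", b"more text")])
    add_bytes(outer, "sample-0000000.tar", page)
```

A reader (e.g. a webdataset-style pipeline) would then untar each sample in memory to recover the interleaved sequence.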

TheodoreGalanos commented 2 years ago

So I'm very interested in this and have a dataset in mind. I'll try to figure out how to scrape it, I like the structure that rom suggests above. I wouldn't mind some help with architectural implementation if someone is interested in this.

rom1504 commented 2 years ago

@christophschuhmann video on it https://youtu.be/NAspBQmxK4U

next step: write a small doc with

JeniaJitsev commented 2 years ago

I think we should discuss it more thoroughly. In the video, the suggestion is to use CLIP to decide which text fits which image in a sequence of text and images, applying CLIP only to neighboring text and images. I think this is problematic, because CLIP is trained on image-text pairs only and does not respect sequence structure with more than two elements. It ignores higher-order dependencies (for example arbitrary triplets ($T_1;I_1;T_2$), quadruplets ($T_1;I_1;T_2;I_2$), etc., in the way an AR sequence model would support them) and also dependencies at different scales (e.g. $T_1;I_4;T_7;I_9$) that may well exist and make sense in a longer sequence of interleaved text and images. CLIP-based selection may thus exclude these essential dependencies from learning.

As a simple example based on the video explanation: even when using only pairs, comparing $T_1$ to $I_1$ may lead to excluding $T_1$, but it can be that $T_1$ is highly relevant to $I_3$, which follows later (a dependency over a longer distance in the sequence). Further, it may happen that $T_1$ is not relevant to $I_1$ alone, but together with $T_2$ and $I_4$ it becomes relevant for the sequence (different scales and higher-order dependence). Arguably, those are exactly the interesting dependencies that contain a lot of useful information beyond the short image-text pair alignments modeled by CLIP, and we should attempt to learn them as a sequence modelling problem (as the original Flamingo does).

We should therefore be careful whether and how we apply CLIP to the original sequences from webpages, to avoid discarding important information. I think it would be better to keep whole sequences of text-image-text-image-... extracted from webpages or other documents intact (perhaps using pre-trained language models to get rid of clearly nonsensical text, or image models to detect strongly non-image-like objects tagged as images) and let learning consume those as data for a sequence learning task, where a pre-trained CLIP may have a role in an auxiliary loss term.
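To make the longer-range point concrete, here is a small sketch that scores every text block against every image of a page rather than only neighboring pairs (the model choice is an assumption); a text block would then only be a removal candidate if it matches none of the page's images:

```python
# Sketch: full text x image CLIP similarity matrix for one page,
# so that e.g. T_1 matching I_3 is visible before any filtering decision.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def page_similarity_matrix(texts: list[str], images: list[Image.Image]) -> torch.Tensor:
    """Return a (num_texts, num_images) cosine-similarity matrix."""
    with torch.no_grad():
        txt = model.encode_text(tokenizer(texts))
        img = model.encode_image(torch.stack([preprocess(im) for im in images]))
        txt = txt / txt.norm(dim=-1, keepdim=True)
        img = img / img.norm(dim=-1, keepdim=True)
        return txt @ img.T

# A text block is a removal candidate only if its best match over ALL images
# on the page is weak, not just the match with its immediate neighbor:
# best_match_per_text = page_similarity_matrix(texts, images).max(dim=1).values
```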

christophschuhmann commented 2 years ago

Here is my proposal for the outcomes of the preprocessing of Common Crawl. How we filter the outputs of this first stage, for example with or without CLIP, can be determined later. I think we have to experiment and use what works best.

  1. All image links from Common Crawl that are surrounded by natural language text, stored together with all Text-Image-Text samples of the same webpage in the same parquet row. E.g.: IM 1 - TXT 1, TXT 2 - IM 2 - TXT 3, TXT 3 - IM 3 (assuming the page begins with IM 1 and ends with IM 3).

  2. An interleaved image-text dataset parsed from each website with sufficient content, with the structure TXT 1 - IM 1 - TXT 2 - IM 2 - TXT 3 - ... This can be constructed from the previous dataset by merging texts that overlap (see the sketch after this list).

  3. All audio links from Common Crawl that are surrounded by natural language text, stored together with all Text-Audio-Text samples of the same webpage in the same parquet row. E.g.: AUDIO 1 - TXT 1, TXT 2 - AUDIO 2 - TXT 3, TXT 3 - AUDIO 3 (assuming the page begins with AUDIO 1 and ends with AUDIO 3).

  4. All video links from Common Crawl that are surrounded by natural language text, stored together with all Text-Video-Text samples of the same webpage in the same parquet row. E.g.: VIDEO 1 - TXT 1, TXT 2 - VIDEO 2 - TXT 3, TXT 3 - VIDEO 3 (assuming the page begins with VIDEO 1 and ends with VIDEO 3).

  5. An interleaved image-audio-video-text dataset parsed from each website with sufficient content, with the structure TXT 1 - IM 1 - TXT 2 - AUDIO 1 - TXT 3 - IM 2 - TXT 4 - VIDEO 1 - TXT 5 - ... This can be constructed from the previous datasets by merging texts that overlap.

  6. A dataset of all image URLs in Common Crawl (deduplicated by URL at this stage)

  7. A dataset of all audio URLs in Common Crawl (deduplicated by URL at this stage)

  8. A dataset of all video URLs in Common Crawl (deduplicated by URL at this stage)

I would suggest only considering text up to 256 tokens / words in length before and after each image, audio, or video file.
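As a rough illustration of points 1 and 2 above (the field names and the exact merge rule are assumptions), the per-page parquet rows and the overlap-merging step could look like this:

```python
# Sketch: one parquet row per page holding its Text-Image-Text samples,
# plus a helper that merges overlapping texts into an interleaved sequence.
import pyarrow as pa

page_schema = pa.schema([
    ("url", pa.string()),
    # one entry per image on the page, in document order
    ("samples", pa.list_(pa.struct([
        ("text_before", pa.string()),
        ("image_url", pa.string()),
        ("text_after", pa.string()),
    ]))),
])

def interleave(samples: list[dict]) -> list[tuple[str, str]]:
    """Merge the Text-Image-Text samples of one page into
    [("txt", ...), ("img", ...), ("txt", ...), ...], dropping the duplicated
    text where text_after of one sample equals text_before of the next."""
    out: list[tuple[str, str]] = []
    for i, s in enumerate(samples):
        if s["text_before"] and (i == 0 or samples[i - 1]["text_after"] != s["text_before"]):
            out.append(("txt", s["text_before"]))
        out.append(("img", s["image_url"]))
        if s["text_after"]:
            out.append(("txt", s["text_after"]))
    return out
```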

——

Independent of how we eventually filter the interleaved datasets, we could additionally use the CLIP filtering approach that we applied in LAION (maybe with a more powerful model like CLIP ViT-H or ViT-L/14 then) to create an even bigger CLIP-filtered image-text pair dataset.

And we can also apply CLIP-style filtering with CLAP and video-CLIP models to the surrounding texts to get audio-text and video-text pair datasets analogous to LAION-5B, once we have decent CLAP / video-CLIP models.

——

In my opinion it would also make sense to compute the CLIP embeddings of all images, to deduplicate them by CLIP embedding and to derive aesthetics scores from the embeddings.

—> Get many, many more images with high aesthetics scores for training generative models.
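A minimal sketch of embedding-based deduplication, assuming precomputed, L2-normalized CLIP embeddings (brute-force here; at LAION scale an ANN index such as faiss would be needed, and the 0.95 cosine threshold is a guess):

```python
# Sketch: near-duplicate removal on precomputed, L2-normalized CLIP embeddings.
# Brute force O(n^2) for illustration only; the 0.95 threshold is an assumption.
import numpy as np

def dedup_by_embedding(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Return indices of images to keep; later near-duplicates are dropped."""
    kept: list[int] = []
    for i, emb in enumerate(embeddings):
        if all(float(emb @ embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Example: 1000 random unit vectors with CLIP ViT-L/14 dimensionality (768).
rng = np.random.default_rng(0)
vecs = rng.normal(size=(1000, 768)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
print(len(dedup_by_embedding(vecs)))
```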

christophschuhmann commented 2 years ago

Here is a first proof of concept to filter natural language text from WARC files from Common Crawl: https://colab.research.google.com/drive/1d10Stm4J2IIPcbjHF4HzwkBQJsXGJMhi#scrollTo=1NxragaBqgHZ&uniqifier=3
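For comparison, a minimal sketch of pulling interleaved text and image URLs out of Common Crawl WARC responses (this is not the notebook's code; warcio + BeautifulSoup is just one possible tooling choice):

```python
# Sketch: iterate HTML responses in a Common Crawl WARC file and extract
# a crude interleaved sequence of text blocks and <img> URLs per page.
from bs4 import BeautifulSoup
from warcio.archiveiterator import ArchiveIterator

def iter_pages(warc_path: str):
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            content_type = (
                record.http_headers.get_header("Content-Type") if record.http_headers else ""
            ) or ""
            if "text/html" not in content_type:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            soup = BeautifulSoup(record.content_stream().read(), "html.parser")
            parts = []
            # Walk paragraphs and images in document order.
            for node in soup.find_all(["p", "img"]):
                if node.name == "img" and node.get("src"):
                    parts.append(("img", node["src"]))
                elif node.name == "p":
                    text = node.get_text(" ", strip=True)
                    if text:
                        parts.append(("txt", text))
            yield url, parts
```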