In-progress dataset notes, extended from here. The table below is generated from this Google Sheet; feel free to edit it, and I can periodically update the table in this comment. To generate a Markdown table from the sheet, use this.
Environment | Tasks | Episodes | Approx. Tokens | Sample Weight | Agent Used | Open-Source Repo | Additional Information | Similar Available Datasets |
---|---|---|---|---|---|---|---|---|
DM Lab | 254 | 16.4M | 194B | 9.35% | IMPALA | DM Lab | Appendix F.5 of the Gato paper mentions that they trained an IMPALA agent on a set of 18 parent DM Lab levels: "Data was collected by executing the agent on these 18 levels, as well as an additional set of 237 levels handcrafted to test a diverse set of skills." We don't have much information on the definition of those 18 "parent levels" or the 237 "handcrafted levels", but there are a lot of different levels here: https://github.com/deepmind/lab/tree/master/game_scripts/levels. Check out this paper, which claims SOTA with an IMPALA agent on DMLab-30: https://arxiv.org/pdf/1809.04474v1.pdf | |
ALE Atari | 51 | 63.4K | 1.26B | 9.50% | Muesli agent for 200M steps per environment | ALE Atari | RL Unplugged, which is sourced from batch_rl (generated from DQN replay; may want to filter, check the methodology in CQL-scale-generalizes and multi-game-dt). This repo also has filtered variants: d4rl-atari | |
ALE Atari Extended | 28 | 28.4K | 565M | 10.00% | Muesli agent for 200M steps per environment | ALE Atari | ||
Sokoban | 1 | 27.2K | 298M | 1.33% | Muesli agent | Sokoban | ||
Baby AI | 46 | 4.61M | 22.8B | 9.06% | Built-in BabyAI bot with 100,000 episodes per level | Baby AI | ||
DM Control Suite | 30 | 395K | 22.5B | 4.62% | D4PG | DM Control | In Appendix F.4 of the Gato paper, the authors mention that "for each task in the control suite, they collect two disjoint sets of data, one using only state features and another using only pixels". They use a D4PG agent to collect data from tasks with state features, and an MPO-based agent to collect data with pixels. They also collect data for randomized versions of the control suite tasks with a D4PG agent, randomizing the actuator gear, joint range, stiffness, damping, geom size, and density over a small interval and a large interval. There are some SOTA agents here: https://paperswithcode.com/dataset/deepmind-control-suite | RL Unplugged provides some datasets. Specifically, they say most DM Control data is generated with D4PG, or with V-MPO on manipulator insert ball/peg |
DM Control Suite Pixels | 28 | 485K | 35.5B | 7.07% | MPO | DM Control | ||
DM Control Suite Random Small | 26 | 10.6M | 313B | 3.04% | D4PG | DM Control | ||
DM Control Suite Random Large | 26 | 26.1M | 791B | 3.04% | D4PG | DM Control | ||
Meta-World | 45 | 94.6K | 3.39B | 8.96% | MPO agent | Meta-World | Appendix F.9 of the Gato paper mentions that they collected data from all train and test tasks in MT50 mode by training an MPO agent with unlimited environment seeds and access to the state of the MuJoCo physics engine. The collected data also contains the MuJoCo physics engine state. | |
Procgen Benchmark | 16 | 1.6M | 4.46B | 5.34% | R2D2 agent | Procgen | Appendix F.6 of the Gato paper mentions that they trained an R2D2 agent on the 16 environments at the hard difficulty setting, except for maze and heist, which they set to easy. OpenRL has some benchmarks here: https://wandb.ai/openrlbenchmark/openrlbenchmark/reportlist | |
RGB Stacking Simulator | 1 | 387K | 24.4B | 1.33% | | RGB Stacking | The repo contains specialist agents | |
RGB Stacking real robot | 1 | 15.7K | 980M | 1.33% | ||||
Modular RL | 38 | 843K | 69.6B | 8.23% | D4PG for a total of 140M steps with 30 random seeds | Modular RL | Appendix F.7 of the Gato paper mentions that the authors trained a D4PG agent on each variant for a total of 140M actor steps with 30 random seeds per variant. | |
DM Manipulation Playground | 4 | 286K | 6.58B | 1.68% | | | The Gato paper mentions it contains 4 tasks with a simulated Kinova Jaco arm, but I can't find any specific repo or source for the "DM Manipulation Playground". Searching for "jaco" in the DM Control Suite repo yields multiple results, so maybe it is included in the DM Control Suite repo? | |
Playroom | 1 | 829K | 118B | 1.33% | | | The word "Playroom" literally appears only once in the paper. I found a reference to a "Playroom" environment in a repo from Google Research: https://github.com/google-research/google-research/tree/master/playrooms | |
Total | 596 | | | 85.21% | | | | |
I really like the idea of putting together this dataset table, @daniellawson9999! We can do a similar one for vision and language (starting by porting over the one from torch-gato). Do we want to add a section with alternatives for the non-open-source datasets?
Since this has been broken into #4 #5 #6, we can likely close this, no?
Regarding the text and vision-text datasets: I suggest we consider the following open-source datasets in the first release. I tested loading some of them with Hugging Face's `load_dataset`:
Findings: the image-text datasets saved on Hugging Face (LAION-400M, M4-COCO) store the images' URLs as strings, not the actual images. There are two obvious issues with using the datasets this way: 1) an image at a given URL might be deleted and unavailable when it is needed, and 2) fetching images from each URL at use time is very inefficient and can become a performance bottleneck. Therefore, we should download such datasets from their original sources, with the images and annotations, and save them locally. We also need to write a script that aligns images with their corresponding texts so they can be fed to model training efficiently.
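To illustrate the issue, here is a minimal Python sketch of streaming one of these datasets and fetching each image from its URL at use time. The repo id and column names ("URL", "TEXT") are assumptions based on LAION-style metadata and may differ for the dataset we actually use.

```python
# Sketch only: stream a URL-based image-text dataset and fetch images on demand.
import io

import requests
from datasets import load_dataset
from PIL import Image

ds = load_dataset("laion/laion400m", split="train", streaming=True)  # assumed repo id

for row in ds.take(3):
    url, caption = row["URL"], row["TEXT"]          # assumed column names
    try:
        resp = requests.get(url, timeout=10)        # one HTTP round trip per sample
        resp.raise_for_status()
        image = Image.open(io.BytesIO(resp.content)).convert("RGB")
        print(image.size, caption[:60])
    except Exception as err:                        # dead links are common
        print(f"skipping {url}: {err}")
```

Even in this tiny loop, the per-sample HTTP round trip is the bottleneck the paragraph above describes, which is why downloading once and storing locally is preferable.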
Helen already downloaded some such datasets. Some of the downloaded ones do contain the actual images (OK-VQA, COCO), but we need a script to align images and annotations. Others (Conceptual Captions) are in Excel format with images saved as URLs, so they have the same issues as mentioned above; we either need to dig further into the data source to find the actual images or write a script to fetch all of the images and save them locally.
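For the locally downloaded COCO data, the alignment script could look roughly like the sketch below. The paths are placeholders; it assumes the standard COCO captions JSON layout ("images" with id/file_name, "annotations" with image_id/caption).

```python
# Rough sketch: pair each downloaded COCO image file with its captions.
import json
from collections import defaultdict
from pathlib import Path

COCO_ROOT = Path("/data/coco")                      # placeholder local layout
with open(COCO_ROOT / "annotations/captions_train2017.json") as f:
    meta = json.load(f)

# image id -> path on disk
id_to_file = {img["id"]: COCO_ROOT / "train2017" / img["file_name"]
              for img in meta["images"]}

# image id -> list of captions
captions = defaultdict(list)
for ann in meta["annotations"]:
    captions[ann["image_id"]].append(ann["caption"])

# One (image_path, captions) record per image, ready to be written out or
# wrapped in a Dataset class for training.
pairs = [(path, captions[img_id]) for img_id, path in id_to_file.items()]
print(len(pairs), pairs[0])
```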
I also studied the structure of the downloaded OK-VQA dataset to inspect the matching between images and annotations; so far it looks good.
Found a very good tool, img2dataset (https://github.com/rom1504/img2dataset), that can download vision-text data from a set of image URLs (even a very large set, such as 100M URLs) along with their corresponding texts into webdataset format (a group of tar files, each containing a large number of .jpg, .txt, and .json files; we can consider these tar files shards of the whole dataset).
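For reference, a hedged sketch of driving img2dataset from Python (it also ships a CLI); the file names and tuning values here are placeholders rather than settings we have validated.

```python
# Sketch: download an image-text URL list into webdataset shards with img2dataset.
from img2dataset import download

download(
    url_list="coco_urls.tsv",        # metadata file with image URLs + captions (placeholder name)
    input_format="tsv",
    url_col="url",                   # column names must match the metadata file
    caption_col="caption",
    output_format="webdataset",      # tar shards containing .jpg/.txt/.json entries
    output_folder="coco_webdataset",
    image_size=256,
    processes_count=8,
    thread_count=32,
)
```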
Adapted the download scripts from the img2dataset website and successfully tested downloading the COCO dataset into webdataset format.
Also worked on the Conceptual Captions dataset: added column names to the metadata file (in .tsv) that contains the image URLs and the corresponding captions, so the columns can be recognized by the img2dataset script. Took a portion of the metadata file, compiled a smaller .tsv file, and successfully tested downloading data from that smaller file into webdataset format. When we need the complete dataset, we can follow the same approach.
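A rough sketch of that TSV preparation step, assuming the Conceptual Captions metadata file name and a caption/url column order; adjust to whatever the actual file looks like.

```python
# Sketch: add column headers to the Conceptual Captions metadata and carve out a test slice.
import pandas as pd

# The raw metadata ships without a header row (assumed: caption column first, then URL).
df = pd.read_csv("Train_GCC-training.tsv", sep="\t", names=["caption", "url"])
df.head(10_000).to_csv("cc_small.tsv", sep="\t", index=False)  # small slice for a test download

# cc_small.tsv can then be passed to img2dataset with input_format="tsv",
# url_col="url", caption_col="caption", as in the sketch above.
```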
Need to find out how to load such a downloaded dataset into NEKO model training. webdataset does provide libraries for that purpose; this needs some further investigation. The OpenFlamingo project uses webdataset to load its datasets (MC4 and LAION-2B), so its codebase can be reviewed for some insight.
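As a starting point, here is a minimal sketch of loading the img2dataset output shards with the webdataset library into a PyTorch DataLoader; the shard pattern, key names, and transforms are assumptions.

```python
# Sketch: iterate webdataset shards as (image, caption) batches.
import webdataset as wds
from torch.utils.data import DataLoader
from torchvision import transforms

to_tensor = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

dataset = (
    wds.WebDataset("coco_webdataset/{00000..00009}.tar")   # brace-expanded shard pattern (placeholder)
    .shuffle(1000)                                          # buffer-level shuffling
    .decode("pil")                                          # decode .jpg bytes into PIL images
    .to_tuple("jpg", "txt")                                 # (image, caption) pairs
    .map_tuple(to_tensor, lambda caption: caption)          # tensors collate cleanly into batches
)

loader = DataLoader(dataset, batch_size=32, num_workers=4)
for images, captions in loader:
    print(images.shape, len(captions))
    break
```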
For a huge dataset such as LAION-400M, the downloaded data is about 10 TB, and it takes at least a few days or even a week to download (depending on the machine doing the downloading). If we plan to use this dataset, we can't do it before we get our hands on a cloud service such as AWS/GCP. We need an efficient scheme to download the data and feed it into the model for distributed training on AWS/GCP.
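One possible pattern, assuming the shards end up mirrored to object storage, is to stream them directly with webdataset's `pipe:` URLs rather than staging all 10 TB on each node's local disk; the bucket name and shard range below are placeholders.

```python
# Sketch: stream tar shards from S3 on the fly via a pipe: URL (requires the aws CLI).
import webdataset as wds

shards = "pipe:aws s3 cp s3://neko-datasets/laion400m/{000000..000099}.tar -"  # hypothetical bucket/range
dataset = wds.WebDataset(shards).decode("pil").to_tuple("jpg", "txt")
```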
Therefore, in the first phase, we may only be able to use some smaller datasets, such as:
We should also add another phase before the first phase, let's call it phase 0, for code sanity checks, i.e. verifying that the code runs and the training loops can be launched. For that purpose, we need even smaller datasets, since we will most likely be doing this phase's work on a consumer PC (perhaps with only a CPU, not even a single GPU). The above-mentioned datasets usually have a much smaller counterpart on Hugging Face for this purpose. Some examples:
- https://huggingface.co/datasets/NeelNanda/pile-10k, the first 10k rows of the Pile
- https://huggingface.co/datasets/RIW/small-coco, about 10k rows of COCO
- https://huggingface.co/datasets/Graphcore/vqa, a tiny fraction of the VQA dataset
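A phase-0 sanity-check sketch using those tiny mirrors; the split names are assumptions and may need adjusting per dataset card.

```python
# Sketch: load the small Hugging Face mirrors to confirm the data pipeline runs end to end.
from datasets import load_dataset

pile_10k = load_dataset("NeelNanda/pile-10k", split="train")    # ~10k rows of the Pile
small_coco = load_dataset("RIW/small-coco", split="train")      # ~10k rows of COCO (assumed split name)
tiny_vqa = load_dataset("Graphcore/vqa", split="train")         # small VQA subset (assumed split name)

print(len(pile_10k), len(small_coco), len(tiny_vqa))
```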
The datasets used in the original Gato paper are varied and numerous. We need a preliminary analysis of what data is available, what data has equivalents, and what data is not clearly sourceable.