In-progress dataset notes, extended from here. The table below is generated from this Google Sheet; feel free to edit it, and I can periodically update the table in this comment. To generate a Markdown table from the sheet, use this.
Environment | Tasks | Episodes | Approx. Tokens | Sample Weight | Agent Used | Open-Source Repo | Additional Information | Similar Available Datasets |
---|---|---|---|---|---|---|---|---|
DM Lab | 254 | 16.4M | 194B | 9.35% | IMPALA | DM Lab | Appendix F.5 of the Gato paper mentions that they trained an IMPALA agent on a set of 18 parent DM Lab levels: "Data was collected by executing the agent on these 18 levels, as well as an additional set of 237 levels handcrafted to test a diverse set of skills." We don't have much information on the definition of those 18 "parent levels" or the 237 "handcrafted levels", but there are a lot of different levels here: https://github.com/deepmind/lab/tree/master/game_scripts/levels. Check out this paper, which claims SOTA with an IMPALA agent on DMLab-30: https://arxiv.org/pdf/1809.04474v1.pdf | |
ALE Atari | 51 | 63.4K | 1.26B | 9.50% | Muesli agent for 200M steps per environment | ALE Atari | RL Unplugged, which is sourced from batch_rl (generated from DQN replay; may want to filter, check the methodology in CQL-scale-generalizes and multi-game-dt). This repo also has filtered variants: d4rl-atari | |
ALE Atari Extended | 28 | 28.4K | 565M | 10.00% | Muesli agent for 200M steps per environment | ALE Atari | ||
Sokoban | 1 | 27.2K | 298M | 1.33% | Muesli agent | Sokoban | ||
Baby AI | 46 | 4.61M | 22.8B | 9.06% | Built-in BabyAI bot with 100,000 episodes per level | Baby AI | ||
DM Control Suite | 30 | 395K | 22.5B | 4.62% | D4PG | DM Control | In Appendix F.4 of the Gato paper, the authors mention that "for each task in the control suite, they collect two disjoint sets of data, one using only state features and another using only pixels". They use a D4PG agent to collect data from tasks with state features, and an MPO-based agent to collect data with pixels. They also collect data for randomized versions of the control suite tasks with a D4PG agent, randomizing the actuator gear, joint range, stiffness, damping, geom size, and density over a small interval and a large interval. There are some SOTA agents here: https://paperswithcode.com/dataset/deepmind-control-suite | RL Unplugged provides some datasets. Specifically, they say most DM Control data is generated with D4PG, or with V-MPO on manipulator insert ball/peg |
DM Control Suite Pixels | 28 | 485K | 35.5B | 7.07% | MPO | DM Control | ||
DM Control Suite Random Small | 26 | 10.6M | 313B | 3.04% | D4PG | DM Control | ||
DM Control Suite Random Large | 26 | 26.1M | 791B | 3.04% | D4PG | DM Control | ||
Meta-World | 45 | 94.6K | 3.39B | 8.96% | MPO agent | Meta-World | Appendix F.9 of the Gato paper mentions that they collected data from all train and test tasks in MT50 mode by training an MPO agent with unlimited environment seeds and access to the state of the MuJoCo physics engine. The collected data also contains the MuJoCo physics engine state. | |
Procgen Benchmark | 16 | 1.6M | 4.46B | 5.34% | R2D2 agent | Procgen | Appendix F.6 of the Gato paper mentions that they trained an R2D2 agent on the 16 environments at the hard difficulty setting, except for maze and heist, which they set to easy. OpenRL has some benchmarks here: https://wandb.ai/openrlbenchmark/openrlbenchmark/reportlist | |
RGB Stacking Simulator | 1 | 387K | 24.4B | 1.33% | | RGB Stacking | The repo contains specialist agents | |
RGB Stacking real robot | 1 | 15.7K | 980M | 1.33% | ||||
Modular RL | 38 | 843K | 69.6B | 8.23% | D4PG for a total of 140M steps with 30 random seeds | Modular RL | Appendix F.7 of the Gato paper mentions that the authors trained a D4PG agent on each variant for a total of 140M actor steps with 30 random seeds per variant. | |
DM Manipulation Playground | 4 | 286K | 6.58B | 1.68% | | | The Gato paper mentions it contains 4 tasks with a simulated Kinova Jaco arm, but I can't find any specific repo or source for the "DM Manipulation Playground". Searching for "jaco" in the DM Control Suite repo yields multiple results, so maybe it is included in the DM Control Suite repo? | |
Playroom | 1 | 829K | 118B | 1.33% | | | The word "Playroom" literally appears only once in the paper. I found a reference to a "Playroom" environment in a repo from Google Research: https://github.com/google-research/google-research/tree/master/playrooms | |
Total | 596 | | | 85.21% | | | | |
I really like the idea of putting together this dataset table, @daniellawson9999! We can do a similar one for vision and language (starting by porting over the one from torch-gato). Do we want to add a section with alternatives for the non-open-source datasets?
Since this has been broken into #4 #5 #6, we can likely close this, no?
Regarding the text and vision-text datasets: I suggest we consider the following open-source datasets in the first release. I tested loading some of them with Hugging Face's `load_dataset`:
Findings: the image-text datasets saved on Hugging Face (LAION-400M, M4-COCO) store the images' URLs as strings, not the actual images. There are two obvious issues with using the datasets this way: 1) an image at a given URL might be deleted and unavailable when it is needed, and 2) fetching images from each URL at use time is very inefficient and can become a performance bottleneck. Therefore, we should download such datasets from their original sources, with the images and annotations, and save them locally. We also need to write a script that aligns images with their corresponding texts so they can be fed to model training efficiently.
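To illustrate the issue, here is a minimal Python sketch of streaming one of these datasets and fetching each image from its URL at use time. The repo id and column names ("URL", "TEXT") are assumptions based on LAION-style metadata and may differ for the dataset we actually use.

```python
# Sketch only: stream a URL-based image-text dataset and fetch images on demand.
import io

import requests
from datasets import load_dataset
from PIL import Image

ds = load_dataset("laion/laion400m", split="train", streaming=True)  # assumed repo id

for row in ds.take(3):
    url, caption = row["URL"], row["TEXT"]          # assumed column names
    try:
        resp = requests.get(url, timeout=10)        # one HTTP round trip per sample
        resp.raise_for_status()
        image = Image.open(io.BytesIO(resp.content)).convert("RGB")
        print(image.size, caption[:60])
    except Exception as err:                        # dead links are common
        print(f"skipping {url}: {err}")
```

Even in this tiny loop, the per-sample HTTP round trip is the bottleneck the paragraph above describes, which is why downloading once and storing locally is preferable.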
Helen already downloaded some such datasets. Some of the downloaded ones do contain the actual images (OK-VQA, COCO), but we need a script to align images and annotations. Others (Conceptual Captions) are in Excel format with images saved as URLs, so they have the same issues as mentioned above; we either need to dig further into the data source to find the actual images or write a script to fetch all of the images and save them locally.
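For the locally downloaded COCO data, the alignment script could look roughly like the sketch below. The paths are placeholders; it assumes the standard COCO captions JSON layout ("images" with id/file_name, "annotations" with image_id/caption).

```python
# Rough sketch: pair each downloaded COCO image file with its captions.
import json
from collections import defaultdict
from pathlib import Path

COCO_ROOT = Path("/data/coco")                      # placeholder local layout
with open(COCO_ROOT / "annotations/captions_train2017.json") as f:
    meta = json.load(f)

# image id -> path on disk
id_to_file = {img["id"]: COCO_ROOT / "train2017" / img["file_name"]
              for img in meta["images"]}

# image id -> list of captions
captions = defaultdict(list)
for ann in meta["annotations"]:
    captions[ann["image_id"]].append(ann["caption"])

# One (image_path, captions) record per image, ready to be written out or
# wrapped in a Dataset class for training.
pairs = [(path, captions[img_id]) for img_id, path in id_to_file.items()]
print(len(pairs), pairs[0])
```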
I also studied the structure of the downloaded OK-VQA dataset to inspect the matching between images and annotations; so far it looks good.
Found a very good tool, img2dataset (https://github.com/rom1504/img2dataset), that can download vision-text data from a set of image URLs (even a very large set, such as 100M URLs) along with their corresponding texts into webdataset format (a group of tar files, each containing a large number of .jpg, .txt, and .json files; we can consider these tar files shards of the whole dataset).
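For reference, a hedged sketch of driving img2dataset from Python (it also ships a CLI); the file names and tuning values here are placeholders rather than settings we have validated.

```python
# Sketch: download an image-text URL list into webdataset shards with img2dataset.
from img2dataset import download

download(
    url_list="coco_urls.tsv",        # metadata file with image URLs + captions (placeholder name)
    input_format="tsv",
    url_col="url",                   # column names must match the metadata file
    caption_col="caption",
    output_format="webdataset",      # tar shards containing .jpg/.txt/.json entries
    output_folder="coco_webdataset",
    image_size=256,
    processes_count=8,
    thread_count=32,
)
```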
Adapted the download scripts from the img2dataset website and successfully tested downloading the COCO dataset into webdataset format.
Also worked on the Conceptual Captions dataset: added column names to the metadata file (in .tsv) that contains the image URLs and the corresponding captions, so the columns can be recognized by the img2dataset script. Took a portion of the metadata file, compiled a smaller .tsv file, and successfully tested downloading data from that smaller file into webdataset format. When we need the complete dataset, we can follow the same approach.
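A rough sketch of that TSV preparation step, assuming the Conceptual Captions metadata file name and a caption/url column order; adjust to whatever the actual file looks like.

```python
# Sketch: add column headers to the Conceptual Captions metadata and carve out a test slice.
import pandas as pd

# The raw metadata ships without a header row (assumed: caption column first, then URL).
df = pd.read_csv("Train_GCC-training.tsv", sep="\t", names=["caption", "url"])
df.head(10_000).to_csv("cc_small.tsv", sep="\t", index=False)  # small slice for a test download

# cc_small.tsv can then be passed to img2dataset with input_format="tsv",
# url_col="url", caption_col="caption", as in the sketch above.
```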
Need to find out how to load such a downloaded dataset into NEKO model training. webdataset does provide libraries for that purpose; this needs some further investigation. The OpenFlamingo project uses webdataset to load its datasets (MC4 and LAION-2B), so its codebase can be reviewed for some insight.
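As a starting point, here is a minimal sketch of loading the img2dataset output shards with the webdataset library into a PyTorch DataLoader; the shard pattern, key names, and transforms are assumptions.

```python
# Sketch: iterate webdataset shards as (image, caption) batches.
import webdataset as wds
from torch.utils.data import DataLoader
from torchvision import transforms

to_tensor = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

dataset = (
    wds.WebDataset("coco_webdataset/{00000..00009}.tar")   # brace-expanded shard pattern (placeholder)
    .shuffle(1000)                                          # buffer-level shuffling
    .decode("pil")                                          # decode .jpg bytes into PIL images
    .to_tuple("jpg", "txt")                                 # (image, caption) pairs
    .map_tuple(to_tensor, lambda caption: caption)          # tensors collate cleanly into batches
)

loader = DataLoader(dataset, batch_size=32, num_workers=4)
for images, captions in loader:
    print(images.shape, len(captions))
    break
```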
For a huge dataset such as LAION-400M, the downloaded data is about 10 TB, and it takes at least a few days or even a week to download (depending on the machine doing the downloading). If we plan to use this dataset, we can't do it before we get our hands on a cloud service such as AWS/GCP. We need an efficient scheme to download the data and feed it into the model for distributed training on AWS/GCP.
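One possible pattern, assuming the shards end up mirrored to object storage, is to stream them directly with webdataset's `pipe:` URLs rather than staging all 10 TB on each node's local disk; the bucket name and shard range below are placeholders.

```python
# Sketch: stream tar shards from S3 on the fly via a pipe: URL (requires the aws CLI).
import webdataset as wds

shards = "pipe:aws s3 cp s3://neko-datasets/laion400m/{000000..000099}.tar -"  # hypothetical bucket/range
dataset = wds.WebDataset(shards).decode("pil").to_tuple("jpg", "txt")
```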
Therefore, in the first phase, we may only be able to use some smaller datasets, such as:
We should also add another phase before the first phase, let's call it phase 0, for code sanity checks, i.e. verifying that the code runs and the training loops can be launched. For that purpose, we need even smaller datasets, since we will most likely be doing this phase's work on a consumer PC (perhaps with only a CPU, not even a single GPU). The above-mentioned datasets usually have a much smaller counterpart on Hugging Face for this purpose. Some examples:
- https://huggingface.co/datasets/NeelNanda/pile-10k, the first 10k rows of the Pile
- https://huggingface.co/datasets/RIW/small-coco, about 10k rows of COCO
- https://huggingface.co/datasets/Graphcore/vqa, a tiny fraction of the VQA dataset
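A phase-0 sanity-check sketch using those tiny mirrors; the split names are assumptions and may need adjusting per dataset card.

```python
# Sketch: load the small Hugging Face mirrors to confirm the data pipeline runs end to end.
from datasets import load_dataset

pile_10k = load_dataset("NeelNanda/pile-10k", split="train")    # ~10k rows of the Pile
small_coco = load_dataset("RIW/small-coco", split="train")      # ~10k rows of COCO (assumed split name)
tiny_vqa = load_dataset("Graphcore/vqa", split="train")         # small VQA subset (assumed split name)

print(len(pile_10k), len(small_coco), len(tiny_vqa))
```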
The datasets used in the original Gato paper are varied and numerous. We need a preliminary analysis of what data is available, what data has equivalents, and what data is not clearly sourceable.