"Real-world" Multi-modal Robotics Control Datasets

daniellawson9999 commented 1 year ago

Background

The initially discussed dataset proposal focused on MuJoCo and Atari environments, which are useful for research but are far from real-world environments. Since the release of Gato, there have been several exciting papers which train language-conditioned transformer policies for robotics. For example, a model could be trained to complete generic language tasks within some bounds, given language prompts such as "move the green star next to the red block". The datasets and environments used by these papers could be interesting to explore with Neko. The two datasets which I introduce (Language Table, VIMA Bench) are both new and quite exciting. A model trained on these datasets may be usable for practical robotics tasks, or by those wanting to experiment with language-conditioned robotic control.

Papers and Corresponding Datasets and Environments

Several notable papers include:

RT-1: Robotics Transformer for Real-World Control at Scale

Interactive Language: Talking to Robots in Real Time

VIMA: General Robot Manipulation with Multimodal Prompts

Availability:

RT-1:
- RT-1 is predominantly designed for real-world language-conditioned control. While some code is released, along with pretrained models, there is no simulated environment setup, and other parts of the code are not yet released. I would still recommend looking at the paper, which uses a Gato-style architecture as a baseline in some experiments.
Interactive Language:
- Also language-conditioned, focused on object rearrangement. Provides the full implementation and the Language Table benchmark and datasets, along with a checkpoint.
VIMA:
- Multi-modal, language-image conditioned object manipulation tasks. Provides a release of the model architectures and pretrained models: https://github.com/vimalabs/VIMA/tree/main. Also releases the full dataset and environment, VIMA-Bench; the dataset can also be found on Hugging Face.
- I would highly recommend reading this paper. It has an interesting encoder-decoder, object-centric architecture, with a good comparison to and explanation of Gato and other decoder-only style architectures. The repository is based on PyTorch and is very clear, and it also includes an implementation of their Gato baseline, which may be a good reference for us: https://github.com/vimalabs/VIMA/blob/main/vima/policy/vima_gato_policy.py

Thus, either VIMA (Bench) or Interactive Language (Language Table) could be a great environment and dataset for us to incorporate into Gato. Personally, I slightly prefer VIMA to start with: it already seems closely tailored to work with Gato-style tokenization, it has really good documentation, and it supports multi-modal prompting through both language and images, while Language Table's model input is language.
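To make the prompting difference concrete, here is a minimal sketch of how the two prompt styles could be represented on our side. Every name in it (`TextSegment`, `ImageSegment`, `Prompt`, the crop names and sizes) is hypothetical and is not taken from the VIMA or Language Table code; the second example prompt is only illustrative of interleaving text with image crops.

```python
from dataclasses import dataclass, field
from typing import List, Set, Union


@dataclass
class TextSegment:
    """A span of natural-language text in a prompt."""
    text: str


@dataclass
class ImageSegment:
    """Stand-in for an image in a prompt (hypothetical: a name plus a size)."""
    name: str
    height: int
    width: int


PromptSegment = Union[TextSegment, ImageSegment]


@dataclass
class Prompt:
    """An ordered sequence of text and image segments."""
    segments: List[PromptSegment] = field(default_factory=list)

    def modalities(self) -> Set[str]:
        # Collect the distinct modalities that appear in this prompt.
        return {
            "image" if isinstance(seg, ImageSegment) else "text"
            for seg in self.segments
        }

    def is_multimodal(self) -> bool:
        return len(self.modalities()) > 1


# Language-only prompt, in the style of the example from the Background section.
language_only = Prompt([TextSegment("move the green star next to the red block")])

# Interleaved text-and-image prompt (illustrative only).
multimodal = Prompt([
    TextSegment("put the"),
    ImageSegment("object_crop", 64, 64),
    TextSegment("into the"),
    ImageSegment("container_crop", 64, 64),
])
```

A container like this would let a single tokenization path handle both datasets: a language-only prompt is just the special case with no image segments.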

Output

Separate issues can be created for sourcing datasets that follow this general direction. For the general procedure for sourcing a control dataset, refer to meta-issue https://github.com/ManifoldRG/NEKO/issues/13 (converting to Minari). Other example issues for control datasets are https://github.com/ManifoldRG/NEKO/issues/12 and https://github.com/ManifoldRG/NEKO/issues/14.

Feel free to discuss thoughts for this issue here, or create a separate issue for one of these individual datasets which contains more information or tracks progress towards its conversion.

BobakBagheri commented 1 year ago

This issue needs pickup from @snat-s as part of the effort to review datasets of interest. We also need to include more recent information than what was available back in June, specifically thinking of OpenRX.

AshutoshPanda2002 commented 10 months ago

If considering simulation environments, I think the Vista Driving Sim from MIT could be something useful. It "provides an interface for transforming real-world datasets into virtual environments with dynamic agents, sensor suites, and task objectives".

BobakBagheri commented 10 months ago

Closing. The dataset reconciliation effort is captured in #56, so refer to that issue for all dataset-related issues.
