
Embodied Chain of Thought: a robotic policy that reasons to solve the task.
MIT License

Robotic Control via Embodied Chain-of-Thought Reasoning


Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, Sergey Levine


We present Embodied Chain-of-Thought Reasoning (ECoT): a novel approach for training robotic policies. We train a vision-language-action model to generate reasoning steps in response to instructions and images before choosing a robot action, enabling better performance, interpretability, and generalization.

Our codebase is built on top of OpenVLA; we refer the reader to its documentation for details on the code and dependencies.

Quickstart

We provide a Colab notebook containing code for loading up our ECoT policy and using it to generate reasoning and actions in response to an observation. Loading the model for inference is easy:

import torch
from transformers import AutoModelForVision2Seq, AutoProcessor

device = "cuda"
path_to_hf = "Embodied-CoT/ecot-openvla-7b-bridge"
processor = AutoProcessor.from_pretrained(path_to_hf, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(path_to_hf, torch_dtype=torch.bfloat16, trust_remote_code=True).to(device)

image = <ROBOT IMAGE OBSERVATION HERE>  # PIL image from the robot's camera
instruction = <YOUR INSTRUCTION HERE>   # natural-language task description
prompt = "A chat between a curious user and an artificial intelligence assistant. " + \
    "The assistant gives helpful, detailed, and polite answers to the user's questions. " + \
    f"USER: What action should the robot take to {instruction.lower()}? ASSISTANT: TASK:"

inputs = processor(prompt, image).to(device, dtype=torch.bfloat16)
action, generated_ids = vla.predict_action(**inputs, unnorm_key="bridge_orig", max_new_tokens=1024)
generated_text = processor.batch_decode(generated_ids)[0]
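
The decoded text contains the generated embodied reasoning chain followed by the action tokens. As a minimal sketch (relying only on the chat prompt format shown above), you can inspect the model's output by stripping the prompt prefix:

# Everything after "ASSISTANT:" is generated by the model: the reasoning chain and the action tokens.
reasoning = generated_text.split("ASSISTANT:")[-1]
print(reasoning)
print(action)  # the continuous robot action, un-normalized using the Bridge dataset statistics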

The standard model in torch.bfloat16 requires 16 GB of GPU memory, but using bitsandbytes and 4-bit quantization lowers memory usage to around 5 GB. See the Colab for more details.
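
For reference, the sketch below shows one way to load the model in 4-bit precision using the standard BitsAndBytesConfig integration in transformers; the exact configuration used in the Colab may differ.

import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

path_to_hf = "Embodied-CoT/ecot-openvla-7b-bridge"
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

processor = AutoProcessor.from_pretrained(path_to_hf, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    path_to_hf,
    quantization_config=quant_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)  # bitsandbytes places the quantized weights on the GPU; do not call .to(device) on the model

# Inputs are prepared and moved to the GPU exactly as in the bfloat16 example above.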

Training and Evaluation

To train the models from scratch, use the following command:

torchrun --standalone --nnodes 1 --nproc-per-node 8 vla-scripts/train.py  \
  --vla.type "prism-dinosiglip-224px+mx-bridge"  \
  --data_root_dir <path to training data root>  \
  --run_root_dir <path to checkpoint saving directory>  \
  --wandb_project <wandb project name>  \
  --wandb_entity <wandb user name>

To evaluate the model on a WidowX robot, run:

python3 experiments/bridge/eval_model_in_bridge_env.py \
  --model.type prism-dinosiglip-224px+7b \
  --pretrained_checkpoint <path to checkpoint> \
  --host_ip <robot interface IP> \
  --port <robot interface port>

Additionally, we provide instructions for converting, compiling, and evaluating our ECoT VLA with TensorRT-LLM, which drastically improves its inference speed while maintaining performance.

Pretrained Models

We release two ECoT models trained as part of our work, along with the dataset of reasoning annotations, on our HuggingFace page.

Explicit Notes on Model Licensing & Commercial Use: While all code in this repository is released under an MIT License, our pretrained models may inherit restrictions from the underlying base models we use. Specifically, both the above models are derived from Llama-2, and as such are subject to the Llama Community License.


Installation

See the original OpenVLA repository for detailed installation instructions.

Repository Structure

High-level overview of repository/project file-tree:


Citation

If you find our code or models useful in your work, please cite our paper:

@article{Zawalski24-ecot,
    title={Robotic Control via Embodied Chain-of-Thought Reasoning},
    author={Michał Zawalski and William Chen and Karl Pertsch and Oier Mees and Chelsea Finn and Sergey Levine},
    journal={arXiv preprint arXiv:2407.08693},
    year={2024}
}