ManifoldRG / MultiNet


Research VLA to VL mapping #83

Closed: pranavguru closed this issue 2 months ago

pranavguru commented 3 months ago

Research (both literature review and architectural scoping) on utilizing VLA models for Vision-Language or language-only fine-tuning and inference.

pranavguru commented 2 months ago

Octo

Architecture

* Input tokenizers convert language, observations, and goals to tokens
   - Language inputs: T5-base
   - Image observations and goals: ConvNet stack
* Position embeddings are added to task and observation tokens, which are then arranged sequentially
* Tokens are passed through a Transformer backbone
* Attention is block-wise masked
   - Observation tokens can only attend to observation tokens from the same or earlier timesteps, and to task tokens (see the mask sketch after this list)
   - Readout tokens attend to previous observation and task tokens but are not attended to by any other token; analogous to the [CLS] token in BERT
* A lightweight action head applies a diffusion process to the readout embedding and predicts a chunk of consecutive actions
* Alternate output heads can be flexibly attached to the embeddings from the Transformer backbone
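
A minimal NumPy sketch of the block-wise masking rule described above. The token layout, function name, and per-timestep grouping are illustrative assumptions, not Octo's actual code: task tokens are visible to all queries, observation tokens are causal across timesteps, and readout tokens are attended to by no token but themselves.

```python
import numpy as np

def blockwise_mask(n_task, n_obs, n_readout, horizon):
    """Boolean mask where mask[q, k] = True means query q may attend to key k.
    Assumed token layout: [task tokens | (obs, readout) tokens per timestep]."""
    kinds = ["task"] * n_task
    steps = [-1] * n_task  # task tokens sit "before" every timestep
    for t in range(horizon):
        kinds += ["obs"] * n_obs + ["readout"] * n_readout
        steps += [t] * (n_obs + n_readout)
    n = len(kinds)
    mask = np.zeros((n, n), dtype=bool)
    for q in range(n):
        for k in range(n):
            if kinds[k] == "readout":
                mask[q, k] = (q == k)              # nothing attends to readout tokens
            elif kinds[k] == "task":
                mask[q, k] = True                  # task tokens visible to all queries
            else:                                  # key is an observation token
                mask[q, k] = steps[k] <= steps[q]  # causal across timesteps
    return mask
```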

Training objective

 * Conditional diffusion decoding head predicts continuous, multimodal actions
 * Only one forward pass of the Transformer backbone per action prediction; the multi-step denoising process is carried out entirely in the diffusion head (sketched below)
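
A hedged PyTorch sketch of that split: the backbone runs once to produce a readout embedding, and every DDPM-style denoising step happens inside a small head. The DenoiseMLP class, noise schedule, and step count here are illustrative assumptions, not Octo's actual implementation.

```python
import torch
import torch.nn as nn

class DenoiseMLP(nn.Module):
    """Tiny illustrative noise-prediction head: takes noisy actions, a timestep,
    and the readout embedding, and returns predicted noise."""
    def __init__(self, action_dim, emb_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + emb_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, x, t, cond):
        t_feat = t.float().unsqueeze(-1) / 20.0  # crude timestep encoding
        return self.net(torch.cat([x, t_feat, cond], dim=-1))

@torch.no_grad()
def predict_actions(head, readout_emb, action_dim, n_steps=20):
    """readout_emb comes from ONE backbone forward pass; all the iteration
    below runs only through the lightweight head."""
    b = readout_emb.shape[0]
    x = torch.randn(b, action_dim)                 # start from Gaussian noise
    alphas = torch.linspace(0.99, 0.75, n_steps)   # illustrative schedule
    alpha_bars = torch.cumprod(alphas, dim=0)
    for i in reversed(range(n_steps)):
        t = torch.full((b,), i)
        eps = head(x, t, readout_emb)              # head predicts the noise
        # Standard DDPM posterior-mean update
        x = (x - (1 - alphas[i]) / torch.sqrt(1 - alpha_bars[i]) * eps) / torch.sqrt(alphas[i])
        if i > 0:                                  # add noise except at the final step
            x += torch.sqrt(1 - alphas[i]) * torch.randn_like(x)
    return x  # denoised action chunk (flattened into action_dim)
```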

OpenVLA

Architecture

* Prismatic-7B VLM backbone
* Two-part vision encoder: pre-trained SigLIP and DINOv2
   - Input images are passed through both encoders, and the feature vectors from both are concatenated channel-wise
* Llama 2 for language
* Custom ActionTokenizer class that extends the regular PreTrainedTokenizerBase to handle continuous robot actions. It discretizes continuous actions into bins, maps them to tokens, and provides methods to convert between actions and tokens. This integrates continuous robot actions into the discrete token space of language models, enabling the model to handle both language and robot control in a unified manner (a simplified sketch follows this list).
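
A simplified Python sketch of the discretization idea behind ActionTokenizer. The real class extends PreTrainedTokenizerBase; here the fixed [-1, 1] range, VOCAB_SIZE, and ACTION_TOKEN_START constants are illustrative assumptions (uniform bins over a fixed range, with the last 256 vocabulary ids standing in for the repurposed least-used tokens).

```python
import numpy as np

N_BINS = 256
VOCAB_SIZE = 32000                        # Llama-2 tokenizer vocabulary size
ACTION_TOKEN_START = VOCAB_SIZE - N_BINS  # repurpose the last 256 token ids

def actions_to_token_ids(actions, low=-1.0, high=1.0):
    """Clip continuous actions to [low, high], bin into 256 uniform bins,
    and map each bin index onto one of the repurposed token ids."""
    actions = np.clip(np.asarray(actions), low, high)
    edges = np.linspace(low, high, N_BINS + 1)[1:-1]  # 255 interior bin edges
    return ACTION_TOKEN_START + np.digitize(actions, edges)

def token_ids_to_actions(token_ids, low=-1.0, high=1.0):
    """Inverse mapping: token ids back to bin-center action values."""
    edges = np.linspace(low, high, N_BINS + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    return centers[np.asarray(token_ids) - ACTION_TOKEN_START]
```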

Training objective

 * Fine-tune Prismatic for robot action prediction:
 * Formulate action prediction as a vision-language task
 * Use the VLM backbone to predict actions by mapping continuous actions to discrete tokens in the LLM tokenizer's vocabulary
 * The Llama tokenizer in the VLM backbone reserves only 100 special tokens for tokens newly introduced during fine-tuning, but OpenVLA needs 256 tokens for action-space discretization
 * To overcome this, OpenVLA overwrites the 256 least-used tokens in the vocabulary and repurposes them as action tokens
 * Once actions are processed into a sequence of tokens, the model is trained with regular next-token prediction and cross-entropy loss (sketched below)
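
A hedged sketch of that loss computation, assuming the common causal-LM convention of marking unsupervised positions (prompt and image tokens) with label -100 so they are ignored; this is not OpenVLA's exact code, just the standard next-token cross-entropy it describes.

```python
import torch
import torch.nn.functional as F

def action_prediction_loss(logits, labels):
    """logits: (B, T, V) from the VLM backbone; labels: (B, T) holding action
    token ids at supervised positions and -100 everywhere else."""
    # Shift so that position t predicts token t+1, as in standard causal LM training.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # skip non-action positions
    )
```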

Architectural modification requirements: