ManifoldRG / MultiNet


Research VLA to VL mapping #83

Closed: pranavguru closed this issue 2 months ago

pranavguru commented 3 months ago

Research (both literature review and architectural scoping) on utilizing VLA models for Vision-Language or language-only fine-tuning and inference.

pranavguru commented 2 months ago

Octo

Architecture

* Input tokenizers convert language, observations, and goals to tokens
   - Language inputs: T5-base
   - Image observations and goals: ConvNet stack
* Position embeddings are added to task and observation tokens, which are then arranged sequentially
* Tokens are passed through a Transformer backbone
* Attention is block-wise masked
   - Observation tokens can only attend to observation tokens from the same or earlier timesteps, and to task tokens (see the mask sketch after this list)
   - Readout tokens attend to previous observation and task tokens but are not attended to by any other token; analogous to the [CLS] token in BERT
* A lightweight action head applies a diffusion process to the readout embedding and predicts a chunk of consecutive actions
* Alternate output heads can be flexibly attached to the embeddings from the Transformer backbone
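
A minimal NumPy sketch of the block-wise masking rule described above. The token layout, function name, and per-timestep grouping are illustrative assumptions, not Octo's actual code: task tokens are visible to all queries, observation tokens are causal across timesteps, and readout tokens are attended to by no token but themselves.

```python
import numpy as np

def blockwise_mask(n_task, n_obs, n_readout, horizon):
    """Boolean mask where mask[q, k] = True means query q may attend to key k.
    Assumed token layout: [task tokens | (obs, readout) tokens per timestep]."""
    kinds = ["task"] * n_task
    steps = [-1] * n_task  # task tokens sit "before" every timestep
    for t in range(horizon):
        kinds += ["obs"] * n_obs + ["readout"] * n_readout
        steps += [t] * (n_obs + n_readout)
    n = len(kinds)
    mask = np.zeros((n, n), dtype=bool)
    for q in range(n):
        for k in range(n):
            if kinds[k] == "readout":
                mask[q, k] = (q == k)              # nothing attends to readout tokens
            elif kinds[k] == "task":
                mask[q, k] = True                  # task tokens visible to all queries
            else:                                  # key is an observation token
                mask[q, k] = steps[k] <= steps[q]  # causal across timesteps
    return mask
```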

Training objective

 * Conditional diffusion decoding head predicts continuous, multimodal actions
 * Only one forward pass of the Transformer backbone per action prediction; the multi-step denoising process is carried out entirely in the diffusion head (sketched below)
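
A hedged PyTorch sketch of that split: the backbone runs once to produce a readout embedding, and every DDPM-style denoising step happens inside a small head. The DenoiseMLP class, noise schedule, and step count here are illustrative assumptions, not Octo's actual implementation.

```python
import torch
import torch.nn as nn

class DenoiseMLP(nn.Module):
    """Tiny illustrative noise-prediction head: takes noisy actions, a timestep,
    and the readout embedding, and returns predicted noise."""
    def __init__(self, action_dim, emb_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + emb_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, x, t, cond):
        t_feat = t.float().unsqueeze(-1) / 20.0  # crude timestep encoding
        return self.net(torch.cat([x, t_feat, cond], dim=-1))

@torch.no_grad()
def predict_actions(head, readout_emb, action_dim, n_steps=20):
    """readout_emb comes from ONE backbone forward pass; all the iteration
    below runs only through the lightweight head."""
    b = readout_emb.shape[0]
    x = torch.randn(b, action_dim)                 # start from Gaussian noise
    alphas = torch.linspace(0.99, 0.75, n_steps)   # illustrative schedule
    alpha_bars = torch.cumprod(alphas, dim=0)
    for i in reversed(range(n_steps)):
        t = torch.full((b,), i)
        eps = head(x, t, readout_emb)              # head predicts the noise
        # Standard DDPM posterior-mean update
        x = (x - (1 - alphas[i]) / torch.sqrt(1 - alpha_bars[i]) * eps) / torch.sqrt(alphas[i])
        if i > 0:                                  # add noise except at the final step
            x += torch.sqrt(1 - alphas[i]) * torch.randn_like(x)
    return x  # denoised action chunk (flattened into action_dim)
```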

OpenVLA

Architecture

* Prismatic-7B VLM backbone
* Two-part vision encoder: pre-trained SigLIP and DINOv2
   - Input images are passed through both encoders, and the feature vectors from both are concatenated channel-wise
* Llama 2 for language
* Custom ActionTokenizer class that extends the regular PreTrainedTokenizerBase to handle continuous robot actions. It discretizes continuous actions into bins, maps them to tokens, and provides methods to convert between actions and tokens. This integrates continuous robot actions into the discrete token space of language models, enabling the model to handle both language and robot control in a unified manner (a simplified sketch follows this list).
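
A simplified Python sketch of the discretization idea behind ActionTokenizer. The real class extends PreTrainedTokenizerBase; here the fixed [-1, 1] range, VOCAB_SIZE, and ACTION_TOKEN_START constants are illustrative assumptions (uniform bins over a fixed range, with the last 256 vocabulary ids standing in for the repurposed least-used tokens).

```python
import numpy as np

N_BINS = 256
VOCAB_SIZE = 32000                        # Llama-2 tokenizer vocabulary size
ACTION_TOKEN_START = VOCAB_SIZE - N_BINS  # repurpose the last 256 token ids

def actions_to_token_ids(actions, low=-1.0, high=1.0):
    """Clip continuous actions to [low, high], bin into 256 uniform bins,
    and map each bin index onto one of the repurposed token ids."""
    actions = np.clip(np.asarray(actions), low, high)
    edges = np.linspace(low, high, N_BINS + 1)[1:-1]  # 255 interior bin edges
    return ACTION_TOKEN_START + np.digitize(actions, edges)

def token_ids_to_actions(token_ids, low=-1.0, high=1.0):
    """Inverse mapping: token ids back to bin-center action values."""
    edges = np.linspace(low, high, N_BINS + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    return centers[np.asarray(token_ids) - ACTION_TOKEN_START]
```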

Training objective

 * Fine-tune Prismatic for robot action prediction:
 * Formulate action prediction as a vision-language task
 * Use the VLM backbone to predict actions by mapping continuous actions to discrete tokens in the LLM tokenizer's vocabulary
 * The Llama tokenizer in the VLM backbone reserves only 100 special tokens for tokens newly introduced during fine-tuning, but OpenVLA needs 256 tokens for action-space discretization
 * To overcome this, OpenVLA overwrites the 256 least-used tokens in the vocabulary and repurposes them as action tokens
 * Once actions are processed into a sequence of tokens, the model is trained with regular next-token prediction and cross-entropy loss (sketched below)
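
A hedged sketch of that loss computation, assuming the common causal-LM convention of marking unsupervised positions (prompt and image tokens) with label -100 so they are ignored; this is not OpenVLA's exact code, just the standard next-token cross-entropy it describes.

```python
import torch
import torch.nn.functional as F

def action_prediction_loss(logits, labels):
    """logits: (B, T, V) from the VLM backbone; labels: (B, T) holding action
    token ids at supervised positions and -100 everywhere else."""
    # Shift so that position t predicts token t+1, as in standard causal LM training.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # skip non-action positions
    )
```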

Architectural modification requirements: