pranavguru commented 3 months ago

Research (both literature review and architectural scoping) around utilizing VLMs for Control fine-tuning and inference.

devjwsong commented 3 months ago

Possible ways:

- JSON output with corresponding key-value pairs.
- Simple textual expression. (e.g. "put a ball into a bowl")
- Python code. (e.g. grab(object))
- Reserving special action tokens. (e.g. RT-2)

devjwsong commented 3 months ago

Some thoughts:

First, we should choose an approach that requires no architectural changes, which means a prompt-based approach.
- We should be able to evaluate the model's performance without fine-tuning.
- In fact, for some VLMs we cannot modify the model or its vocabulary at all, because they are closed-source.
The output should be easy to parse and interpret.
- JSON: simple, easy to parse, interpretable.
- Python: useful when executing an actual robot arm.
- Text: less formatted, hard to control.
If we choose the JSON format, the actions are represented as follows:
- Discrete action spaces:
  - DM Lab: {distance: far, reward: positive}
  - Atari: {action: leftfire} or just a category number given the list.
- Continuous action spaces:
  - MuJoCo: {thigh_joint: 0.3579, leg_joint: -0.424, foot_joint: 0.987}
  - DM Control Suite: {arm_root: 0.465, arm_wrist: -0.985, ...}
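A minimal sketch of how such JSON responses could be decoded into environment actions. The action list, the joint ordering, and the clipping range are illustrative assumptions, not taken from any dataset spec. Note that strict JSON needs quoted keys, unlike the shorthand above.

```python
import json

# Hypothetical action vocabularies; the real ones depend on the environment.
ATARI_ACTIONS = ["noop", "fire", "left", "right", "leftfire", "rightfire"]
MUJOCO_JOINTS = ["thigh_joint", "leg_joint", "foot_joint"]


def decode_discrete(response: str) -> int:
    """Map {"action": "<name>"} to an index into ATARI_ACTIONS."""
    name = json.loads(response)["action"]
    return ATARI_ACTIONS.index(name)


def decode_continuous(response: str, low: float = -1.0, high: float = 1.0) -> list:
    """Map {"<joint>": value, ...} to a vector ordered like MUJOCO_JOINTS."""
    values = json.loads(response)
    # Clip each value to the assumed valid range [low, high].
    return [min(max(float(values[j]), low), high) for j in MUJOCO_JOINTS]


discrete = decode_discrete('{"action": "leftfire"}')
continuous = decode_continuous(
    '{"thigh_joint": 0.3579, "leg_joint": -0.424, "foot_joint": 0.987}'
)
```

Keeping the key order in a fixed list, rather than relying on the order the model emits, makes the decoded vector independent of how the response happens to be serialized.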
The steps of inference/fine-tuning:
- For each dataset, we design a prompt that specifies, as precisely as possible, what the response should look like and which keys and values it should include.
- We then add this dataset-specific instruction to the input when running inference with a model.
- We then parse the JSON response into actual action values to update the environment.
- When fine-tuning, we include the same instruction in the fine-tuning data so that the model learns to follow the specified output format.
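The steps above can be sketched roughly as follows. The instruction wording, the `fake_model` stand-in, and the prompt/completion record layout are all assumptions made for illustration; a real VLM call and a real fine-tuning format would replace them.

```python
import json
import re

# Hypothetical per-dataset instruction; the wording and keys are assumptions.
INSTRUCTIONS = {
    "mujoco": (
        "Respond with a single JSON object and nothing else. "
        'Keys: "thigh_joint", "leg_joint", "foot_joint". '
        "Values: floats in [-1, 1]."
    ),
}


def build_prompt(dataset: str, observation: str) -> str:
    """Prepend the dataset-specific output instruction to the observation text."""
    return f"{INSTRUCTIONS[dataset]}\n\nObservation: {observation}"


def parse_action(response: str) -> dict:
    """Extract the first flat {...} span from the response and parse it as JSON."""
    match = re.search(r"\{.*?\}", response, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model response")
    return json.loads(match.group(0))


def make_finetuning_example(dataset: str, observation: str, action: dict) -> dict:
    """Pair the same instruction-bearing prompt with the target JSON string."""
    return {
        "prompt": build_prompt(dataset, observation),
        "completion": json.dumps(action),
    }


def fake_model(prompt: str) -> str:
    """Stand-in for a VLM call; a real model would be queried here."""
    return 'Sure. {"thigh_joint": 0.3, "leg_joint": -0.4, "foot_joint": 0.9}'


prompt = build_prompt("mujoco", "hopper is leaning forward")
action = parse_action(fake_model(prompt))
example = make_finetuning_example("mujoco", "hopper is leaning forward", action)
```

The parser tolerates extra text around the JSON object, since prompted models do not always return the object alone. The non-greedy pattern only handles flat objects, which is enough for the action formats listed above.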

devjwsong commented 3 months ago

Reference literature:

Empowering Large Language Models on Robotics Manipulation with Affordance Prompting
- LLM+A framework.
  - Observation Descriptor: A VLM takes a visual observation and a designed prompt template to generate the text description.
    - This description includes the spatial information of the objects and all functional parts of them. (e.g. position, four sides of a box)
  - Sub-task planner: With the text description from the VLM and the natural language instruction, it decomposes the task into a sequence of sub-tasks.
    - Functions of the arm, guidelines/context, task instruction, text description of the observation…
  - Motion controller: It generates the actual action sequence with the decomposed sub-tasks and affordance values.
    - We can also provide some examples as demonstrations and the form of the results, such as JSON.
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
- The key idea is that the LLM/VLM is trained to generate action tokens directly, alongside its text generation tasks.
  - No additional parser or interpreter is needed.
- Robot-Action Fine-tuning
  - They reused 256 existing tokens as action tokens, with each continuous action dimension discretized into 256 bins.
  - Then they fine-tuned the VLM into an action generator that outputs 6 DoFs and a termination command.
    - The action tokens are either the least frequently used tokens or tokens reserved for special purposes.
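A minimal sketch of the 256-bin discretization idea. The uniform binning, the [-1, 1] range, and the space-separated string layout are assumptions for illustration and may differ from what RT-2 actually does; mapping bin indices onto real vocabulary tokens is left out.

```python
NUM_BINS = 256


def to_bin(value: float, low: float, high: float) -> int:
    """Uniformly discretize a value in [low, high] into an integer in [0, 255]."""
    clipped = min(max(value, low), high)
    idx = int((clipped - low) / (high - low) * NUM_BINS)
    return min(idx, NUM_BINS - 1)  # value == high falls into the last bin


def from_bin(idx: int, low: float, high: float) -> float:
    """Return the center of bin idx as the decoded continuous value."""
    width = (high - low) / NUM_BINS
    return low + (idx + 0.5) * width


def encode_action(terminate: bool, dims: list, low: float = -1.0, high: float = 1.0) -> str:
    """Render a termination flag plus discretized dimensions as one string."""
    parts = [str(int(terminate))] + [str(to_bin(d, low, high)) for d in dims]
    return " ".join(parts)


# Six continuous dimensions (one per DoF) plus the termination flag.
tokens = encode_action(False, [-1.0, 0.0, 1.0, 0.5, -0.5, 0.25])
```

Decoding with bin centers bounds the round-trip error by half a bin width, which is the price paid for turning a regression problem into token prediction.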
Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning
- This work argues that the conventional use of LLMs in robotics control is not visually grounded and therefore requires an additional affordance model.
  - This is sub-optimal because the visual model may be limited and its description of the observation may be less informative.
- They used GPT-4V to directly ground the visual information into the language model.
  - First, the model comprehends the environment through CoT, generating the overall task-related object and location information.
  - Then it generates the steps of tasks to accomplish the goal.
  - Only the first step is chosen; it is then added to the finished-plan information used when planning the next step.
    - The steps are just natural language instructions.
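The re-planning loop described above could look roughly like this. `scripted_planner` is a hypothetical stand-in for the GPT-4V call, and the stopping rule (an empty step list) is an assumption rather than something the paper specifies.

```python
from typing import Callable, List


def closed_loop_plan(
    goal: str,
    propose_steps: Callable[[str, List[str]], List[str]],
    max_steps: int = 10,
) -> List[str]:
    """Re-plan each round, commit only the first proposed step, and feed it back."""
    finished: List[str] = []
    for _ in range(max_steps):
        remaining = propose_steps(goal, finished)
        if not remaining:  # planner signals that the goal is reached
            break
        finished.append(remaining[0])
        # A real system would execute the step and capture a new image here.
    return finished


def scripted_planner(goal: str, finished: List[str]) -> List[str]:
    """Stand-in for the VLM call: returns the steps not yet completed."""
    full_plan = [
        "open the drawer",
        "pick up the ball",
        "place the ball in the drawer",
    ]
    return full_plan[len(finished):]


plan = closed_loop_plan("put the ball in the drawer", scripted_planner)
```

Committing one step at a time lets each new plan be conditioned on the latest observation instead of on a plan made before anything was executed.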
From Text to Motion: Grounding GPT-4 in a Humanoid Robot “ALTER3”
- This work maps the robot's actuators to 43 axes and assigns each an intensity value from 0 to 255.
- Steps to control the robot
  - First, GPT-4 generates the motion recipe given the prompt, which includes the task instruction, examples, and guidelines.
  - Next, GPT-4 generates Python code that performs each motion using the alter.set_axes function.
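A rough sketch of what such generated code might target. The notes above only name `alter.set_axes`, so the `(axes, values)` signature, the 1-indexed axes, and the `MockAlter` class are all assumptions made for illustration.

```python
NUM_AXES = 43
MAX_INTENSITY = 255


class MockAlter:
    """Stand-in robot; the real set_axes signature is not given in these notes."""

    def __init__(self):
        self.axes = [0] * NUM_AXES

    def set_axes(self, axes, values):
        """Set each 1-indexed axis to an intensity in [0, 255]."""
        if len(axes) != len(values):
            raise ValueError("axes and values must have the same length")
        for axis, value in zip(axes, values):
            if not 1 <= axis <= NUM_AXES:
                raise ValueError(f"axis {axis} out of range")
            if not 0 <= value <= MAX_INTENSITY:
                raise ValueError(f"value {value} out of range")
            self.axes[axis - 1] = value


alter = MockAlter()
# The kind of line an LLM might emit for one motion step (axis numbers are made up).
alter.set_axes([1, 2, 3], [255, 128, 0])
```

Validating the axis index and intensity range before applying them is a cheap guard against malformed generated code reaching the hardware.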
GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration
- Two main pipelines
  - Symbolic task planner
    - It takes the video, text or both to generate a sequence of robot actions.
      - The video analyzer uses GPT-4V to recognize the human actions and converts them into textual instructions.
      - The scene analyzer uses GPT-4V to generate the environmental information (e.g. objects, spatial relations, properties…) given the first frame and instruction.
      - The task planner generates the task sequence, corresponding instructions, and environmental results given the instruction and the environment information.
        - The output is in JSON format.
        - The actions are represented as Action(arg1, arg2, …).
    - The text can also be corrective feedback on the model’s recognition results.
  - Affordance analyzer
    - It analyzes the video to determine when and where the tasks occur.
    - It extracts the affordance information for task executions.
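A small sketch of how a JSON plan with `Action(arg1, arg2, …)` strings could be parsed. The `task_sequence` key and the action names are illustrative assumptions, not the paper's exact schema.

```python
import json
import re

# Matches a call-like string such as "MoveHand(spam, shelf)".
ACTION_PATTERN = re.compile(r"^\s*(\w+)\((.*)\)\s*$")


def parse_action(call: str):
    """Split 'Name(arg1, arg2)' into ('Name', ['arg1', 'arg2'])."""
    match = ACTION_PATTERN.match(call)
    if match is None:
        raise ValueError(f"not an action call: {call!r}")
    name, raw_args = match.groups()
    args = [a.strip() for a in raw_args.split(",") if a.strip()]
    return name, args


# Hypothetical planner output; the key and action names are illustrative only.
response = '{"task_sequence": ["Grab(spam)", "MoveHand(spam, shelf)", "Release(spam)"]}'
plan = [parse_action(step) for step in json.loads(response)["task_sequence"]]
```

Parsing the calls into name and argument lists, rather than evaluating them as code, keeps the executor in control of which robot functions can be invoked.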

devjwsong commented 3 months ago

Moving on to testing.

ManifoldRG / MultiNet

Research VLM to Control mapping #84

#85 #86 #87