🔥 Check out our new work CodeNav, which addresses many limitations of VisProg and generalizes it further: 🔥
Instead of writing tool descriptions, point to the codebase that you want the CodeNav agent to use (that's right, the raw source code!). CodeNav will index and search the source code directly.
Instead of generating the whole program at once, CodeNav iteratively generates code (which imports and invokes functions and classes from your codebase), executes it, and then decides the next step based on the execution output. The next step could be searching the codebase or writing more code.
Instead of generating one function call per line, CodeNav generates free-form code, similar to writing a code cell in an IPython notebook. While executing the current code block, CodeNav has access to global variables created while executing previous code blocks.
Instead of giving up if there's an execution error, CodeNav looks at execution results, including errors, new variables created, and STDOUT, and tries to fix errors in the next step.
Instead of implementing tools as simple function calls, CodeNav gives you, as the developer of tools, the flexibility to build a full-fledged codebase as you see fit. Use abstractions, use object-oriented programming, and generally follow good software development practices (meaningful class/function/variable names, docstrings, and argument types in your code all help).
By Tanmay Gupta and Aniruddha Kembhavi
[ Project Page | Arxiv Paper | Blog ]
This repository contains the official code for VisProg - a neuro-symbolic system that solves complex and compositional visual tasks given natural language instructions. VisProg uses the in-context learning ability of GPT3 to generate python programs which are then executed to get both the solution and a comprehensive and interpretable rationale. Each line of the generated program may invoke one of several off-the-shelf computer vision models, image processing routines, or python functions to produce intermediate outputs that may be consumed by subsequent parts of the program.
This code base has been designed to be:
:white_check_mark: easy to use (a simple ipynb per task)
:white_check_mark: easy to extend with new functionality by adding new modules to VisProg
:white_check_mark: easy to extend to new tasks by adding in-context examples for these tasks
:white_check_mark: minimal and modular to make it easy to dig into and build upon
conda env create -f environment.yaml
conda activate visprog
Having set up and activated the conda environment, you should be all set to run the notebooks in the notebooks/ folder. If you use an editor like VSCode, opening the .ipynb files within VSCode might be the easiest way to get started.
You will find a notebook for each of the following tasks, but they are quite similar in structure:
notebooks/ok_det.ipynb
notebooks/image_editing.ipynb
notebooks/nlvr.ipynb
notebooks/gqa.ipynb
Simply enter your OpenAI API key in the cell that currently reads <Enter your key here> and run the notebook. The notebooks are designed to be self-contained and should run end-to-end without any additional setup.
The basic structure of the notebooks is as follows:
Import the ProgramGenerator and ProgramInterpreter classes
Import PROMPT (a text string containing in-context examples) or create_prompt (a function that creates the prompt on the fly)
Create the ProgramGenerator and ProgramInterpreter objects
Generate a program with ProgramGenerator
Execute the program with ProgramInterpreter
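As a rough illustration of the create_prompt idea mentioned above, the sketch below assembles a prompt from in-context examples followed by the new question. The `EXAMPLES` list, the `Question:`/`Program:` layout, and the function signature are assumptions made for illustration; the prompt files in this repository may be organized differently.

```python
# Hypothetical in-context examples; the layout is an assumption, not the repository's format.
EXAMPLES = [
    {
        "question": "How many people are in the image?",
        "program": "\n".join([
            "BOX0=LOC(image=IMAGE,object='people')",
            "ANSWER0=COUNT(box=BOX0)",
            "FINAL_RESULT=RESULT(var=ANSWER0)",
        ]),
    },
]


def create_prompt(question, examples=EXAMPLES):
    """Build a prompt: solved examples first, then the new question with an open Program slot."""
    parts = []
    for ex in examples:
        parts.append(f"Question: {ex['question']}\nProgram:\n{ex['program']}\n")
    # The language model is expected to continue the text after the final "Program:" line.
    parts.append(f"Question: {question}\nProgram:\n")
    return "\n".join(parts)


prompt = create_prompt("Is there a camel in the image?")
print(prompt)
```

A fixed PROMPT string would simply be the result of calling such a function once with a hard-coded set of examples.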
We have tried to make it easy to visualize each step of the execution trace.
For instance, when running the gqa notebook for the instruction "How many people or animals are in the image?" on assets/camel1.png, you should see the following outputs:
BOX0=LOC(image=IMAGE,object='people')
BOX1=LOC(image=IMAGE,object='animals')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="{ANSWER0} + {ANSWER1}")
FINAL_RESULT=RESULT(var=ANSWER2)
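To make the data flow in this program concrete, here is a minimal toy executor, assuming each line has the form `OUTPUT=STEP(arg=value,...)`. It is not the repository's interpreter: `FAKE_DETECTIONS`, `parse_line`, and `run` are hypothetical names, and the detections are made-up stand-ins for a real localization model.

```python
import re

PROGRAM = """\
BOX0=LOC(image=IMAGE,object='people')
BOX1=LOC(image=IMAGE,object='animals')
ANSWER0=COUNT(box=BOX0)
ANSWER1=COUNT(box=BOX1)
ANSWER2=EVAL(expr="{ANSWER0} + {ANSWER1}")
FINAL_RESULT=RESULT(var=ANSWER2)
"""

# Made-up boxes (x1, y1, x2, y2); a real LOC module would run an object detector.
FAKE_DETECTIONS = {
    "people": [(10, 20, 50, 90)],
    "animals": [(60, 30, 140, 110), (150, 40, 220, 120)],
}

STEP_RE = re.compile(r"^(\w+)=(\w+)\((.*)\)$")
ARG_RE = re.compile(r"""(\w+)=("[^"]*"|'[^']*'|\w+)""")


def parse_line(line):
    """Split one program line into output variable, step name, and argument dict."""
    out_var, step, arg_str = STEP_RE.match(line.strip()).groups()
    return out_var, step, dict(ARG_RE.findall(arg_str))


def unquote(text):
    return text.strip("'\"")


def run(program, state):
    """Execute the program line by line, storing every output in the shared state."""
    for line in program.strip().splitlines():
        out_var, step, args = parse_line(line)
        if step == "LOC":
            value = FAKE_DETECTIONS.get(unquote(args["object"]), [])
        elif step == "COUNT":
            value = len(state[args["box"]])
        elif step == "EVAL":
            # Fill in earlier outputs, then evaluate the arithmetic expression.
            expr = unquote(args["expr"]).format(**state)
            value = eval(expr, {"__builtins__": {}})  # acceptable only for a toy example
        elif step == "RESULT":
            value = state[args["var"]]
        else:
            raise ValueError(f"Unknown step: {step}")
        state[out_var] = value
    return state


state = run(PROGRAM, {"IMAGE": "assets/camel1.png"})
print(state["FINAL_RESULT"])  # 3 with the made-up detections above
```

The printed value reflects only the fake detections, not what the actual models return for the camel image. The point is that each step reads earlier variables from a shared state and writes its own output back.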
It is possible that the instruction you provide is not solved correctly by VisProg. This can happen for a few reasons:
Add new modules for enabling these functionalities to engine/step_interpreters.py. Don't forget to register these modules in the register_step_interpreters function in the same file. Here's the step interpreter for the COUNT module. All modules have a similar structure with a parse, html, and execute function. The parse function parses the program string to extract the arguments and output variable. The html function generates the HTML representation for the execution trace. The execute function executes the module and returns the output, along with the HTML for the execution trace if inspect=True.
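# This snippet lives in engine/step_interpreters.py and relies on parse_step
# and the html_* helpers already being in scope there.
class CountInterpreter():
    # Module name as it appears in generated programs, e.g. ANSWER0=COUNT(box=BOX0)
    step_name = 'COUNT'

    def __init__(self):
        print(f'Registering {self.step_name} step')

    def parse(self,prog_step):
        # Extract the input box variable and the output variable from the program string
        parse_result = parse_step(prog_step.prog_str)
        step_name = parse_result['step_name']
        box_var = parse_result['args']['box']
        output_var = parse_result['output_var']
        assert(step_name==self.step_name)
        return box_var,output_var

    def html(self,box_img,output_var,count):
        # Render this step of the execution trace as an HTML snippet
        step_name = html_step_name(self.step_name)
        output_var = html_var_name(output_var)
        box_arg = html_arg_name('bbox')
        box_img = html_embed_image(box_img)
        output = html_output(count)
        return f"""<div>{output_var}={step_name}({box_arg}={box_img})={output}</div>"""

    def execute(self,prog_step,inspect=False):
        box_var,output_var = self.parse(prog_step)
        # Read the boxes produced by an earlier step and store the count in the shared state
        boxes = prog_step.state[box_var]
        count = len(boxes)
        prog_step.state[output_var] = count
        if inspect:
            # The image with boxes drawn is stored under '<box_var>_IMAGE'
            box_img = prog_step.state[box_var+'_IMAGE']
            html_str = self.html(box_img, output_var, count)
            return count, html_str

        return count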
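As a hedged sketch of what a new module following the same parse / html / execute structure could look like, the example below defines a hypothetical LARGEST step that keeps the biggest box. Everything here is an assumption made so the snippet runs on its own: `parse_step` is a simplified stand-in for the repository's helper, boxes are taken to be `(x1, y1, x2, y2)` tuples, and the `registry` dict only mimics registration (the real register_step_interpreters function may work differently).

```python
import re
from types import SimpleNamespace


def parse_step(prog_str):
    """Simplified stand-in for the repository's parse_step; returns the same three keys."""
    out_var, step_name, arg_str = re.match(r"^(\w+)=(\w+)\((.*)\)$", prog_str.strip()).groups()
    args = dict(re.findall(r"(\w+)=([^,]+)", arg_str))
    return {"output_var": out_var, "step_name": step_name, "args": args}


class LargestBoxInterpreter:
    """Hypothetical LARGEST module: keeps the box with the largest area."""
    step_name = "LARGEST"

    def parse(self, prog_step):
        parse_result = parse_step(prog_step.prog_str)
        assert parse_result["step_name"] == self.step_name
        return parse_result["args"]["box"], parse_result["output_var"]

    def html(self, output_var, box):
        # Plain-text trace entry; the real modules use richer HTML helpers.
        return f"<div>{output_var}={self.step_name}(box)={box}</div>"

    def execute(self, prog_step, inspect=False):
        box_var, output_var = self.parse(prog_step)
        boxes = prog_step.state[box_var]
        # Area of an (x1, y1, x2, y2) box is width times height.
        largest = max(boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1])) if boxes else None
        prog_step.state[output_var] = largest
        if inspect:
            return largest, self.html(output_var, largest)
        return largest


# Stand-in for registration: map each step name to an interpreter instance.
registry = {cls.step_name: cls() for cls in (LargestBoxInterpreter,)}

# A minimal object with the two attributes the module reads: prog_str and state.
step = SimpleNamespace(
    prog_str="BOX1=LARGEST(box=BOX0)",
    state={"BOX0": [(0, 0, 10, 10), (5, 5, 50, 40)]},
)
result, html_str = registry["LARGEST"].execute(step, inspect=True)
print(result)    # (5, 5, 50, 40)
print(html_str)  # <div>BOX1=LARGEST(box)=(5, 5, 50, 40)</div>
```

Exercising a module in isolation like this, with a hand-built state, can be a quick way to check its behavior before wiring it into a full program.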
Add in-context examples for your task to prompts/your_task_or_dataset_name.py. Note that instead of using in-context examples to generate programs, you may experiment with different ways of prompting, such as providing function signatures and docstrings, without needing to change the code at all!
Use a notebook in the notebooks/ folder or create a Python script to run inference on a large number of examples.
*Note that we have replaced ViLT for VQA with a more performant model called BLIP, which was recently made available on Huggingface. This shows how easy it is to swap out or upgrade modules in VisProg.
GPT3 now uses text-davinci-003, changed from text-davinci-002.
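The note above about prompting with function signatures and docstrings instead of in-context examples could look roughly like the sketch below. The stub functions, their docstrings, and the prompt layout are all hypothetical; they only illustrate building a prompt from Python's `inspect` module and are not taken from this repository.

```python
import inspect


# Hypothetical stubs that describe modules; only their signatures and docstrings are used.
def LOC(image, object: str) -> list:
    """Return bounding boxes for every instance of `object` in `image`."""


def COUNT(box: list) -> int:
    """Return the number of boxes in `box`."""


def describe(fn):
    """Format one module as its signature followed by its docstring."""
    return f"{fn.__name__}{inspect.signature(fn)}\n    {inspect.getdoc(fn)}"


def create_prompt(instruction, modules=(LOC, COUNT)):
    """Build a prompt that lists module signatures, then the instruction to solve."""
    header = "Available modules:\n" + "\n".join(describe(m) for m in modules)
    return f"{header}\n\nInstruction: {instruction}\nProgram:\n"


prompt = create_prompt("How many people or animals are in the image?")
print(prompt)
```

Whether this style of prompt works as well as in-context examples is something to check empirically for your task.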
If you find this code useful in your research, please consider citing:
@article{Gupta2022VisProg,
title={Visual Programming: Compositional visual reasoning without training},
author={Tanmay Gupta and Aniruddha Kembhavi},
journal={ArXiv},
year={2022},
volume={abs/2211.11559}
}