DavidMChan / caption-by-committee

Using LLMs and pre-trained caption models for super-human performance on image captioning.

IC3: Image Captioning by Committee Consensus

Method overview diagram

This is the implementation of the paper IC3: Image Captioning by Committee Consensus.

Installation

The library can be installed with:

# Install LAVIS for BLIP/BLIP2 support
$ pip install salesforce-lavis
# Install the local directory with setuptools
$ pip install .
# For the metrics, we need to download and install a spacy model
$ python -m spacy download en_core_web_lg

Next, set up environment variables with the relevant API keys, if you want to use the API-based models:

# For OpenAI-based models, specify the following keys:
export OPENAI_API_KEY=<api key>
export OPENAI_API_ORG=<org>

# For Huggingface Inference engine models, specify the following keys:
export HUGGINGFACE_API_KEY=<api key>
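
If these are set in a shell profile, it can help to confirm that they are actually visible to the process you run cbc from. A minimal check (the variable names are the ones listed above; the snippet itself is just illustrative and not part of the library):

import os

# Print which of the optional API keys are visible to this process.
# Only the engines you actually use need their corresponding key set.
for key in ("OPENAI_API_KEY", "OPENAI_API_ORG", "HUGGINGFACE_API_KEY"):
    print(key, "is set" if os.environ.get(key) else "is missing")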

The repository can be tested by running cbc caption test/test_image.jpg, which should produce a sample caption using the OFA and GPT-2 models.

Running the model using the CLI

To run the model using the CLI, you can use:

$ cbc caption <image path>

If you have a full dataset of examples, you can use:

$ cbc evaluate-dataset <dataset json>

Where the JSON format (minimally) looks like:

[
    {
        "references": ["List", "of", "references"],
        "image_path": "Relative path to image"
    },
    ...
]
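
As a rough sketch, a dataset file in this format can be generated with nothing more than the standard library (the image paths and reference captions below are hypothetical placeholders):

import json

# Each entry pairs an image path (relative to where you run the CLI)
# with a list of human-written reference captions.
dataset = [
    {
        "references": ["A dog runs across a grassy field.", "A brown dog playing outside."],
        "image_path": "images/dog.jpg",
    },
    {
        "references": ["A red bicycle leaning against a brick wall."],
        "image_path": "images/bicycle.jpg",
    },
]

with open("my_dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)

The resulting file can then be passed to cbc evaluate-dataset my_dataset.json.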

For more details on these commands, see cbc caption --help and cbc evaluate-dataset --help.

Using the python API

To use the python API, see the following minimal example using GPT3 and OFA:

from PIL import Image

from cbc.caption import OFACaptionEngine
from cbc.caption_by_committee import caption_by_committee
from cbc.lm import GPT3Davinci3

def run_caption() -> None:
    # Load the image
    image = Image.open("coco_test_images/COCO_val2014_000000165547.jpg").convert("RGB")

    # Construct a captioning engine (see: cbc/caption/__init__.py for available engines)
    caption_engine = OFACaptionEngine(device="cuda:1")

    # Construct a language model engine (see cbc/lm/__init__.py for available engines)
    lm_engine = GPT3Davinci3()

    # Generate the caption
    caption = caption_by_committee(
        image,
        caption_engine=caption_engine,
        lm_engine=lm_engine,
        caption_engine_temperature=1.0,
        n_captions=15,
    )

    print(caption)
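
To run this end to end as a script, the function can simply be called under a main guard. Note that device="cuda:1" above assumes a machine with at least two GPUs; on a single-GPU machine, "cuda:0" (or possibly "cpu", if the engine supports CPU inference) is the more likely choice.

if __name__ == "__main__":
    run_caption()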

Available Captioning/LM Engines

The following captioning and language models are available for use with this library:

Captioning

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

ChatCaptioner: ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions (url)

Language Modeling

OpenAI (Requires setting the OPENAI_API_KEY and OPENAI_API_ORG environment variables):

Huggingface (Requires setting the HUGGINGFACE_API_KEY environment variable):

Huggingface (No API key required):

Summary Models:

LLaMA: Open and Efficient Foundation Language Models (Requires setting the HUGGINGFACE_LLAMA_WEIGHTS_ROOT environment variable and preprocessing the weights according to this url.):

Alpaca: A Strong, Replicable Instruction-Following Model (Requires setting the HUGGINGFACE_ALPACA_WEIGHTS_ROOT environment variable and preprocessing the weights according to this url.):

Koala: A Dialogue Model for Academic Research (Requires setting the HUGGINGFACE_KOALA_WEIGHTS_ROOT environment variable and preprocessing the weights according to this url.):

Vicuna: An Open Chatbot Impressing GPT-4 (Requires setting the HUGGINGFACE_VICUNA_WEIGHTS_ROOT environment variable and preprocessing the weights according to this url.):

StableLM: Stability AI Language Models

Bard (Requires setting the GOOGLE_BARD_SESSION_ID environment variable. To get this value, go to https://bard.google.com/, log in, press F12 to open the developer tools, open the "Application" tab, then "Cookies", and copy the value of the "__Secure-1PSID" cookie.):

PaLM (Requires the Vertex AI client libraries from Google, https://cloud.google.com/vertex-ai/docs/start/client-libraries, and a GCP project with the Vertex AI API enabled.):

Claude (Requires setting the ANTHROPIC_API_KEY environment variable)

Running the demos

To run the demos, install the library and then launch them with Streamlit:

Single Image End-to-End Demo: streamlit run demos/single_image.py

References

If you found this work useful, cite us:

@misc{chan2023ic3,
  doi = {10.48550/ARXIV.2302.01328},
  url = {https://arxiv.org/abs/2302.01328},
  author = {Chan, David M. and Myers, Austin and Vijayanarasimhan, Sudheendra and Ross, David A. and Canny, John},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {IC3: Image Captioning by Committee Consensus},
  publisher = {arXiv},
  year = {2023},
  copyright = {arXiv.org perpetual, non-exclusive license}
}