facebookresearch / LaViLa

Code release for "Learning Video Representations from Large Language Models"
MIT License

Learning Video Representations from Large Language Models
Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar
CVPR 2023 (Highlight, acceptance rate≈2.5%)
arxiv | bibtex | colab | 🤗 demo | website

LaViLa (Language augmented Video Language Pretraining) is a new approach to learning video representations from Large Language Models (LLMs). We repurpose LLMs to be visually conditioned "Narrators" and use them to automatically generate video-language paired data. We then use this data to learn a video-language representation, outperforming prior work by large margins.

Sample Generations:

| Video | Generation 1 | Generation 2 |
| :---: | :---: | :---: |
| (bread-slicing clip) | so now we're going to slice the bread | now i'm going to do is just slice this up into a nice chunk and then we're going to place it on the plate |

Try out our Narrator to generate text descriptions for your own videos, or use the web demo here: Hugging Face Spaces

The resulting video-language model sets a new state-of-the-art on a number of popular video tasks!


Introduction and installation

LaViLa leverages Large Language Models (LLMs) as "NARRATOR"s (and "REPHRASER"s) to densely narrate long videos, and uses these narrations to train strong dual-encoder models.
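Densely narrating a long video amounts to splitting it into short, overlapping clips and captioning each one. A minimal sketch of such a clip schedule (the 4-second window and 1-second stride are illustrative assumptions, not LaViLa's exact settings):

```python
def dense_clip_windows(duration_s, window_s=4.0, stride_s=1.0):
    """Return (start, end) times covering a video with overlapping clips.

    window_s and stride_s are illustrative defaults, not the paper's exact
    values; each clip would be passed to the NARRATOR for captioning.
    """
    starts = []
    t = 0.0
    while t < duration_s:
        starts.append(t)
        t += stride_s
    return [(s, min(s + window_s, duration_s)) for s in starts]

# A 10-second video yields overlapping 4-second clips starting every second.
windows = dense_clip_windows(10.0)
print(windows[:3])  # [(0.0, 4.0), (1.0, 5.0), (2.0, 6.0)]
```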

See INSTALL.md to install this code.

NARRATOR

NARRATOR is a visually conditioned LLM that takes video frames as input and pseudo-labels the clip with narrations.
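To pseudo-label a clip with several diverse narrations, the NARRATOR samples from the LLM rather than decoding greedily (see Sec. 4.1 of the paper). The core of nucleus (top-p) filtering over a next-token distribution can be sketched as follows; the toy distribution and `top_p` value are illustrative:

```python
import numpy as np

def nucleus_filter(probs, top_p=0.95):
    """Keep the smallest set of tokens whose cumulative probability reaches
    top_p, zero out the rest, and renormalize. probs is a 1-D distribution."""
    order = np.argsort(probs)[::-1]       # tokens by descending probability
    csum = np.cumsum(probs[order])
    keep = csum - probs[order] < top_p    # include the token that crosses top_p
    filtered = np.zeros_like(probs)
    filtered[order[keep]] = probs[order[keep]]
    return filtered / filtered.sum()

# Toy distribution over 4 tokens: the unlikely tail token is dropped.
p = np.array([0.5, 0.3, 0.15, 0.05])
print(nucleus_filter(p, top_p=0.9))
```

Sampling from the filtered distribution several times per clip yields multiple candidate narrations.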

NARRATOR Demo

We provide some samples generated by our NARRATOR:

|  | Clip 1 | Clip 2 | Clip 3 |
| --- | --- | --- | --- |
| Human narration | C separates the yarn. | C lifts container. | C opterates the camera. |
| NARRATOR generation (a) | C stetches the thread with both hands. | C wipes the countertop with a sponge. | C takes a photo shot. |
| NARRATOR generation (b) | C pulls out the yarn with her right hand. | C moves the container. | A man X looks at the camera. |

Run the narrator demo using Colab (no GPU needed): Open In Colab
or on the web using 🤗 Spaces: Hugging Face Spaces (thanks to @nateraw!)

Since a free Colab account offers very limited RAM, if you'd like to run the demo with a larger model, please run ./demo_narrator.py locally. For more technical details, please refer to Sec. 4.1 of our paper.

# CPU mode
python demo_narrator.py [--video-path $TEST_VIDEO]

# GPU mode
python demo_narrator.py --cuda

Our narrator also works on third-person videos! Below are several examples generated by our NARRATOR pre-trained on HowTo100M Auto-Aligned (HTM-AA) and applied to some stock-footage video clips. Note that since the text corpus of HowTo100M consists of ASR transcriptions, the style of narration differs slightly from that of the ground-truth captions. However, the generated results are generally reasonable.

|  | Clip 1 | Clip 2 | Clip 3 |
| --- | --- | --- | --- |
| GT caption | Pastry chef cutting bread into slices during the preparation of a dessert, inside a kitchen. | Close-up shot of the hands of an experienced baker skillfully kneading bread dough. | Chef preparing a sauce in a blender, adding different ingredients while blending. |
| NARRATOR (a) | so now we're going to slice the bread | i'm gonna make a little hole in the middle of the dough here | all right let's blend this up |
| NARRATOR (b) | now i'm going to do is just slice this up into a nice chunk and then we're going to place it on the plate | you just keep kneading it | the last step to making this is to blend the ingredients in the food processor |

Below is a demo for 3rd-person videos.

python demo_narrator_3rd_person.py [--video-path $TEST_VIDEO] [--cuda]

Dual-Encoder

The dual-encoder model contains a video encoder and a text encoder. It learns a video-language representation from both human annotations and generated narrations using a CLIP-style contrastive loss.
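The contrastive objective matches each clip with its narration within a batch: embeddings from the two encoders are L2-normalized and a symmetric InfoNCE loss (as in CLIP) is applied to the pairwise similarity matrix. A minimal NumPy sketch, with random features standing in for the encoder outputs and an illustrative temperature:

```python
import numpy as np

def clip_contrastive_loss(video_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.
    Row i of each matrix is one video-narration pair."""
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = v @ t.T / temperature      # pairwise cosine similarities
    labels = np.arange(len(logits))     # positive pairs sit on the diagonal

    def log_softmax(x):
        return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

    loss_v2t = -log_softmax(logits)[labels, labels].mean()   # video -> text
    loss_t2v = -log_softmax(logits.T)[labels, labels].mean() # text -> video
    return (loss_v2t + loss_t2v) / 2

# Toy batch of 8 pairs with 16-dim features; matched pairs give a lower loss.
rng = np.random.default_rng(0)
loss = clip_contrastive_loss(rng.normal(size=(8, 16)), rng.normal(size=(8, 16)))
```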

License

The majority of LaViLa is licensed under the MIT License; however, portions of the project are available under separate license terms.

Citing LaViLa

@inproceedings{zhao2023lavila,
  title={Learning Video Representations from Large Language Models},
  author={Zhao, Yue and Misra, Ishan and Kr{\"a}henb{\"u}hl, Philipp and Girdhar, Rohit},
  booktitle={CVPR},
  year={2023}
}