Learning Video Representations from Large Language Models
Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar
CVPR 2023 (Highlight, acceptance rate≈2.5%)
arxiv | bibtex | colab | 🤗 demo | website
LaViLa (Language augmented Video Language Pretraining) is a new approach to learning video representations from Large Language Models (LLMs). We repurpose LLMs to be visually conditioned "Narrators", and use them to automatically generate video-language paired data. We then use this data to learn a video-language representation, outperforming prior work by large margins.
Sample Generations:
| Video | Generation 1 | Generation 2 |
| --- | --- | --- |
| (video) | so now we're going to slice the bread | now i'm going to do is just slice this up into a nice chunk and then we're going to place it on the plate |
Try out our Narrator to generate text descriptions for your own videos! You can also try out a web demo here:
The resulting video-language model sets a new state-of-the-art on a number of popular video tasks!
LaViLa leverages Large Language Models (LLMs) as "NARRATOR"s (and "REPHRASER"s) to densely narrate long videos, and uses these narrations to train strong dual-encoder models.
See INSTALL.md to install this code.
NARRATOR is a visually conditioned LLM that takes video frames as input and pseudo-labels video clips with narrations.
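To densely narrate a long video, the NARRATOR is applied to overlapping fixed-length clips. A minimal sketch of such a sliding-window schedule (the 4-second clip length and 2-second stride here are illustrative defaults, not necessarily the values used in the paper):

```python
def dense_clip_windows(video_len_sec, clip_len=4.0, stride=2.0):
    """Yield (start, end) times of overlapping clips covering a long video.

    Each window would be decoded into frames and fed to the NARRATOR,
    which generates one or more candidate narrations per clip.
    """
    t = 0.0
    while t + clip_len <= video_len_sec:
        yield (t, t + clip_len)
        t += stride

# e.g. a 10-second video yields windows (0,4), (2,6), (4,8), (6,10)
```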
We provide some generated samples by our NARRATOR:
| | Example 1 | Example 2 | Example 3 |
| --- | --- | --- | --- |
| Human narration | C separates the yarn. | C lifts container. | C operates the camera. |
| NARRATOR generation (a) | C stretches the thread with both hands. | C wipes the countertop with a sponge. | C takes a photo shot. |
| NARRATOR generation (b) | C pulls out the yarn with her right hand. | C moves the container. | A man X looks at the camera. |
Run the narrator demo using Colab (no GPU needed):
or on the web using 🤗 Spaces: (thanks to @nateraw!)
Since a free Colab account offers very limited RAM, please run ./demo_narrator.py locally if you'd like to run the demo with a larger model. For more technical details, please refer to Sec. 4.1 in our paper.
```bash
# CPU mode
python demo_narrator.py [--video-path $TEST_VIDEO]

# GPU mode
python demo_narrator.py --cuda
```
Our narrator also works on third-person videos! Below are several examples generated by our NARRATOR, which is pre-trained on HowTo100M Auto-Aligned (HTM-AA) and applied to some stock footage video clips. Note that since the text corpus in HowTo100M consists of ASR transcriptions, the narration style differs slightly from that of ground-truth captions. However, the generated results are generally reasonable.
| | Video 1 | Video 2 | Video 3 |
| --- | --- | --- | --- |
| GT caption | Pastry chef cutting bread into slices during the preparation of a dessert, inside a kitchen. | Close-up shot of the hands of an experienced baker skillfully kneading bread dough. | Chef preparing a sauce in a blender, adding different ingredients while blending. |
| NARRATOR (a) | so now we're going to slice the bread | i'm gonna make a little hole in the middle of the dough here | all right let's blend this up |
| NARRATOR (b) | now i'm going to do is just slice this up into a nice chunk and then we're going to place it on the plate | you just keep kneading it | the last step to making this is to blend the ingredients in the food processor |
Below is a demo for 3rd-person videos.
```bash
python demo_narrator_3rd_person.py [--video-path $TEST_VIDEO] [--cuda]
```
The dual-encoder model contains a video encoder and a text encoder. It learns a video-language representation from both human annotations and generated narrations using a contrastive loss, as in CLIP.
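The CLIP-style symmetric contrastive (InfoNCE) objective can be sketched as follows. This is an illustrative toy version in plain Python (the actual training code uses batched tensor operations); it matches video i to text i within a batch and penalizes high similarity to all other pairings:

```python
import math

def clip_style_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over L2-normalized embeddings (lists of lists)."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    V = [normalize(v) for v in video_embs]
    T = [normalize(t) for t in text_embs]
    n = len(V)
    # similarity logits: cosine similarities scaled by temperature
    logits = [[sum(a * b for a, b in zip(V[i], T[j])) / temperature
               for j in range(n)] for i in range(n)]

    def ce_row(row, target):
        # numerically stable cross-entropy over one row of logits
        m = max(row)
        denom = sum(math.exp(x - m) for x in row)
        return -(row[target] - m - math.log(denom))

    # video -> text direction (rows) and text -> video direction (columns)
    loss_v2t = sum(ce_row(logits[i], i) for i in range(n)) / n
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2v = sum(ce_row(cols[j], j) for j in range(n)) / n
    return 0.5 * (loss_v2t + loss_t2v)
```

With perfectly aligned pairs the loss approaches zero; shuffling the pairing drives it up, which is what pushes matched video and text embeddings together.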
LaViLa's dual-encoder achieves excellent zero-shot performance on a wide range of egocentric benchmarks, outperforming previous state-of-the-art video-language pretraining methods by a large margin.
^ The two numbers are obtained by using different number of frames as input (4-frame and 16-frame).
^^ We use the checkpoints released by EgoVLP and convert them to be compatible with this codebase. Also note that our reproduced numbers are better than the reported numbers, especially on EK-100 MIR since we evaluate on raw videos directly (for more details, check out Appendix F & Table 10 in our paper).
For details on how to get the numbers, please refer to MODEL_ZOO.md.
Once fine-tuned on a downstream dataset, LaViLa's dual-encoder also achieves state-of-the-art results on it. We show some key results below.
For details on how to fine-tune the pre-trained dual-encoder on down-stream datasets, please refer to MODEL_ZOO.md.
The majority of LaViLa is licensed under an MIT License; however, portions of the project are available under separate license terms:
https://github.com/EGO4D/episodic-memory is licensed under the MIT license.
The videos of cutting a loaf, kneading a dough, and preparing a sauce in a blender are licensed under the Mixkit Stock Video Free License.
```bibtex
@inproceedings{zhao2023lavila,
  title={Learning Video Representations from Large Language Models},
  author={Zhao, Yue and Misra, Ishan and Kr{\"a}henb{\"u}hl, Philipp and Girdhar, Rohit},
  booktitle={CVPR},
  year={2023}
}
```