Method | MVBench | VcgBench Avg | Correct | Detail | Context | Temporal | Consist | MSVD QA | MSRVTT QA | ANet QA |
---|---|---|---|---|---|---|---|---|---|---|
VideoLLaMA | 34.1 | 1.96 | 2.18 | 2.16 | 1.82 | 1.79 | 1.98 | 51.6 | 29.6 | 12.4 |
LLaMA-Adapter | 31.7 | 2.03 | 2.32 | 2.30 | 1.98 | 2.15 | 2.16 | 54.9 | 43.8 | 34.2 |
VideoChat | 35.5 | 2.23 | 2.50 | 2.53 | 1.94 | 2.24 | 2.29 | 56.3 | 45.0 | 26.5 |
VideoChatGPT | 32.7 | 2.38 | 2.40 | 2.52 | 2.62 | 1.98 | 2.37 | 64.9 | 49.3 | 35.7 |
MovieChat | - | 2.76 | 2.93 | 3.01 | 2.24 | 2.42 | 2.67 | 74.2 | 52.7 | 45.7 |
Vista-LLaMA | - | 2.44 | 2.64 | 3.18 | 2.26 | 2.31 | 2.57 | 65.3 | 60.5 | 48.3 |
LLaMA-VID | - | 2.89 | 2.96 | 3.00 | 3.53 | 2.46 | 2.51 | 69.7 | 57.7 | 47.4 |
Chat-UniVi | - | 2.99 | 2.89 | 2.91 | 3.46 | 2.89 | 2.81 | 65.0 | 54.6 | 45.8 |
VideoChat2 | 51.1 | 2.98 | 3.02 | 2.88 | 3.51 | 2.66 | 2.81 | 70.0 | 54.1 | 49.1 |
ST-LLM | 54.9 | 3.15 | 3.23 | 3.05 | 3.74 | 2.93 | 2.81 | 74.6 | 63.2 | 50.9 |
Please download the conversation weights from here and follow the installation instructions below first. Then run the Gradio demo:
```bash
CUDA_VISIBLE_DEVICES=0 python3 demo_gradio.py --ckpt-path /path/to/STLLM_conversation_weight
```
We have also prepared a local script, demo.py, that is easy to modify; a sketch of how to launch it is shown below.
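As a minimal sketch, the local script can presumably be launched in the same way as the Gradio demo; the `--ckpt-path` flag here is an assumption carried over from demo_gradio.py, so check demo.py's argument parser for the exact interface:

```bash
# Hypothetical invocation: the --ckpt-path flag is assumed to mirror demo_gradio.py.
CUDA_VISIBLE_DEVICES=0 python3 demo.py --ckpt-path /path/to/STLLM_conversation_weight
```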
Video Description: even for challenging videos with complex scene changes, ST-LLM can accurately describe the full content.
Action Identification: ST-LLM can accurately and comprehensively describe the actions occurring in the video.
Clone our repository, then create a Python environment and activate it with the following commands:
```bash
git clone https://github.com/farewellthree/ST-LLM.git
cd ST-LLM
conda create --name stllm python=3.10
conda activate stllm
pip install -r requirement.txt
```
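After installation, an optional sanity check can confirm that the environment sees your GPU. This is only a suggestion, assuming PyTorch is pulled in by requirement.txt; it is not part of the official setup:

```bash
# Optional: print the installed torch version and whether CUDA is visible
# (assumes PyTorch is among the installed requirements).
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```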
Instructions for data preparation, training, and evaluation can be found in trainval.md.
If you find the code and paper useful for your research, please consider starring this repo and citing our paper:
```bibtex
@article{liu2023one,
  title={One for all: Video conversation is feasible without video instruction tuning},
  author={Liu, Ruyang and Li, Chen and Ge, Yixiao and Shan, Ying and Li, Thomas H and Li, Ge},
  journal={arXiv preprint arXiv:2309.15785},
  year={2023}
}
@article{liu2024stllm,
  title={ST-LLM: Large Language Models Are Effective Temporal Learners},
  author={Liu, Ruyang and Li, Chen and Tang, Haoran and Ge, Yixiao and Shan, Ying and Li, Ge},
  journal={arXiv preprint arXiv:2404.00308},
  year={2024}
}
```