Official GitHub repository of
[LVD-2M: A Long-take Video Dataset with Temporally Dense Captions]()
Tianwei Xiong1,*, Yuqing Wang1,*, Daquan Zhou2,†, Zhijie Lin2, Jiashi Feng2, Xihui Liu1,✉
1The University of Hong Kong, 2ByteDance
*Equal contribution. †Project lead. ✉Corresponding author.
NeurIPS 2024 Track Datasets and Benchmarks
[2024/10/15] The dataset, research paper, and project page are released!
LVD-2M is a dataset featuring:
We randomly sample 100 videos (YouTube source) from LVD-2M; users can download the videos and the annotation file.
We note that even a direct, non-cherry-picked random sample already shows decent quality.
We will remove video samples from our dataset/demonstration if you find them inappropriate. Please contact xiongt20 at gmail dot com with such requests.
We provide three splits of our video dataset according to their sources: YouTube, HDVG, and WebVid.
You can download the three files from the links below.
The meta records should be put in the following paths:
```
data/ytb_600k_720p.csv
data/hdvg_300k_720p.csv
data/webvid_1200k_336_short.csv
```
Each row in the CSV file corresponds to a video clip. The columns are:

- `raw_caption`: The captions generated by LLaVA-v1.6-next-34B. For long video clips, multiple captions separated by "Caption x:" are provided.
- `refined_caption`: The refined caption generated by Claude3-Haiku, merging the `raw_caption` into a consistent description of the whole video clip.
- `rewritten_caption`: The rewritten caption generated by LLaMA-v3.1-70B, condensing the `refined_caption` into a more concise, user-input style.
- `key`: The id of the video clip.
- `video_id`: The id of the YouTube video. Note that one YouTube video can have multiple video clips.
- `url`: The url of the video. For YouTube videos, it is the url of the source video the clip is from; for WebVid videos, it points directly to the video clip.
- `dataset_src`: The dataset the video clip is from. Values can be [hdvg, panda70m, internvid, webvid].
- `orig_caption`: The original caption of the video clip, given by its `dataset_src`.
- `total score`: The average optical flow score of the video clip.
- `span`: The starting and ending time of the video clip in the original video, for video clips from YouTube only.
- `video_time`: The length of the video clip.
- `orig_span`: (Trivial content) Special record for the HDVG data format; a result of HDVG cutting video clips further into smaller clips.
- `scene_cut`: (Trivial content) Special record for the HDVG data format.

To set up the environment:

```shell
conda create --name lvd2m python=3.9
conda activate lvd2m
# install ffmpeg
sudo apt-get install ffmpeg
pip install -r requirements.txt
```
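The per-clip metadata can be loaded with pandas for inspection. A minimal sketch, using a hypothetical in-memory sample row in place of the real files under `data/` (the column names follow the list above; the sample values are made up):

```python
# Sketch: load LVD-2M-style metadata and inspect one clip's caption.
# The sample row below is a hypothetical stand-in for data/ytb_600k_720p.csv.
import io
import pandas as pd

sample_csv = io.StringIO(
    "key,video_id,url,dataset_src,refined_caption,span,video_time\n"
    "clip_0001,abc123,https://www.youtube.com/watch?v=abc123,panda70m,"
    "\"A surfer rides a long wave at sunset.\",\"[12.0, 34.5]\",22.5\n"
)

df = pd.read_csv(sample_csv)
# Each row is one long-take video clip; `span` gives its [start, end] time
# inside the source video (YouTube splits only).
clip = df.iloc[0]
print(clip["refined_caption"])
```

For the real splits, replace the in-memory buffer with the CSV path, e.g. `pd.read_csv("data/ytb_600k_720p.csv")`.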
To download videos from a CSV file, run the following command:

```shell
${PYTHON_PATH} \
    download_videos_release.py \
    --bsz=96 \
    --resolution=720p \
    --node_num=1 \
    --node_id=0 \
    --process_num=96 \
    --workdir=cache/download_cache \
    --out_dir="dataset/videos" \
    --dataset_key="hdvg" \
    --multiprocess
```
Your Google accounts may be banned or suspended for making too many requests, so we suggest using multiple accounts. Set `ACCOUNT_NUM` in `download_videos_release.py` to specify the number of accounts.
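The `--node_num`, `--node_id`, and `--process_num` flags suggest a shard-by-row scheme for parallel downloading. The sketch below is an illustration of one plausible partitioning (it is not taken from `download_videos_release.py`): rows are first split across nodes, then round-robined over processes on each node.

```python
# Illustrative sketch (not the repo's actual code) of how metadata rows
# could be sharded so each node/process downloads a disjoint set of clips.
def shard_rows(num_rows: int, node_num: int, node_id: int, process_num: int):
    """Yield (process_id, row_index) pairs handled on this node."""
    for row in range(num_rows):
        if row % node_num == node_id:       # rows assigned to this node
            local = row // node_num         # node-local index
            yield local % process_num, row  # round-robin over processes

# Example: 10 metadata rows, 2 nodes, this is node 0, 3 processes per node.
assignments = list(shard_rows(10, node_num=2, node_id=0, process_num=3))
```

Because the shards are disjoint, each node can run the download command independently with its own `--node_id` without duplicating work.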
The video data is collected from publicly available resources. The license of this dataset is the same as the license of HD-VILA.
Here we list the projects that inspired and helped us build LVD-2M.
```bibtex
@article{xiong2024lvd2m,
  title={LVD-2M: A Long-take Video Dataset with Temporally Dense Captions},
  author={Tianwei Xiong and Yuqing Wang and Daquan Zhou and Zhijie Lin and Jiashi Feng and Xihui Liu},
  year={2024},
  journal={arXiv preprint arXiv:2410.10816}
}
```