Official GitHub repository of
[LVD-2M: A Long-take Video Dataset with Temporally Dense Captions]()
Tianwei Xiong1,*, Yuqing Wang1,*, Daquan Zhou2,†, Zhijie Lin2, Jiashi Feng2, Xihui Liu1,✉
1The University of Hong Kong, 2ByteDance
*Equal contribution. †Project lead. ✉Corresponding author.
NeurIPS 2024 Track Datasets and Benchmarks
[2024/10/15] The dataset, research paper, and project page are released!
LVD-2M is a dataset featuring:
We randomly sample 100 videos (YouTube source) from LVD-2M; users can download the videos and the annotation file.
We note that even a direct, non-cherry-picked random sample already shows decent quality.
We will remove video samples from our dataset/demonstration if you find them inappropriate. Please contact xiongt20 at gmail dot com with such requests.
We provide three splits of our video dataset according to their sources: YouTube, HDVG, and WebVid.
You can download the three files from the links below.
The meta records should be put in the following paths:
```
data/ytb_600k_720p.csv
data/hdvg_300k_720p.csv
data/webvid_1200k_336_short.csv
```
Each row in the CSV file corresponds to a video clip. The columns are:

- `raw_caption`: The captions generated by LLaVA-v1.6-next-34B. For long video clips, multiple captions separated by "Caption x:" are provided.
- `refined_caption`: The refined caption generated by Claude3-Haiku, merging the `raw_caption` into a consistent description of the whole video clip.
- `rewritten_caption`: The rewritten caption generated by LLaMA-v3.1-70B, condensing the `refined_caption` into a more concise, user-input style.
- `key`: The id of the video clip.
- `video_id`: The id of the YouTube video. Note that one YouTube video can have multiple video clips.
- `url`: The url of the video. For YouTube videos, it is the url of the source video the clip is from; for WebVid videos, it points directly to the video clip.
- `dataset_src`: The dataset the video clip is from. Values can be [hdvg, panda70m, internvid, webvid].
- `orig_caption`: The original caption of the video clip, given by its `dataset_src`.
- `total score`: The average optical flow score of the video clip.
- `span`: The starting and ending time of the video clip in the original video, for video clips from YouTube only.
- `video_time`: The length of the video clip.
- `orig_span`: (Trivial content) Special record for the HDVG data format; a result of HDVG cutting video clips further into smaller clips.
- `scene_cut`: (Trivial content) Special record for the HDVG data format.

To set up the environment:

```shell
conda create --name lvd2m python=3.9
conda activate lvd2m
# install ffmpeg
sudo apt-get install ffmpeg
pip install -r requirements.txt
```
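The per-clip metadata can be loaded with pandas for inspection. A minimal sketch, using a hypothetical in-memory sample row in place of the real files under `data/` (the column names follow the list above; the sample values are made up):

```python
# Sketch: load LVD-2M-style metadata and inspect one clip's caption.
# The sample row below is a hypothetical stand-in for data/ytb_600k_720p.csv.
import io
import pandas as pd

sample_csv = io.StringIO(
    "key,video_id,url,dataset_src,refined_caption,span,video_time\n"
    "clip_0001,abc123,https://www.youtube.com/watch?v=abc123,panda70m,"
    "\"A surfer rides a long wave at sunset.\",\"[12.0, 34.5]\",22.5\n"
)

df = pd.read_csv(sample_csv)
# Each row is one long-take video clip; `span` gives its [start, end] time
# inside the source video (YouTube splits only).
clip = df.iloc[0]
print(clip["refined_caption"])
```

For the real splits, replace the in-memory buffer with the CSV path, e.g. `pd.read_csv("data/ytb_600k_720p.csv")`.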
To download videos from a CSV file, run the following command:

```shell
${PYTHON_PATH} \
    download_videos_release.py \
    --bsz=96 \
    --resolution=720p \
    --node_num=1 \
    --node_id=0 \
    --process_num=96 \
    --workdir=cache/download_cache \
    --out_dir="dataset/videos" \
    --dataset_key="hdvg" \
    --multiprocess
```
Your Google accounts may be banned or suspended for making too many requests, so we suggest using multiple accounts. Set `ACCOUNT_NUM` in `download_videos_release.py` to specify the number of accounts.
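The `--node_num`, `--node_id`, and `--process_num` flags suggest a shard-by-row scheme for parallel downloading. The sketch below is an illustration of one plausible partitioning (it is not taken from `download_videos_release.py`): rows are first split across nodes, then round-robined over processes on each node.

```python
# Illustrative sketch (not the repo's actual code) of how metadata rows
# could be sharded so each node/process downloads a disjoint set of clips.
def shard_rows(num_rows: int, node_num: int, node_id: int, process_num: int):
    """Yield (process_id, row_index) pairs handled on this node."""
    for row in range(num_rows):
        if row % node_num == node_id:       # rows assigned to this node
            local = row // node_num         # node-local index
            yield local % process_num, row  # round-robin over processes

# Example: 10 metadata rows, 2 nodes, this is node 0, 3 processes per node.
assignments = list(shard_rows(10, node_num=2, node_id=0, process_num=3))
```

Because the shards are disjoint, each node can run the download command independently with its own `--node_id` without duplicating work.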
The video data is collected from publicly available resources. The license of this dataset is the same as the license of HD-VILA.
Here we list the projects that inspired and helped us build LVD-2M.
```bibtex
@article{xiong2024lvd2m,
  title={LVD-2M: A Long-take Video Dataset with Temporally Dense Captions},
  author={Tianwei Xiong and Yuqing Wang and Daquan Zhou and Zhijie Lin and Jiashi Feng and Xihui Liu},
  year={2024},
  journal={arXiv preprint arXiv:2410.10816}
}
```