ECHO960 / PKU-MMD

Codes for PKU-MMD dataset.
Apache License 2.0

PKU-MMD: A Large Scale Benchmark for Continuous Multi-Modal Human Action Understanding

[Spatial and Temporal Resolution Up Conversion Team, ICST, Peking University](http://www.icst.pku.edu.cn/struct)
This dataset is partially funded by Microsoft Research Asia, project ID FY17-RES-THEME-013.

![Teaser](Imgs/teaser.png)
Fig.1 The PKU Multi-Modality Dataset (PKU-MMD) is a large-scale multi-modality action detection dataset. It contains 2 phases; Phase #1 contains 51 action categories, performed by 66 distinct subjects in 3 camera views.

Abstract

PKU-MMD is a new large-scale benchmark for continuous multi-modality 3D human action understanding. It covers a wide range of complex human activities with well-annotated information. PKU-MMD contains 1076 long video sequences in 51 action categories, performed by 66 subjects in three camera views. It contains almost 20,000 action instances and 5.4 million frames in total. The dataset also provides multiple data modalities: RGB, depth, infrared (IR) and skeleton.

Cite

@article{liu2017pku, 
  title={PKU-MMD: A Large Scale Benchmark for Continuous Multi-Modal Human Action Understanding},
  author={Liu, Chunhui and Hu, Yueyu and Li, Yanghao and Song, Sijie and Liu, Jiaying},
  journal={ACM Multimedia workshop},
  year={2017}
}

Resources

Paper: ACM Multimedia workshop

Code: Evaluation protocol
Project Webpage: http://39.96.165.147/Projects/PKUMMD/PKU-MMD.html

Dataset Description

PKU-MMD is our new large-scale dataset focusing on action detection in long continuous sequences and on multi-modality action analysis. The dataset is captured with the Kinect v2 sensor.

Phase #1 contains 1076 long video sequences in 51 action categories, performed by 66 subjects in three camera views. It contains almost 20,000 action instances and 5.4 million frames in total. Each video lasts about 3~4 minutes (recorded at 30 FPS) and contains approximately 20 action instances. In total, the dataset comprises 5,312,580 frames (about 3,000 minutes) with 21,545 temporally localized actions.

We choose 51 action classes in total, divided into two parts: 41 daily actions (drinking, waving hand, putting on glasses, etc.) and 10 interaction actions (hugging, shaking hands, etc.). 66 distinct subjects were invited for our data collection. Each subject takes part in 4 daily action videos and 2 interaction action videos. Each video contains only one part of the actions, either daily actions or interaction actions. We design 54 sequences and divide the subjects into 9 groups, and each group randomly chooses 6 sequences to perform.
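As a quick sanity check on the Phase #1 totals quoted above, the frame count can be converted to minutes and averaged per video. The snippet below is only an illustrative sketch and not part of the released evaluation code; the function name `dataset_stats` and the dictionary keys are hypothetical, while the numeric constants are taken from the description.

```python
# Figures quoted in the dataset description (Phase #1).
TOTAL_FRAMES = 5_312_580
TOTAL_ACTIONS = 21_545
NUM_VIDEOS = 1076
FPS = 30  # recording frame rate stated in the description


def dataset_stats(total_frames, total_actions, num_videos, fps):
    """Derive duration and per-video averages from the headline totals.

    Hypothetical helper for illustration; not from the PKU-MMD code base.
    """
    total_minutes = total_frames / fps / 60
    return {
        "total_minutes": total_minutes,
        "minutes_per_video": total_minutes / num_videos,
        "actions_per_video": total_actions / num_videos,
    }


stats = dataset_stats(TOTAL_FRAMES, TOTAL_ACTIONS, NUM_VIDEOS, FPS)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

With these inputs the total comes to roughly 2,951 minutes, consistent with the rounded figure of 3,000 minutes, and the mean is about 20 action instances per video. The mean video length works out to a little under three minutes, so the "3~4 minutes" figure is best read as approximate.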

We provide 5 categories of resources: depth maps, RGB images, skeleton joints, infrared sequences, and RGB videos.

Data Format

More Samples

![Teaser](Imgs/overview.png)
Fig.2 From top to bottom, the four rows show the RGB, depth, skeleton and IR modalities, respectively.

![Teaser](Imgs/samples.png)
Fig.3 We collect 51 actions performed by 66 subjects, including both single-person actions and two-person interactions.

Last update: Oct 2017