bruceyo / MMNet

This repository holds the codebase, dataset and models for the T-PAMI 2022 work:

MMNet: A Model-based Multimodal Network for Human Action Recognition in RGB-D Videos

Bruce X.B. Yu, Yan Liu, Xiang Zhang, Sheng-hua Zhong, Keith C.C. Chan


Human action recognition (HAR) in RGB-D videos has been widely investigated since the release of affordable depth sensors. Currently, unimodal approaches (e.g., skeleton-based and RGB video-based) have realized substantial improvements with increasingly larger datasets. However, model-based data fusion has seldom been investigated at the model level specifically. In this paper, we propose a model-based multimodal network (MMNet) that fuses skeleton and RGB modalities via a model-based approach. The objective is to improve ensemble recognition accuracy by effectively applying mutually complementary information from different data modalities. For the model-based fusion scheme, we use a spatiotemporal graph convolution network for the skeleton modality to learn attention weights that will be transferred to the network of the RGB modality. The whole model can be either individually or uniformly trained by the back-propagation algorithm in an end-to-end manner. Extensive experiments are conducted on four benchmark datasets: NTU RGB+D 60, NTU RGB+D 120, PKU-MMD, and Northwestern-UCLA Multiview. Upon aggregating the results of multiple modalities, our method is found to consistently outperform state-of-the-art approaches; thus, the proposed MMNet can effectively capture mutually complementary features in different RGB-D video modalities and provide more discriminative features for HAR.



cd torchlight; python install; cd ..

Data Preparation



NTU RGB+D can be downloaded from their website. The 3D skeletons (5.75 GB) modality and the RGB modality are required in our experiments. After downloading, use this command to build the database for training or evaluation:

python tools/ --data_path <path to nturgbd+d_skeletons>

where <path to nturgbd+d_skeletons> points to the 3D skeletons modality of the NTU RGB+D dataset you downloaded.

Since the processed data is quite large (around 40.7 GB in total), we do not provide it here.
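As a rough illustration of how samples can be assigned to the xsub and xview splits used later in this README, the sketch below parses the NTU RGB+D sample naming scheme (setup, camera, performer, replication, action). The function names are hypothetical and this is not the repository's generation script; the training subject and camera IDs are those of the NTU RGB+D 60 benchmark and should be checked against the script you actually run.

```python
import re

# NTU RGB+D sample names look like "S001C002P003R002A013":
# setup, camera, performer (subject), replication, action.
NAME_RE = re.compile(r"S(\d{3})C(\d{3})P(\d{3})R(\d{3})A(\d{3})")

# Training IDs of the NTU RGB+D 60 benchmark protocols
# (assumption: verify against the generation script you use).
XSUB_TRAIN_SUBJECTS = {1, 2, 4, 5, 8, 9, 13, 14, 15, 16,
                       17, 18, 19, 25, 27, 28, 31, 34, 35, 38}
XVIEW_TRAIN_CAMERAS = {2, 3}


def parse_sample_name(filename):
    """Return the five integer IDs encoded in an NTU RGB+D file name."""
    match = NAME_RE.search(filename)
    if match is None:
        raise ValueError(f"not an NTU RGB+D sample name: {filename!r}")
    setup, camera, subject, replication, action = (int(g) for g in match.groups())
    return {"setup": setup, "camera": camera, "subject": subject,
            "replication": replication, "action": action}


def split_of(filename, protocol):
    """Return 'train' or 'val' for a sample under the given protocol."""
    info = parse_sample_name(filename)
    if protocol == "xsub":
        is_train = info["subject"] in XSUB_TRAIN_SUBJECTS
    elif protocol == "xview":
        is_train = info["camera"] in XVIEW_TRAIN_CAMERAS
    else:
        raise ValueError(f"unknown protocol: {protocol!r}")
    return "train" if is_train else "val"
```

For example, `split_of("S001C002P003R002A013.skeleton", "xview")` places the sample in the training split because it was recorded by camera 2.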


NTU RGB+D 120 can be downloaded from their website. The 3D skeletons (4.45 GB) modality and the RGB modality are required in our experiments. After downloading, use this command to build the database for training or evaluation:

python tools/ --data_path <path to nturgbd+d_skeletons>

where <path to nturgbd+d_skeletons> points to the 3D skeletons modality of the NTU RGB+D 120 dataset you downloaded.

Since the processed data is quite large (around 82 GB in total), we do not provide it here.


The dataset can be found at PKU-MMD. PKU-MMD is a large action recognition dataset that contains 1076 long video sequences in 51 action categories, performed by 66 subjects in three camera views. It contains almost 20,000 action instances and 5.4 million frames in total. We split the 3D skeleton modality into separate files, one per action repetition, with the command:

python tools/utils/
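To make the splitting step concrete, here is a minimal sketch of cutting one long skeleton sequence into per-action segments. It assumes each label line has the form `label,start,end,confidence` and treats `start` and `end` as a half-open frame range; both the field layout and the indexing convention are assumptions to verify against your copy of the dataset, and the function names are hypothetical rather than those of the repository script.

```python
def parse_label_lines(lines):
    """Parse 'label,start,end,confidence' rows into (label, start, end) tuples."""
    segments = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(",")
        label, start, end = int(fields[0]), int(fields[1]), int(fields[2])
        segments.append((label, start, end))
    return segments


def cut_segments(frames, segments):
    """Slice a per-frame sequence into one clip per labelled action.

    Assumption: frames[start:end] is the intended range (half-open, 0-based).
    """
    clips = []
    for label, start, end in segments:
        start = max(0, start)
        end = min(len(frames), end)
        if start < end:
            clips.append((label, frames[start:end]))
    return clips
```

Each returned clip can then be written to its own file so that the database builder sees one action repetition per file.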

After that, this command should be used to build the database for training or evaluation:

python tools/ --data_path <path to pku_mmd_skeletons>

where <path to pku_mmd_skeletons> points to the 3D skeletons modality of the PKU-MMD dataset that you processed with the above command.

For evaluation, the processed data (val_data and val_label) are available from GoogleDrive. Please manually put them in the folder ./data/PKU_MMD.

Northwestern-UCLA Multiview

The Multiview 3D event dataset was captured by Wangjian and Xiaohan Nie at UCLA. It contains RGB, depth and human skeleton data captured simultaneously by three Kinect cameras. This dataset includes 10 action categories: pick up with one hand, pick up with two hands, drop trash, walk around, sit down, stand up, donning, doffing, throw, carry. Each action is performed by 10 actors. This dataset contains data taken from a variety of viewpoints.

The dataset can be found in part-1, part-2, part-3, part-4, part-5, part-6, part-7, part-8, part-9, part-10, part-11, part-12, part-13, part-14, part-15, part-16.

The RGB videos can be downloaded from: RGB videos. They are used to generate 2D skeleton data with OpenPose. The reformatted skeleton data (multiview_action_skeleton) for data preparation is available via Google Drive.

2D Skeleton Retrieval from the RGB Video Input

After installing the OpenPose tool, run

sh tools/openpose_skeleton_retrieval/2D_Retrieve_<dataset>.sh

where <dataset> must be ntu_rgbd60, ntu_rgbd120, pku_mmd or nucla, depending on the dataset you want to use. We also provide the retrieved OpenPose 2D skeleton data for all four datasets, which can be downloaded from GoogleDrive (OpenPose NTU RGB+D 60), GoogleDrive (OpenPose NTU RGB+D 120), GoogleDrive (OpenPose PKU-MMD) and GoogleDrive (OpenPose N-UCLA Multiview).
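If you want to inspect the retrieved 2D skeletons yourself, the sketch below reads one OpenPose per-frame JSON document. It relies on the `people` list and the flat `pose_keypoints_2d` array (x, y, confidence triples) written by OpenPose; picking the person with the highest summed confidence is an illustrative choice, not necessarily what this repository does, and the function names are hypothetical.

```python
import json


def load_openpose_people(json_text):
    """Return, for each detected person, a list of (x, y, confidence) joints."""
    doc = json.loads(json_text)
    people = []
    for person in doc.get("people", []):
        flat = person.get("pose_keypoints_2d", [])
        usable = len(flat) - len(flat) % 3  # ignore an incomplete trailing triple
        joints = [tuple(flat[i:i + 3]) for i in range(0, usable, 3)]
        people.append(joints)
    return people


def most_confident_person(people):
    """Pick the person whose joints have the largest total confidence."""
    if not people:
        return None
    return max(people, key=lambda joints: sum(c for _, _, c in joints))
```

A frame with no detections yields an empty list, so downstream code should handle `None` from `most_confident_person`.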

Prepare the RGB Modality

We provide the MATLAB code we used to convert RGB videos to individual frames. You can also write your own implementation in another programming language. To use our code, adapt the scripts in folder tools/rgb_video_2_frames to your environment by replacing the folder parameters (Lines 70 and 76 for the NTU RGB+D datasets, and Lines 41 and 47 for PKU-MMD). The Northwestern-UCLA Multiview dataset already provides RGB frames.
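For readers without MATLAB, one possible alternative is to extract frames with the ffmpeg command-line tool. The sketch below only builds and optionally runs such a command; it assumes ffmpeg is installed, and the `%05d.jpg` naming pattern and the function names are hypothetical, so match the frame names to whatever the later steps of this repository expect.

```python
import os
import subprocess


def build_frame_command(video_path, out_dir, pattern="%05d.jpg"):
    """Build an ffmpeg command that writes every frame of a video as an image."""
    return ["ffmpeg", "-i", video_path, os.path.join(out_dir, pattern)]


def extract_frames(video_path, out_dir):
    """Create the output folder and run ffmpeg (requires ffmpeg on PATH)."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(build_frame_command(video_path, out_dir), check=True)
```

Calling `extract_frames` once per video, with one output folder per sample, produces the per-frame images.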

Generate Region of Interest

python tools/data_gen/gen_fivefs_<dataset>

where the <dataset> must be ntu_rgbd or pku_mmd, depending on the dataset you want to use.

The processed ROI of NTU RGB+D is available from GoogleDrive; the processed ROI of PKU-MMD is available from GoogleDrive.
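The general idea of building a region of interest from skeleton guidance can be sketched as sampling a few frames over time and cropping fixed-size patches around selected 2D joints. The code below is a simplified illustration under that assumption, not the repository's generation script: the number of sampled frames, the patch size and the function names are all hypothetical.

```python
def sample_frame_indices(num_frames, num_samples=5):
    """Pick the centre frame of each of num_samples equal temporal segments."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    return [min(num_frames - 1, int((i + 0.5) * num_frames / num_samples))
            for i in range(num_samples)]


def crop_box(x, y, size, width, height):
    """Return (left, top, right, bottom) of a size x size patch centred on a
    joint at (x, y), shifted if needed so it stays inside the image."""
    if size > width or size > height:
        raise ValueError("patch is larger than the image")
    half = size // 2
    left = max(0, min(int(round(x)) - half, width - size))
    top = max(0, min(int(round(y)) - half, height - size))
    return left, top, left + size, top + size
```

Shifting the box instead of shrinking it keeps every patch the same size, which makes it straightforward to tile the patches into a single ROI image.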

Testing Pretrained Models

You may download the trained models reported in the paper via GoogleDrive and put them in the folder models. You may also download the results reported in the paper via GoogleDrive and put them in the folder results.

Evaluate on NTU RGB+D 60

For evaluation in NTU RGB+D 60, run

python recognition -c config/ntu60_<evaluation protocol>/test_rgb_fused.yaml

where <evaluation protocol> is the evaluation protocol e.g., xsub and xview.

Check the ensemble:

python ./ensemble/ --protocol <evaluation protocol>

Evaluate on NTU RGB+D 120

For evaluation in NTU RGB+D 120, run

python recognition -c config/ntu120_<evaluation protocol>/test_rgb_fused.yaml

where <evaluation protocol> is the evaluation protocol e.g., xsub and xset.

Check the ensemble:

python ./ensemble/ --protocol <evaluation protocol>

Evaluate on PKU-MMD

For evaluation in PKU-MMD, run

python recognition -c config/pku_<evaluation protocol>/test_rgb_fused.yaml

where <evaluation protocol> is the evaluation protocol e.g., xsub and xview.

Check the ensemble:

python ./ensemble/ --protocol <evaluation protocol>

Evaluate on Northwestern-UCLA Multiview

For evaluation in Northwestern-UCLA Multiview, run

python recognition -c config/nucla_<evaluation protocol>/test_rgb_fused.yaml

where <evaluation protocol> is the evaluation protocol e.g., 123, 132, and 231.

Check the ensemble:

python ./ensemble/ --protocol <evaluation protocol>


To train a new MMNet, you need to train submodels for three inputs: skeleton joint, skeleton bone, and RGB video.
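For readers unfamiliar with the joint and bone inputs: in skeleton-based recognition a bone is commonly represented as the vector from a parent joint to its child joint. The sketch below illustrates that common convention; the parent table is a toy example and the function name is hypothetical, so it should not be read as the exact preprocessing used in this repository.

```python
def joints_to_bones(joints, parent_of):
    """Convert joint coordinates of one frame into bone vectors.

    joints:    list of coordinate tuples, one per joint.
    parent_of: parent_of[j] is the index of joint j's parent; a root joint
               points to itself and therefore gets a zero vector.
    """
    bones = []
    for j, joint in enumerate(joints):
        parent = joints[parent_of[j]]
        bones.append(tuple(c - p for c, p in zip(joint, parent)))
    return bones
```

The bone stream has the same shape as the joint stream, so the same network architecture can be trained on either input.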

For skeleton joint, run

python recognition -c config/<dataset>/train_joint.yaml [--work_dir <work folder>]

For skeleton bone, run

python recognition -c config/<dataset>/train_bone.yaml [--work_dir <work folder>]

For RGB video, run

python recognition -c config/<dataset>/train_rgb_fused.yaml [--work_dir <work folder>]

where <dataset> must be ntu60_xsub, ntu60_xview, ntu120_xsub, ntu120_xset, pku_xsub, pku_xview, nucla_123, nucla_132 or nucla_231, depending on the dataset you want to use. The training results, including model weights, configurations and logging files, will be saved under ./work_dir by default, or under <work folder> if you specify it.

You can modify the training parameters such as work_dir, batch_size, step, base_lr and device in the command line or configuration files. The order of priority is: command line > config file > default parameter. For more information, use -h.
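The priority order described above can be pictured with the following sketch, which layers command-line values over config-file values over defaults. It is not the repository's argument parser: the default values shown (other than ./work_dir) are placeholders, the config file is represented as an already-loaded dictionary, and the function name is hypothetical.

```python
import argparse

# Placeholder defaults for illustration only.
DEFAULTS = {"work_dir": "./work_dir", "batch_size": 64, "base_lr": 0.1}


def resolve_settings(argv, config_file_values):
    """Merge settings with priority: command line > config file > defaults."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--work_dir")
    parser.add_argument("--batch_size", type=int)
    parser.add_argument("--base_lr", type=float)
    parsed = vars(parser.parse_args(argv))
    # Keep only the options the user actually passed on the command line.
    cli_values = {key: value for key, value in parsed.items() if value is not None}

    settings = dict(DEFAULTS)            # lowest priority
    settings.update(config_file_values)  # overrides defaults
    settings.update(cli_values)          # highest priority
    return settings
```

With a config file that sets `batch_size: 32` and a command line that passes `--batch_size 8`, the resolved batch size is 8, while options given in neither place keep their defaults.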


Finally, custom model evaluation can be performed with the following commands.

For skeleton joint, run

python recognition -c config/<dataset>/test_joint.yaml [--work_dir <work folder>]

For skeleton bone, run

python recognition -c config/<dataset>/test_bone.yaml [--work_dir <work folder>]

For RGB video, run

python recognition -c config/<dataset>/test_rgb_fused.yaml --weights <path to model weights>

where the <dataset> must be ntu60_xsub, ntu60_xview, ntu120_xsub, ntu120_xset, pku_xsub, pku_xview, nucla_123, nucla_132 or nucla_231, depending on the dataset you want to use.

Ensemble Results in the Paper

After obtaining the predictions from the skeleton joint, skeleton bone, and RGB video inputs, we can get the ensemble result by aggregating them with the command:

python ./ensemble/ensemble_<dataset>.py --protocol <evaluation protocol>

where <dataset> is the name of a dataset, e.g., ntu60, ntu120, pku, and nucla; <evaluation protocol> is the evaluation protocol provided by the corresponding dataset, e.g., xsub and xview for NTU RGB+D.
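As a rough picture of score-level aggregation, the sketch below sums the per-class scores of several models for each sample and takes the arg-max as the ensemble prediction. It assumes each model's output is a mapping from sample name to a list of class scores; the optional weights, the data layout and the function names are assumptions for illustration, and the scripts in ./ensemble may combine scores differently.

```python
def ensemble_predictions(score_sets, weights=None):
    """Weighted sum of per-class scores across models, then arg-max per sample."""
    if weights is None:
        weights = [1.0] * len(score_sets)
    predictions = {}
    for name in score_sets[0]:
        num_classes = len(score_sets[0][name])
        total = [0.0] * num_classes
        for weight, scores in zip(weights, score_sets):
            for c in range(num_classes):
                total[c] += weight * scores[name][c]
        predictions[name] = max(range(num_classes), key=lambda c: total[c])
    return predictions


def accuracy(predictions, labels):
    """Fraction of samples whose predicted class matches the label."""
    correct = sum(1 for name, label in labels.items() if predictions[name] == label)
    return correct / len(labels)
```

In the example used for checking this sketch, the RGB model alone misclassifies one sample, but the summed scores of all three models classify both samples correctly.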


This repo is based on our previous repo

Thanks to the original authors for their work!


If you find this work helpful, please cite our work:

  author={Yu, Bruce X.B. and Liu, Yan and Zhang, Xiang and Zhong, Sheng-hua and Chan, Keith C.C.},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, 
  title={MMNet: A Model-based Multimodal Network for Human Action Recognition in RGB-D Videos}, 


For any question, feel free to contact Bruce Yu: b r u c e x b y u AT space)