This is an implementation of the paper: YOWOv3: An Efficient and Generalized Framework for Human Action Detection and Recognition.
Hello, thank you everyone for your attention to this study. If you find it valuable, please consider leaving a star, as it would greatly encourage me.
If you intend to use this repository for your own research, please consider citing:
@misc{dang2024yowov3efficientgeneralizedframework,
      title={YOWOv3: An Efficient and Generalized Framework for Human Action Detection and Recognition},
      author={Duc Manh Nguyen Dang and Viet Hang Duong and Jia Ching Wang and Nhan Bui Duc},
      year={2024},
      eprint={2408.02623},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.02623},
}
I am very pleased that everyone has shown interest in this project. Many questions have been raised, and I am more than willing to answer them as soon as possible. However, if you have any questions about the code or related matters, please provide me with context (config file, some samples that you couldn't detect, a checkpoint, etc.), and please use English.
This Instruction is divided into smaller sections, each serving a specific purpose. A summary of the structure is provided right below; please read carefully to locate the information you are looking for.
Clone this repository
git clone https://github.com/AakiraOtok/YOWOv3.git
Use Python 3.8 or Python 3.9, and then download the dependencies:
pip install -r requirements.txt
Note: On my system, I use Python 3.7 with slightly different dependencies, specifically for torch:
torch==1.13.1+cu117
torchaudio==0.13.1+cu117
torchvision==0.14.1+cu117
However, when testing on another system, it seems that these versions have been deprecated. I have updated the requirements.txt file and tested it again on systems using Python 3.8 and Python 3.9, and everything seems to be working fine. If you encounter any errors during the environment setup, please try asking in the "issues" section. Perhaps someone has faced a similar issue and has already found a solution.
The project is designed in such a way that almost every configuration can be adjusted through the config file. In the repository, I have provided two sample config files: ucf_config.yaml and ava_config.yaml for the UCF101-24 and AVAv2.2 datasets, respectively. The Basic Usage section will not involve extensive modifications of the config file, while the customization of the config will be covered in the Customization section.
Warning!: Since all configurations are closely related to the config file, please carefully read the part Modify Config file in the Customization section to be able to use the config file correctly.
We have the following command template:
python main.py --mode [mode] --config [config_file_path]
Or the shorthand version:
python main.py -m [mode] -cf [config_file_path]
Here, [mode] is one of {train, eval, detect, live, onnx}, for training, evaluation, detection (visualization on the current dataset), live inference (camera usage), or export to ONNX and inference, respectively. [config_file_path] is the path to the config file.
Example of training a model on UCF101-24:
python main.py --mode train --config config/ucf_config.yaml
Or try evaluating a model on AVAv2.2:
python main.py -m eval -cf config/ava_config.yaml
There are some notes about the config file. It is loaded by the build_config function in utils/build_config.py:

import yaml

def build_config(config_file='config/ucf_config.yaml'):
    with open(config_file, "r") as file:
        config = yaml.load(file, Loader=yaml.SafeLoader)

    if config['active_checker']:
        pass

    return config
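As a quick sketch of what that gives you, the loaded config is just a plain Python dictionary. The keys below are illustrative only, not the full UCF config:

```python
import os
import tempfile

import yaml  # PyYAML, already listed in requirements.txt

# A minimal, hypothetical config written to a temporary file for illustration;
# the real config/ucf_config.yaml contains many more keys.
sample = "dataset: ucf\nactive_checker: true\nimg_size: 224\n"
with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
    f.write(sample)
    path = f.name

# Same loading logic as build_config
with open(path, "r") as file:
    config = yaml.load(file, Loader=yaml.SafeLoader)

os.remove(path)
print(config["dataset"], config["active_checker"])
```

Every option you see in ucf_config.yaml or ava_config.yaml is accessed elsewhere in the code as config['key'].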
I know why you are here, my friend =)))))
You can build a custom dataset for yourself, however, make sure to carefully read the notes below to do it correctly.
Firstly, every time you want to use a dataset, simply call the build_dataset function as shown in the example code:

dataset = build_dataset(config, phase='train')

The build_dataset function is defined in datasets/build_dataset.py as follows:
from datasets.ucf.load_data import build_ucf_dataset
from datasets.ava.load_data import build_ava_dataset

def build_dataset(config, phase):
    dataset = config['dataset']

    if dataset == 'ucf':
        return build_ucf_dataset(config, phase)
    elif dataset == 'ava':
        return build_ava_dataset(config, phase)
To accommodate your needs, you simply need to define a build_custom_dataset function for your specific purpose and modify the build_dataset function above accordingly.
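Extending the dispatch might look like the sketch below. The ucf/ava builders are stubbed so the snippet runs standalone (in the repository they are imported from datasets.ucf.load_data and datasets.ava.load_data), and the 'custom' key is an assumption; use whatever value you put in the dataset field of your config:

```python
def build_ucf_dataset(config, phase):   # stub for illustration
    return ('ucf', phase)

def build_ava_dataset(config, phase):   # stub for illustration
    return ('ava', phase)

def build_custom_dataset(config, phase):
    # Replace with a class implementing __len__ / __getitem__ for your data.
    return ('custom', phase)

def build_dataset(config, phase):
    dataset = config['dataset']

    if dataset == 'ucf':
        return build_ucf_dataset(config, phase)
    elif dataset == 'ava':
        return build_ava_dataset(config, phase)
    elif dataset == 'custom':
        return build_custom_dataset(config, phase)
    raise ValueError(f"unsupported dataset: {dataset}")

print(build_dataset({'dataset': 'custom'}, 'train'))
```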
The model is generalized to train on multi-action datasets, meaning that each box may have multiple actions simultaneously. However, metrics for one box - one action are more common than one box - multi-action. Therefore, I will guide you on evaluation as one box - one action, while training as one box - multi-action for generalization.
The build_dataset function returns custom-defined dataset classes. There are two important parameters to consider: config and phase. config is a dictionary containing the options from the config file (loaded beforehand), with nothing particularly special. phase has two values: train or test. train is used for training, and test is used for the detection/evaluation/live stages.
Let C be the number of channels, D the number of frames in the clip, H and W its spatial height and width, and N the number of ground-truth boxes. You need to return:
clip: a tensor with shape $[C, D, H, W]$ representing the clip to be detected.
boxes: a tensor with shape $[N, 4]$, containing the coordinates x_top_left, y_top_left, x_bottom_right, y_bottom_right of the ground-truth boxes.
labels: the action labels corresponding to each box; since training is one box - multi-action, a single box may carry several labels.
Additionally, to use detect.py, the get_item function also needs an additional parameter get_origin_image. If this parameter is set to True, it should return the original unaltered image.
Please note that class indices start at $0$.
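A minimal skeleton following this contract is sketched below. The clip length, image size, and loading logic are all placeholder assumptions, not repository code; replace them with your own data pipeline:

```python
import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, config, phase='train'):
        self.config = config
        self.phase = phase          # 'train' or 'test'
        self.num_frames = 16        # D: clip length (assumed value)
        self.img_size = 224         # H = W (assumed value)

    def __len__(self):
        return 1                    # replace with the number of clips

    def __getitem__(self, idx, get_origin_image=False):
        # clip: [C, D, H, W] (a dummy all-zeros clip here)
        clip = torch.zeros(3, self.num_frames, self.img_size, self.img_size)
        # boxes: [N, 4] as (x_top_left, y_top_left, x_bottom_right, y_bottom_right)
        boxes = torch.tensor([[10.0, 20.0, 100.0, 200.0]])
        # labels: action indices per box, starting at class index 0
        labels = torch.tensor([0])
        if get_origin_image:
            origin_image = torch.zeros(self.img_size, self.img_size, 3)
            return origin_image, clip, boxes, labels
        return clip, boxes, labels

dataset = CustomDataset(config={}, phase='train')
clip, boxes, labels = dataset[0]
print(clip.shape, boxes.shape)
```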
To evaluate, use ucf_eval.py.
All pre-trained models for backbone2D, backbone3D and model checkpoints are publicly available on my Hugging Face repo.
Regarding the model checkpoints, I have consolidated them into an Excel file that looks like this:
Each cell represents a model checkpoint, displaying information such as mAP, GFLOPs, and number of parameters, in that order. The checkpoints are stored as folders named after the corresponding cells in the Excel file (e.g., O27, N23, ...). Each folder contains the respective config file used for training that model. Please note that both the regular checkpoint and the exponential moving average (EMA) version of the model are saved.
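If you want to inspect a downloaded checkpoint programmatically, a hedged sketch follows. The key names 'model_state_dict' and 'ema_state_dict' are assumptions made for illustration, so a demo checkpoint is built in place; open the actual file from the Hugging Face repo to see its real layout:

```python
import os
import tempfile

import torch

# Build a small demo checkpoint so the snippet runs standalone.
demo = {
    'model_state_dict': {'w': torch.ones(2)},
    'ema_state_dict': {'w': torch.ones(2)},
}
path = os.path.join(tempfile.gettempdir(), 'demo_ckpt.pth')
torch.save(demo, path)

# map_location='cpu' lets you inspect GPU-trained checkpoints on any machine.
checkpoint = torch.load(path, map_location='cpu')
print(sorted(checkpoint.keys()))
os.remove(path)
```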
Warning!: Since all configurations are closely related to the config file, please carefully read the part Modify Config file in the Customization section to be able to use the config file correctly.
The architecture of YOWOv3 does not differ much from YOWOv2, although initially I had planned to try a few things. Firstly, the feature map from the 3D branch does not go through path aggregation but merges with the 2D branch and is used by the model for predictions directly. This makes the architecture look quite simple, and I believe it will have a significant impact on performance. Another thing is that in this paper, the authors propose an alternative method to replace the decoupled head called Task-aligned head, which can avoid repeating attention modules in YOWOv3 and make the model much lighter.
The project was developed gradually through stages and gradually expanded, so there are still some areas that are not comprehensive enough. For example, evaluating on AVA v2.2 took a lot of time because I did not parallelize this process (batch_size = 1). The reason for this is that the format required by AVA v2.2 demands an additional identity for the bounding box, which I did not set up in the initial evaluation code as it was only serving experiments on UCF101-24 at that time.
I would like to express my sincere gratitude to the following amazing repositories/codes, which were the primary sources I heavily relied on and borrowed code from during the development of this project: