
easy_ViTPose


Accurate 2D human and animal pose estimation

Open In Colab

Easy-to-use SOTA ViTPose [Y. Xu et al., 2022] models for fast inference.

We provide all the original ViTPose models, converted for inference, with a single-dataset output format.

In addition, we provide a COCO-25 model, trained on the original COCO dataset plus the foot keypoints dataset (https://cmu-perceptual-computing-lab.github.io/foot_keypoint_dataset/). Finetuning is not currently supported; check commit de43d54cad87404cf0ad4a7b5da6bacf4240248b and earlier commits for a working state of train.py.

[!WARNING] Ultralytics YOLOv8 has an issue producing wrong bounding boxes when using MPS; upgrade to the latest version! (Works correctly on 8.2.48)

Results

resimg

https://github.com/JunkyByte/easy_ViTPose/assets/24314647/e9a82c17-6e99-4111-8cc8-5257910cb87e

https://github.com/JunkyByte/easy_ViTPose/assets/24314647/63af44b1-7245-4703-8906-3f034a43f9e3

(Credits dance: https://www.youtube.com/watch?v=p-rSdt0aFuw )
(Credits zebras: https://www.youtube.com/watch?v=y-vELRYS8Yk )

Features

We run YOLOv8 for detection; it does not provide complete animal detection. You can finetune a custom YOLO model to detect the animal you are interested in; if you do, please open an issue, as we might want to integrate other detection models.

Benchmark:

You can expect real-time performance (>30 fps) on modern NVIDIA GPUs and on Apple Silicon (using Metal!).

Skeleton reference

There are multiple skeletons for different datasets. Check the definitions in visualization.py.

Installation and Usage

[!IMPORTANT] Install torch>2.0 with CUDA / MPS support yourself. Also check requirements_gpu.txt.
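For example, a CUDA build can usually be installed with a command along these lines; check https://pytorch.org/get-started/locally/ for the exact command matching your CUDA version (on Apple Silicon a plain pip install torch already includes MPS support):

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121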

git clone git@github.com:JunkyByte/easy_ViTPose.git
cd easy_ViTPose/
pip install -e .
pip install -r requirements.txt

Download models

Export to onnx and tensorrt

$ python export.py --help
usage: export.py [-h] --model-ckpt MODEL_CKPT --model-name {s,b,l,h} [--output OUTPUT] [--dataset DATASET]

optional arguments:
  -h, --help            show this help message and exit
  --model-ckpt MODEL_CKPT
                        The torch model that shall be used for conversion
  --model-name {s,b,l,h}
                        [s: ViT-S, b: ViT-B, l: ViT-L, h: ViT-H]
  --output OUTPUT       File (without extension) or dir path for checkpoint output
  --dataset DATASET     Name of the dataset. If None it's extracted from the file name. ["coco", "coco_25",
                        "wholebody", "mpii", "ap10k", "apt36k", "aic"]

Run inference

To run inference from command line you can use the inference.py script as follows:

$ python inference.py --help
usage: inference.py [-h] [--input INPUT] [--output-path OUTPUT_PATH] --model MODEL [--yolo YOLO] [--dataset DATASET]
                    [--det-class DET_CLASS] [--model-name {s,b,l,h}] [--yolo-size YOLO_SIZE]
                    [--conf-threshold CONF_THRESHOLD] [--rotate {0,90,180,270}] [--yolo-step YOLO_STEP]
                    [--single-pose] [--show] [--show-yolo] [--show-raw-yolo] [--save-img] [--save-json]

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT         path to image / video or webcam ID (=cv2)
  --output-path OUTPUT_PATH
                        output path, if the path provided is a directory output files are "input_name
                        +_result{extension}".
  --model MODEL         checkpoint path of the model
  --yolo YOLO           checkpoint path of the yolo model
  --dataset DATASET     Name of the dataset. If None it's extracted from the file name. ["coco", "coco_25",
                        "wholebody", "mpii", "ap10k", "apt36k", "aic"]
  --det-class DET_CLASS
                        ["human", "cat", "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra", "giraffe",
                        "animals"]
  --model-name {s,b,l,h}
                        [s: ViT-S, b: ViT-B, l: ViT-L, h: ViT-H]
  --yolo-size YOLO_SIZE
                        YOLOv8 image size during inference
  --conf-threshold CONF_THRESHOLD
                        Minimum confidence for keypoints to be drawn. [0, 1] range
  --rotate {0,90,180,270}
                        Rotate the image by [90, 180, 270] degrees counterclockwise
  --yolo-step YOLO_STEP
                        The tracker can be used to predict the bboxes instead of yolo for performance; this flag
                        specifies how often yolo is applied (e.g. 1 applies yolo every frame). This has no effect
                        when is_video is False
  --single-pose         Do not use SORT tracker because single pose is expected in the video
  --show                preview result during inference
  --show-yolo           draw yolo results
  --show-raw-yolo       draw yolo results before SORT is applied for tracking (only valid during video inference)
  --save-img            save image results
  --save-json           save json results
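
For example, a typical video run might look like the following (checkpoint and video paths are illustrative):

$ python inference.py --input ./examples/video.mp4 --model ./ckpts/vitpose-s-coco_25.pth --yolo ./yolov8s.pt --output-path ./results/ --show --save-img --save-json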

You can run inference from code as follows:

import cv2
from easy_ViTPose import VitInference

# Image to run inference on, in RGB format
img = cv2.imread('./examples/img1.jpg')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# set is_video=True to enable tracking in video inference
# be sure to use VitInference.reset() function to reset the tracker after each video
# There are a few flags that allow you to customize VitInference, be sure to check the class definition
model_path = './ckpts/vitpose-s-coco_25.pth'
yolo_path = './yolov8s.pt'

# If you want to use MPS (on new macbooks) use the torch checkpoints for both ViTPose and Yolo
# If device is None will try to use cuda -> mps -> cpu (otherwise specify 'cpu', 'mps' or 'cuda')
# dataset and det_class parameters can be inferred from the ckpt name, but you can specify them.
model = VitInference(model_path, yolo_path, model_name='s', yolo_size=320, is_video=False, device=None)

# Infer keypoints, output is a dict where keys are person ids and values are keypoints (np.ndarray (25, 3): (y, x, score))
# If is_video=True the IDs will be consistent among the ordered video frames.
keypoints = model.inference(img)

# call model.reset() after each video

img = model.draw(show_yolo=True)  # Returns RGB image with drawings
cv2.imshow('image', cv2.cvtColor(img, cv2.COLOR_RGB2BGR)); cv2.waitKey(0)

[!NOTE] If the input file is a video SORT is used to track people IDs and output consistent identifications.
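
A minimal sketch of video inference with tracking from code, reusing the model_path and yolo_path from above (the video path is illustrative):

import cv2
from easy_ViTPose import VitInference

# is_video=True enables the SORT tracker so person IDs stay consistent across frames
model = VitInference(model_path, yolo_path, model_name='s', yolo_size=320, is_video=True, device=None)

cap = cv2.VideoCapture('./examples/video.mp4')
while True:
    ret, frame = cap.read()
    if not ret:
        break
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    keypoints = model.inference(frame)  # dict: track id -> (25, 3) array of (y, x, score)

cap.release()
model.reset()  # reset the tracker before processing another video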

OUTPUT json format

The output format of the json files:

{
    "keypoints":
    [  # The list of frames, len(json['keypoints']) == len(video)
        {  # For each frame a dict
            "0": [  #  keys are id to track people and value the keypoints
                [121.19, 458.15, 0.99], # Each keypoint is (y, x, score)
                [110.02, 469.43, 0.98],
                [110.86, 445.04, 0.99],
            ],
            "1": [
                ...
            ],
        },
        {
            "0": [
                [122.19, 458.15, 0.91],
                [105.02, 469.43, 0.95],
                [122.86, 445.04, 0.99],
            ],
            "1": [
                ...
            ]
        }
    ],
    "skeleton":
    {  # Skeleton reference: key is the keypoint index, value its name
        "0": "nose",
        "1": "left_eye",
        "2": "right_eye",
        "3": "left_ear",
        "4": "right_ear",
        "5": "neck",
        ...
    }
}
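
As an illustration, a small snippet to read one of these files and print the first keypoint of person "0" in the first frame could look like this (the file name is hypothetical; use whatever --save-json produced):

import json

with open('video_result.json') as f:  # hypothetical path produced by --save-json
    data = json.load(f)

skeleton = data['skeleton']           # maps keypoint index (as a string) to its name
first_frame = data['keypoints'][0]    # dict: person id -> list of (y, x, score)
y, x, score = first_frame['0'][0]
print(f"{skeleton['0']} at (x={x:.1f}, y={y:.1f}) with score {score:.2f}")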

Finetuning

Finetuning is possible but not officially supported right now. If you would like to finetune and need help, open an issue.
You can check train.py, datasets/COCO.py and config.yaml for details.


Evaluation on COCO dataset

  1. Download COCO dataset images and labels

  2. Run the following command:

    
    $ python evaluation_on_coco.py

    Command line arguments:
        --model_path: Path to the pretrained ViTPose model
        --yolo_path: Path to the YOLOv8 model
        --img_folder_path: Path to the directory containing the COCO val images (/val2017 extracted in step 1)
        --annFile: Path to the json file with COCO keypoint annotations for the val set (annotations/person_keypoints_val2017.json extracted in step 1)
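
    For example, assuming the val2017 images and annotations were extracted under ./coco (checkpoint and dataset paths are illustrative):

    $ python evaluation_on_coco.py --model_path ./ckpts/vitpose-b-coco.pth --yolo_path ./yolov8s.pt --img_folder_path ./coco/val2017 --annFile ./coco/annotations/person_keypoints_val2017.json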

Docker

The system can be built in a container using Docker. This is intended to demonstrate containerized inference; adapt it to your own needs by changing models and skeletons:

docker build . -t easy_vitpose

The image is based on NVIDIA's PyTorch image, which is roughly 20 GB in size. If you have a compatible GPU set up with the NVIDIA Container Toolkit, ViTPose will run with hardware acceleration.

To test an example, create a folder called cats with a picture of a cat as image.jpg. Run ./models/download.sh to fetch the large yolov8 and ap10k ViTPose models. Then run inference using the following command (replace with the correct cats and models paths):

docker run --gpus all --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v ./models:/models -v ~/cats:/cats easy_vitpose python inference.py --det-class cat --input /cats/image.jpg --output-path /cats --save-img --model /models/vitpose-l-ap10k.onnx --yolo /models/yolov8l.pt

The result image may be viewed in your cats folder.

TODO:

Feel free to open issues, submit pull requests, and contribute to these TODOs.

Reference

Thanks to the ViTPose authors and their official implementation ViTAE-Transformer/ViTPose.
The SORT code is taken from abewley/sort.