VIPL-SLP / pointlstm-gesture-recognition-pytorch

This repo holds the codes of paper: An Efficient PointLSTM for Point Clouds Based Gesture Recognition (CVPR 2020).
https://openaccess.thecvf.com/content_CVPR_2020/html/Min_An_Efficient_PointLSTM_for_Point_Clouds_Based_Gesture_Recognition_CVPR_2020_paper.html
Apache License 2.0

How to use model? #12

Closed watermellon2018 closed 3 years ago

watermellon2018 commented 3 years ago

How to use the model on a picture or a webcam? I want to classify a gesture in a picture. I tried to load the pretrained model, but I get an error. Thanks

import torch

checkpoint = torch.load('/content/epoch200_model.pt')

model = Motion(num_classes=28, pts_size=128, offsets=False, topk=16, knn=(16, 48, 48, 12))
model.load_state_dict(checkpoint['model_state_dict'])

# Rebuild the optimizer the same way as in training, then restore its state.
# ("Optimazer" is not a PyTorch class; torch.optim.SGD here is just an example.)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])

epoch = checkpoint['epoch']
loss = checkpoint['loss']
model.eval()
watermellon2018 commented 3 years ago

Okay, I got it working. But how do I preprocess an image? How can I use the pretrained model on a custom image? I tried model(img) but it doesn't work.

ycmin95 commented 3 years ago

@watermellon2018 Thanks for your attention. The proposed method aims to recognize gestures from point clouds, which are sampled from depth images. The proposed method is evaluated on SHREC'17 and Nvidia; you can use the pretrained model under conditions similar to these datasets, or you can collect your own dataset and train from scratch.
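
The sampling step the reply describes (depth image → point cloud) can be sketched roughly as follows. This is an illustration, not the repo's code: the pinhole intrinsics FX, FY, CX, CY are made-up values you would replace with your camera's calibration, and the repo's own conversion lives in its dataset scripts.

```python
import numpy as np

# Hypothetical pinhole intrinsics -- replace with your depth camera's calibration.
FX, FY, CX, CY = 475.0, 475.0, 160.0, 120.0

def depth_to_point_cloud(depth, num_points=128):
    """Sample a fixed-size (x, y, z) point cloud from a depth image.

    depth: (H, W) array with 0 where there is no valid measurement.
    Returns an array of shape (num_points, 3).
    """
    v, u = np.nonzero(depth)                       # pixel coordinates of valid depth
    z = depth[v, u].astype(np.float32)
    x = (u - CX) * z / FX                          # pinhole back-projection
    y = (v - CY) * z / FY
    pts = np.stack([x, y, z], axis=1)
    idx = np.random.choice(len(pts), num_points,   # random sampling, with
                           replace=len(pts) < num_points)  # replacement if too few
    return pts[idx]

depth = np.zeros((240, 320), dtype=np.float32)
depth[100:140, 150:200] = 500.0                    # fake hand region at 500 mm
cloud = depth_to_point_cloud(depth)
print(cloud.shape)                                 # (128, 3)
```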

watermellon2018 commented 3 years ago

Thanks for your reply. I was able to run the model to predict gestures from video. Do I understand correctly that to run the model I need to get a point cloud from the picture (video frame)? I get the point clouds with your script dataset/nvidea_process.py for every frame:

  1. Apply cv2.threshold
  2. Apply the function save_largest_label
  3. Apply cv2.erode
  4. Generate point clouds and take N random points
  5. Get new coordinates with the function uvd2xyz_sherc
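
A rough numpy-only sketch of steps 1, 3, and 4 above (step 2, keeping only the largest connected component, is left to the repo's save_largest_label). The near/far depth range and the 3×3 erosion are illustrative assumptions, not values from the repo:

```python
import numpy as np

def hand_mask(depth, near=200.0, far=700.0):
    """Step 1: threshold the depth image to an assumed hand range (mm)."""
    return (depth > near) & (depth < far)

def erode3x3(mask):
    """Step 3: 3x3 binary erosion (a numpy stand-in for cv2.erode)."""
    m = np.pad(mask, 1)                       # pad with False
    out = np.ones_like(mask, dtype=bool)
    h, w = mask.shape
    for dy in (0, 1, 2):
        for dx in (0, 1, 2):
            out &= m[dy:dy + h, dx:dx + w]    # keep pixels whose whole 3x3
    return out                                # neighborhood is foreground

def sample_points(depth, mask, n=128):
    """Step 4: take n random (u, v, d) points from the masked region."""
    v, u = np.nonzero(mask)
    idx = np.random.choice(len(v), n, replace=len(v) < n)
    return np.stack([u[idx], v[idx], depth[v[idx], u[idx]]], axis=1)

depth = np.zeros((240, 320), dtype=np.float32)
depth[100:160, 140:220] = 450.0               # fake hand blob at 450 mm
mask = erode3x3(hand_mask(depth))
pts = sample_points(depth, mask)
print(pts.shape)                              # (128, 3)
```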

Then I get an array with shape (529, 128, 8). To run the model and save GPU memory, I take an array with shape (33, 128, 8) (a batch) and add a new axis. In the end I get an array (1, 33, 128, 8), which I pass into the model to get a prediction for that time window. I do this 529//33 times.
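
The chunking described here can be sketched like this (numpy only; the model call is a placeholder). Note that 529 is not divisible by 33, so splitting every 33 frames leaves a ragged 1-frame window at the end, which you would probably want to drop or pad before feeding it to the network:

```python
import numpy as np

# Stand-in for the preprocessed sequence: (time, points, features).
clouds = np.zeros((529, 128, 8), dtype=np.float32)

chunk = 33
windows = np.array_split(clouds, range(chunk, len(clouds), chunk))
# 529 = 16 * 33 + 1, so this yields 16 full windows plus a ragged final one.
batches = [w[np.newaxis] for w in windows]     # add the batch axis -> (1, T, 128, 8)
# for b in batches:
#     pred = model(torch.from_numpy(b))        # hypothetical forward pass

print(len(batches), batches[0].shape, batches[-1].shape)
# 17 (1, 33, 128, 8) (1, 1, 128, 8)
```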

Please tell me whether I understood your algorithm correctly, or whether this is not the right way to predict a gesture from video.

ycmin95 commented 3 years ago

@watermellon2018 Yes, steps 1-5 sample the point clouds; you can change them based on your own data. Why do you do it "529//33 times"? Do you mean an ensemble?

watermellon2018 commented 3 years ago

First I have an array with shape (529, 128, 8). If I pass all of it into the model, the GPU runs out of memory, so I decided to split the array along the time axis and feed the model arrays with shape (33, 128, 8). To pass the entire array through the model, I have to run the data through the model 17 times. Do I understand correctly that axis 1 (529) is the time axis and axis 2 is the points?

watermellon2018 commented 3 years ago

I have one more question: before I apply steps 1-5 above, do I need to detect hands and apply those steps to the hand regions? Or can I apply them to the whole picture with the person?

ycmin95 commented 3 years ago

@watermellon2018 It will be more accurate if you can locate the hand region for gestures that mainly involve hand postures. Some gestures are related to the whole human pose, which can also be recognized by the proposed method; see our experiments on the MSRAction (CVPR'20) and UBPG (BMVC'19) datasets.

ycmin95 commented 3 years ago

First I have an array with shape (529, 128, 8). If I pass all of it into the model, the GPU runs out of memory, so I decided to split the array along the time axis and feed the model arrays with shape (33, 128, 8). To pass the entire array through the model, I have to run the data through the model 17 times. Do I understand correctly that axis 1 (529) is the time axis and axis 2 is the points?

That simply matches the code. I think previous work on online gesture recognition may also be helpful for your situation: Online Detection and Classification of Dynamic Hand Gestures With Recurrent 3D Convolutional Neural Network.

watermellon2018 commented 3 years ago

Thank you for your answer. I can now predict gestures from video, but the accuracy is not high enough for me :( I will collect my own dataset and train. But I would like to ask one more question: the SHREC'17 dataset has both 14 and 28 gesture labels. I don't understand the 28-gesture setting. Is the label per hand (left/right) plus the gesture type?

ycmin95 commented 3 years ago

@watermellon2018 You can find examples on this page; more details can be found on the official site.

ycmin95 commented 3 years ago

Hope you can find the information you need, feel free to open it again~