Hi, you may make a temporal window (e.g., window_size = 3s, step = 0.5s) to convert the online stream into offline clips. The prediction results of several clips (e.g., 3 clips) can be averaged together to obtain reasonable accuracy on the online stream.
Thanks for your reply.
Apologies, I don't fully understand your idea. Could you explain a bit more about making a temporal window to convert the input stream to offline clips? Is what you mean similar to using a 3D convolutional layer?
I am thinking of feeding DD-Net a pose keypoint volume of shape (num_people_poses, 32, 15, 2) collected after 32 frames of the input stream. The action labels would then be assigned to the people's poses and visualised. Do you think that would also work?
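For example, something like this (just a sketch; `ddnet_model`, `preprocess_clip` and `class_names` are placeholders for a trained model, its feature preprocessing, and the label list):

```python
import numpy as np

def predict_per_person(ddnet_model, pose_volume, preprocess_clip, class_names):
    """Assign one action label per tracked person.

    pose_volume: array of shape (num_people, 32, 15, 2), one 32-frame clip
    of 15 (x, y) joints per person. `preprocess_clip` stands in for whatever
    feature extraction the model expects (e.g. the data_generator step).
    """
    labels = []
    for clip in pose_volume:                      # clip: (32, 15, 2)
        features = preprocess_clip(clip)          # model-specific features
        scores = ddnet_model.predict(features)    # (1, num_classes)
        labels.append(class_names[int(np.argmax(scores))])
    return labels
```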
Br.
(1) A simple way to use the temporal window: suppose you want the pose class at time T. To utilise temporal information, you use the pose information from a few moments ago: with a temporal window W, you collect the poses from time T-W to T, which can be fed into this model. How far back should you go? If it is too far, the action has already changed; conversely, if it is too close, there is not enough temporal information. That is a trade-off you need to balance. Once you have a window, how frequently do you want to do action classification? If you use a step L, your next window will span T-W+L to T+L. If you assume the action class stays the same within N steps, you can average the predicted action class scores over those N consecutive windows.
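For example, a rough sketch of this sliding-window scheme in Python (the window/step values and the `predict_clip` helper are only placeholders, not code from this repo):

```python
from collections import deque
import numpy as np

WINDOW = 32   # frames per clip (W), e.g. ~3 s at ~10 fps
STEP = 5      # frames between classifications (L)
N_AVG = 3     # number of recent clip scores to average

pose_buffer = deque(maxlen=WINDOW)   # rolling buffer of per-frame poses
score_buffer = deque(maxlen=N_AVG)   # rolling buffer of clip-level scores

def on_new_frame(pose, frame_idx, predict_clip):
    """Call once per incoming frame; returns a smoothed class id or None."""
    pose_buffer.append(pose)
    if len(pose_buffer) < WINDOW or frame_idx % STEP != 0:
        return None
    clip = np.stack(list(pose_buffer))                 # (WINDOW, joints, 2)
    score_buffer.append(predict_clip(clip))            # (num_classes,) scores
    return int(np.argmax(np.mean(list(score_buffer), axis=0)))
```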
(2) For the multiple-people case, you may take statistical values (e.g., mean, max, min) of the features across the persons, and then use another network to fuse them.
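For example, something like this in Keras (the feature dimension, class count and fusion head are only illustrative, not from this repo):

```python
import numpy as np
from tensorflow.keras import layers, Model

FEAT_DIM = 128      # assumed size of per-person mid-layer features
NUM_CLASSES = 21    # e.g. JHMDB

# Pool per-person features into fixed-size group statistics,
# then classify the group with a small fusion network.
inp = layers.Input(shape=(3 * FEAT_DIM,))            # [mean | max | min]
x = layers.Dense(64, activation="relu")(inp)
out = layers.Dense(NUM_CLASSES, activation="softmax")(x)
fusion_net = Model(inp, out)

def pool_person_features(person_feats):
    """person_feats: (num_people, FEAT_DIM) mid-layer features, one row per person."""
    return np.concatenate([person_feats.mean(axis=0),
                           person_feats.max(axis=0),
                           person_feats.min(axis=0)])

# usage: fusion_net.predict(pool_person_features(person_feats)[None, :])
```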
Thank you for your supportive suggestion and detailed elaboration, I really appreciate it.
I suppose that (2) is for multi-person actions, am I correct? In the case of predicting multiple people and multiple actions, I suppose I would only need to average the predicted action for each person, and the fusing network should not be needed, right?
Br.
Sorry for misunderstanding your point. When I saw Openpose I thought you were doing group-activity recognition, but you can use it for individuals via pose tracking. For multi-person group actions, the idea is not to average the final action predictions but to average the middle-layer features. Anyhow, it seems to be unrelated to your case.
I have tried to train DD-Net on the whole dataset by concatenating the splits into one, in order to obtain better training results (94-95% val_acc). Yet the results when testing on a camera input stream weren't good: predictions sometimes flickered for poses that were a bit unusual and that the dataset maybe didn't contain. For example, for sitting with a rather relaxed, laid-back pose instead of sitting up straight, the model would flicker to 'stand', 'wave', etc. I wondered whether training on the combined splits just didn't cut it and the training results weren't that representative. I tried training on a subset of the dataset with a few selected classes (e.g. walk, stand, sit), but the problem remained.
I also tried to improve training performance on the individual splits with the selected class subsets, and I managed to pull the val_acc of each split over 80% by using class weights to alleviate the skew, since 'walk' has roughly 3 times more data than the other classes. So I guess this is more representative of how the model would perform in real life. When testing, though, the inference performance of course still wasn't there.
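For reference, this is roughly how I compute the class weights (a sketch; the label array is whatever integer labels come out of the data loading):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

def balanced_class_weights(y_train):
    """Return a {class_id: weight} dict that down-weights over-represented
    classes (e.g. 'walk'), for use with model.fit(..., class_weight=...)."""
    classes = np.unique(y_train)
    weights = compute_class_weight(class_weight="balanced",
                                   classes=classes, y=y_train)
    return dict(zip(classes, weights))
```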
Note: because the 'pos_world' data was normalised with the scale of the puppet flow, I couldn't obtain that data with just Openpose, or at least I don't know how to do that, so I used the 'pos_img' data instead and normalised it by its mean (using your norm_scale() function).
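Concretely, I do roughly this (a sketch, assuming norm_scale() normalises by subtracting and dividing by the mean, which is how I read it):

```python
import numpy as np

def norm_by_mean(pos_img):
    """Mean-normalise Openpose image coordinates: subtract the mean and
    divide by it, so the values are roughly centred and scale-free
    regardless of frame resolution.

    pos_img: (num_frames, 15, 2) array of (x, y) joint positions in pixels.
    """
    mean = np.mean(pos_img)
    return (pos_img - mean) / mean
```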
Do you have any suggestions to improve it?
Br.
I really appreciate that many of you help me improve my code! I am not an expert on action recognition and am still learning, but I would like to share what I know. Although this work uses skeletons, I find that RGB usually helps obtain better action recognition performance, because it is easy to introduce noise into skeleton estimation, and we also lose the context information, which is very useful for identifying actions. For some real applications, you can even use a simple but decent method: https://www.pyimagesearch.com/2019/07/15/video-classification-with-keras-and-deep-learning/
Many thanks in return for your sharing. I'm going to look into it.
Br.
Hi Fan Yang.
Would it be possible to live demo the JHMDB model using pose output from Openpose on a webcam input stream?
I find your project very fascinating and have been playing around with it recently. I would like to try to combine your action recognition model with the Openpose pose estimation model and make a live demo programme.
Since Openpose uses a slightly different output schema from that of the JHMDB dataset, I may need to modify the preprocessing procedure in your code and re-train the model to make the two models compatible with each other.
However, I don't quite understand what you did when processing the data before feeding it to the model (specifically the zoom() function in data_generator()) and the normalisation done by the dataset (specifically the 'pos_world' data). Could you elaborate on it a bit? Also, I am thinking the normalisation technique used by JHMDB (normalising w.r.t. the frame size and the puppet flow) might be too specialised and could not be applied to the pose output of OpenPose. I hope I'm wrong here, and if so, could you shed some light on how to do that as well?
Thanks in advance.