ami-iit / element_human-action-intention-recognition


Add state of the art and relevant papers to the element #2

Closed kouroshD closed 4 years ago

kouroshD commented 4 years ago

In this issue, I will add a short documentation of the state of the art related to this element, according to the objectives of the element.

kouroshD commented 4 years ago

\textbf{Work done by Serena's team} \textit{Intention recognition} is defined as the ability to predict the upcoming or ongoing action of a human or another physical agent, as well as predicting the manifold in which the action is performed \cite{Dermy2019}. From the human perspective, intention recognition/prediction is correlated with the \textit{legibility} and \textit{predictability} of the motion: a user infers the goal of a legible motion quickly, and a motion is predictable when it matches the user's expectation given its goal. One of the sub-problems in movement recognition is the estimation of the trajectory/action duration \cite{Dermy2017}.

\cite{Dermy2017} proposed a framework for human intention prediction during physical human-robot collaboration. To do so, the robot should recognize the task the human is performing, predict the future trajectory, and complete the movement autonomously. In their scenario, the human guides the robot at the beginning of the trajectory, and when he releases the robot's hand, the robot should recognize the human intention and follow the intended future trajectory. Assuming all human motions are goal-oriented, recognizing the human intention means finding the goal. A probabilistic movement primitive (ProMP) approach is used to model the human action/skill. In the ProMP representation, one learns the weights $\omega \in R^{M}$ for $M$ radial basis functions (RBF) collected in the vector $\phi(t)$; the authors use Gaussian basis functions. Humans may take different amounts of time to perform the same action; therefore, in human action/intention recognition we need a time modulation $\alpha$ at the learning phase, and while inferring we need to estimate $\alpha$ prior to execution. In this paper, they propose four different methods to estimate the value $\hat{\alpha}$ given the human's performed skill $\hat{k}$, namely: 1) the \textbf{mean} of all the $\alpha_i$ values; 2) \textbf{maximum likelihood}, using the $\alpha_i$ among the demonstrations that maximizes the likelihood of the observed data given the trained parameters; 3) \textbf{minimum distance}, in which the distance between the observed data and the modeled skill is minimized; 4) \textbf{modeling} the time modulation as another ProMP problem. Given the set of observed data, the time modulation factor is computed for each skill using the methods described above. Then, among the different modeled skills, the one with minimum distance between the model and the observed data is chosen. Finally, the posterior distributions are updated (Equations 8 and 16 of the paper) and the inferred future trajectory is obtained. Experiments were done in simulation using a haptic feedback device and with the real iCub.
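
To make the ProMP representation concrete, here is a minimal numpy sketch of two of the ingredients above: learning the weight vector $\omega$ over $M$ Gaussian RBFs by regularized least squares for one demonstration, and the \textbf{minimum distance} way of estimating $\hat{\alpha}$ from a partial observation. The function names, basis width, and candidate-$\alpha$ grid are my own illustrative assumptions, not the implementation of \cite{Dermy2017}.

```python
import numpy as np

def rbf_basis(t, M=20, width=0.02):
    """Evaluate M normalized Gaussian radial basis functions at phase values t in [0, 1]."""
    centers = np.linspace(0.0, 1.0, M)
    phi = np.exp(-(t[:, None] - centers[None, :]) ** 2 / (2.0 * width))
    return phi / phi.sum(axis=1, keepdims=True)

def learn_weights(trajectory, M=20):
    """Fit the ProMP weight vector for one demonstration by regularized least squares."""
    T = len(trajectory)
    phi = rbf_basis(np.linspace(0.0, 1.0, T), M)                  # (T, M)
    return np.linalg.solve(phi.T @ phi + 1e-6 * np.eye(M), phi.T @ trajectory)

def estimate_alpha_min_distance(observed, w_mean, alphas, nominal_T):
    """'Minimum distance' estimate of the time modulation: pick the alpha whose
    rescaled model trajectory is closest to the partially observed trajectory."""
    best_alpha, best_dist = None, np.inf
    for alpha in alphas:                                          # e.g. np.linspace(0.5, 2.0, 31)
        T_scaled = max(2, int(round(nominal_T * alpha)))
        model = rbf_basis(np.linspace(0.0, 1.0, T_scaled), len(w_mean)) @ w_mean
        n = min(len(observed), T_scaled)
        dist = np.linalg.norm(observed[:n] - model[:n]) / n
        if dist < best_dist:
            best_alpha, best_dist = alpha, dist
    return best_alpha
```

Fitting `learn_weights` to each demonstration and taking the mean and covariance of the resulting weight vectors gives the ProMP distribution; the same minimum-distance criterion can then also rank the skills against a new partial observation.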

Gaze and head/face direction have been used as the fundamental medium to recognize intention, coupled with the \textit{context} in which the action is performed. The gaze and face direction information are used as a prior to detect the intended action \cite{Dermy2019}.

Using ProMPs, \cite{Dermy2019} show that face tracking increases the recognition performance with respect to the case in which haptic information is used. However, the accuracy of the posterior distribution, i.e., of the trajectory prediction, is higher only when haptic information is used.

\textbf{Questions:}
1. How do we know the starting point of an action? Are we using a moving window?
2. Is the point (or time step) at which we use the observed data to predict and recognize the intention fixed, or do we perform recognition and prediction continuously as time evolves?
3. How do we assign a label to a motion? Do we take the maximum of the probability distribution over the different actions at recognition time, or is it a more complex algorithm?
4. Unrelated question: what is the control mode of the robot that we use for haptic guidance?

In \cite{Malaise2019}, the authors propose an automatic method for ergonomic assessment. For human action recognition, a Hidden Markov Model (HMM) recognizes the human activity within a taxonomy of activities used for automatic ergonomic assessment. The taxonomy is based on the Ergonomic Assessment Worksheet (EAWS), which evaluates the bio-mechanical factors of Work-related Musculoskeletal Disorders (WMSDs), i.e., posture, external force, manipulated load, and task repetitiveness. Offline, they annotate four levels of activities for ergonomic assessment, namely \textit{main postures} (e.g., standing, walking, sitting), \textit{torso and arms configuration} (e.g., upright, bent forward, strongly bent), \textit{full postural information} composed of the previous two levels (e.g., standing upright, standing bent forward), and \textit{goal-oriented actions} (e.g., reaching, picking, placing, releasing). To model the human activities they use data from the Xsens MVN suit (including positions, orientations, velocities, accelerations, and CoM information) and a contact e-glove, including finger flexion (3 values) and finger/palm pressure sensors (4 values). After the data are annotated manually by human subjects, different methods are used to reduce the dimension of the data, including a wrapper-based method, a filter-based method, and principal component analysis (PCA). Using the reduced set of features, they train models of the human actions with HMMs. Online, the incoming data are examined and classified.
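
The paper does not spell out the exact classification pipeline, so the following is only a plausible sketch of a one-HMM-per-activity scheme: each activity model is trained on the reduced feature sequences, and each incoming window is assigned to the model with the highest log-likelihood. The use of hmmlearn, the number of hidden states, and the diagonal covariance are my assumptions.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # third-party: pip install hmmlearn

def train_activity_models(sequences_per_activity, n_states=5):
    """Train one Gaussian-emission HMM per activity from lists of feature sequences."""
    models = {}
    for label, sequences in sequences_per_activity.items():
        X = np.vstack(sequences)                   # stack all demonstrations of this activity
        lengths = [len(seq) for seq in sequences]  # tell hmmlearn where each sequence ends
        model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=100)
        model.fit(X, lengths)
        models[label] = model
    return models

def classify_window(models, window):
    """Classify one feature window (e.g. 250 ms, 50% overlap) by maximum log-likelihood."""
    scores = {label: m.score(window) for label, m in models.items()}
    return max(scores, key=scores.get), scores
```

This also hints at my question 4 below: online, the natural output is the vector of per-activity log-likelihoods, from which the label with the maximum score can be taken.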

\textbf{Questions:}
1. How do we manage differences in the duration of the actions?
2. In the data processing part, it is mentioned that the window for action recognition is 250 ms, with 50\% overlap with the previous window. What about actions that take longer or different durations, for example a reaching action vs. a sitting action? Some of the "actions" are more like states than actions (e.g., standing or sitting), while others are actions proper, such as reaching or picking.
3. How is the precision in the transient phase between actions?
4. What is the output of the HMM online? I suppose it should be a vector of continuous values for the different actions; how is the window then classified according to this vector?

Aligned with human intention prediction, \cite{Dermy2018} extends the prediction to whole-body movements of humans. Using the data coming from Xsens, they use the $3 \times 23$ Cartesian positions in space of the fixed-base human model. They reduce the dimension of the data using auto-encoders (AE), i.e., they encode the trajectories in a latent space and decode the predicted trajectory from the latent space back to the original high-dimensional space. They model the human motions using probabilistic movement primitives (ProMPs) in the latent space to increase efficiency. In this paper they model 7 different human actions. Moreover, they also applied dimension reduction with \textit{variational time series feature extractors}, but the overall prediction performance with the AE was better with respect to this second method.
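
A minimal PyTorch sketch of the structure of this AE-ProMP pipeline follows; the hidden sizes, latent dimension, and training details are illustrative assumptions, not the architecture used in \cite{Dermy2018}.

```python
import torch
import torch.nn as nn

class TrajectoryAE(nn.Module):
    """Autoencoder compressing one whole-body frame (3 x 23 = 69 Cartesian values)
    into a small latent vector, so that ProMPs can be learned in the latent space."""
    def __init__(self, input_dim=69, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 32), nn.ReLU(),
                                     nn.Linear(32, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                                     nn.Linear(32, input_dim))

    def forward(self, x):
        z = self.encoder(x)           # project each posture frame into the latent space
        return self.decoder(z), z     # reconstruction and latent code

# Usage sketch: encode the demonstrations, fit the ProMPs on the latent trajectories,
# predict the future latent trajectory, then decode it back to the 69-D postures.
ae = TrajectoryAE()
frames = torch.randn(100, 69)         # one demonstration of 100 frames (placeholder data)
reconstruction, latent = ae(frames)
loss = nn.functional.mse_loss(reconstruction, frames)
```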

[1] O. Dermy, F. Charpillet, and S. Ivaldi, "Multi-modal intention prediction with probabilistic movement primitives," in Human Friendly Robotics. Springer, 2019, pp. 181–196.
[2] O. Dermy, A. Paraschos, M. Ewerton, J. Peters, F. Charpillet, and S. Ivaldi, "Prediction of intention during interaction with iCub with probabilistic movement primitives," Frontiers in Robotics and AI, vol. 4, p. 45, 2017.
[3] A. Malaisé, P. Maurice, F. Colas, and S. Ivaldi, "Activity recognition for ergonomics assessment of industrial tasks with automatic feature selection," IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1132–1139, 2019.
[4] O. Dermy, M. Chaveroche, F. Colas, F. Charpillet, and S. Ivaldi, "Prediction of human whole-body movements with AE-ProMPs," in 2018 IEEE-RAS 18th International Conference on Humanoid Robots (Humanoids). IEEE, 2018, pp. 572–579.

kouroshD commented 4 years ago

Since the papers are described here, I close this issue.