The sampling controller needs control actions over a horizon $T = N \cdot \Delta t$. This raises two questions:
How can the RL policy learn control actions over a horizon?
In which form should the control actions be expressed?
For clarification, the control actions are $z = [f, d, F, p]$ with:
$f$ : the leg frequency $\in R^4$
$d$ : the leg duty cycle $\in R^4$
$F$ : the ground reaction forces (GRF) $\in R^{4 \times 3 \times N}$
$p$ : the foot touchdown positions $\in R^{4 \times 2 \times P}$, with $P$ the number of predicted footsteps ($P < N$).
A solution for learning predictions is therefore only needed for $p$ and $F$. For $f$ and $d$, there is no plan to make them vary along the prediction horizon, so they are not a problem.
Form of the predicted control actions
Two main possibilities have been identified so far for $F$ :
Learn a control action explicitly for each predicted time step → this means $N$ predicted control actions, with only the first one actually applied to the system. The learning space would be $4 \times 3 \times N$ actions.
Learn a set of parameters from which a control action is reconstructed at each time step → the simplest form would be a cubic spline with $M$ interpolation knots. The learning space would be $4 \times 3 \times M$ actions, with $M < N$.
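The spline option can be sketched as follows. This is a minimal illustration, assuming the policy outputs one knot value per leg and force axis and that the knots are spread uniformly over the horizon; the shapes and the random stand-in for the policy output are assumptions, not the actual controller interface.

```python
import numpy as np
from scipy.interpolate import CubicSpline

N, M = 20, 5          # horizon steps, spline knots (M < N)
legs, dims = 4, 3     # 4 legs, 3D ground reaction force

# Hypothetical policy output: one knot per leg/axis, shape (legs, dims, M)
knots = np.random.default_rng(0).normal(size=(legs, dims, M))

# Knot times spread over the horizon; query times are the N control steps
t_knots = np.linspace(0.0, 1.0, M)
t_query = np.linspace(0.0, 1.0, N)

# Cubic interpolation along the last axis reconstructs F at every step
spline = CubicSpline(t_knots, knots, axis=-1)
F = spline(t_query)   # shape (legs, dims, N)
```

The policy then only has to learn the $4 \times 3 \times M$ knot values instead of all $4 \times 3 \times N$ forces.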
For $p$, two options have been identified as well:
Learn the $P$ predicted steps explicitly. The learning space would be $4 \times 2 \times P$.
Learn a single step offset and replicate it at each step. The learning space would be $4 \times 2$.
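The offset-replication option amounts to broadcasting one learned $4 \times 2$ offset over the $P$ predicted steps. A minimal sketch, where the concrete offset values are purely illustrative placeholders:

```python
import numpy as np

P = 3                                   # number of predicted footsteps
# Hypothetical learned output: one (x, y) touchdown offset per leg, shape (4, 2)
offset = np.array([[ 0.10,  0.05],
                   [ 0.10, -0.05],
                   [-0.10,  0.05],
                   [-0.10, -0.05]])

# Replicate the single offset at every predicted step -> shape (4, 2, P)
p = np.repeat(offset[:, :, None], P, axis=-1)
```

This collapses the learning space from $4 \times 2 \times P$ to $4 \times 2$, at the cost of forcing every predicted step to share the same offset.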
How can the RL policy learn control actions over a horizon
This is a problem: since only the first action is applied, the other predicted actions have no influence on the simulation, so the algorithm cannot infer anything from them and there would be no learning. However, three ways to modify the problem have been identified to resolve this issue.
Supervised Learning
Prediction mismatch penalized in the cost function
Centroidal model to query the RL policy for single step horizon along the prediction horizon
Supervised Learning
This is probably the best approach, but it would also require the most work. We first learn a good policy for a single-step horizon; this teacher policy is trained with reinforcement learning. Then we create a second policy for a multiple-step horizon that mimics the teacher along that horizon. This student policy is trained with supervised learning.
This approach doesn't really have any drawback except the implementation time. However, because the student learns from a learned policy, it could inherit some artifacts.
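The distillation objective can be sketched as below. This assumes linear maps as stand-ins for both policies and a plain MSE imitation loss; `teacher`, `distillation_loss`, and all shapes are hypothetical names for illustration, not the actual training code.

```python
import numpy as np

rng = np.random.default_rng(1)
obs_dim, act_dim, N = 12, 4, 10   # observation size, action size, horizon

# Teacher: single-step policy (assumed already trained with RL);
# a random linear map stands in for it here.
W_teacher = rng.normal(size=(act_dim, obs_dim))
def teacher(obs):
    return W_teacher @ obs

# Student: predicts all N actions of the horizon from one observation
W_student = rng.normal(size=(act_dim * N, obs_dim))

def distillation_loss(obs_traj, W_s):
    """MSE between the student's N actions (queried once, at the first
    observation) and the teacher's action at each step of the rollout."""
    targets = np.stack([teacher(o) for o in obs_traj])   # (N, act_dim)
    preds = (W_s @ obs_traj[0]).reshape(N, act_dim)      # (N, act_dim)
    return np.mean((preds - targets) ** 2)

obs_traj = rng.normal(size=(N, obs_dim))   # observations along one rollout
loss = distillation_loss(obs_traj, W_student)
# Minimizing this loss over W_student is ordinary supervised learning.
```

The point is that the student's later predicted actions get a learning signal from the teacher's rollout, which pure RL on the first applied action cannot provide.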
Prediction mismatch penalized in the cost function
We learn a policy for a multiple-time-step horizon, with only the first predicted action applied to the system. However, we penalize the mismatch between a predicted action at a time step $> 1$ and the action actually applied to the system at that time in the future. Of course, we cannot penalize immediately: we first need to simulate the system and query new actions before the mismatch can be computed. For example, this means the policy at time $t = 7$ will be penalized for an action predicted by the policy at $t = 4$.
The implementation should be fairly simple, but the effectiveness is largely unknown.
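The bookkeeping for this delayed penalty can be sketched as follows: each step, the predictions for future steps are stored stamped with their target time, and once that time is reached they are compared against the action actually applied. The buffer structure and the random stand-in for the policy output are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 4                  # prediction horizon (steps)

pending = {}           # target time step -> predictions made for that step
penalties = []

for t in range(10):
    # Hypothetical policy output: N predicted actions; only the first applies
    predicted = rng.normal(size=(N, 3))
    applied = predicted[0]

    # Penalize every earlier prediction that targeted the current step t
    for old_pred in pending.pop(t, []):
        penalties.append(float(np.sum((old_pred - applied) ** 2)))

    # Stamp the remaining predictions with the future step they refer to
    for k in range(1, N):
        pending.setdefault(t + k, []).append(predicted[k])
```

Each mismatch term would then be added (with some weight) to the reward or cost at the step where it becomes computable, i.e. $N - 1$ steps after the prediction was made at the latest.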
Centroidal model to query the RL policy for single step horizon along the prediction horizon
We learn a policy for a single-time-step horizon. We then use the centroidal model to simulate the system forward, query the policy again to obtain the action at the next time step, and so on.
Again, the implementation is fairly simple, and the effectiveness depends on the accuracy of the centroidal model. A major drawback is that the observation space is limited to the variables the centroidal model computes. In practice this means the joint positions and velocities cannot be used.
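The rollout loop can be sketched as below, using a deliberately crude centroidal model (CoM position and velocity under gravity plus total GRF) and a random linear map as a stand-in for the single-step policy; all names, shapes, and constants are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
dt, N = 0.02, 5                    # integration step, horizon length
mass = 20.0
g = np.array([0.0, 0.0, -9.81])

# Single-step policy stand-in: maps centroidal state to a total GRF
W = rng.normal(size=(3, 6), scale=0.1)
def policy(state):                 # state = [CoM position, CoM velocity]
    return W @ state

# Roll the centroidal model forward, querying the policy at every step.
# Only the centroidal state is available here, which is why joint
# positions/velocities cannot appear in the observation.
state = np.zeros(6)
actions = []
for _ in range(N):
    F_total = policy(state)
    actions.append(F_total)
    acc = g + F_total / mass                              # Newton on the CoM
    state = np.concatenate([state[:3] + dt * state[3:],   # integrate position
                            state[3:] + dt * acc])        # integrate velocity

actions = np.stack(actions)        # N actions along the horizon, shape (N, 3)
```

This yields a full horizon of actions from a single-step policy, at the price of the model mismatch between the centroidal rollout and the real dynamics.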