Batou1406 / dls_orbit_bat_private

Unified framework for robot learning built on NVIDIA Isaac Sim
https://isaac-orbit.github.io/orbit/

Follow up With Giulio #17

Closed Batou1406 closed 2 months ago

Batou1406 commented 4 months ago

Admin Task

Model Base Task

The sampling controller needs control actions along a horizon $T = N \cdot \Delta t$. This raises two questions:

For clarification, the control actions are $z=[f,d,F,p]$ with:

So it is $p$ and $F$ that require a solution for learning a prediction over the horizon. For $f$ and $d$, there is no plan to make them vary along the prediction horizon, so they are not a problem.

Form of the predicted control actions

Two main possibilities have been identified so far for $F$:

For $p$,

How can the RL policy learn control actions along a horizon

This is a problem: since only the first action is applied, the other predicted actions have no influence on the simulation, so the algorithm cannot infer anything about them and there would be no learning (see the sketch after the list below). However, three ways to modify the problem have been identified to resolve this issue.

  1. Supervised Learning
  2. Prediction mismatch penalized in the cost function
  3. Centroidal model to query the RL policy for single step horizon along the prediction horizon
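
To make the issue concrete, here is a minimal sketch with hypothetical shapes and a placeholder policy (not the repository's actual interface): only the first predicted step ever reaches the simulator.

```python
# Illustrative sketch only: hypothetical policy and dimensions, not the repo's API.
import torch

N, ACTION_DIM, OBS_DIM = 5, 12, 48           # horizon length and example dimensions
policy = torch.nn.Linear(OBS_DIM, N * ACTION_DIM)

obs = torch.zeros(OBS_DIM)
predicted = policy(obs).view(N, ACTION_DIM)  # actions for steps 0 .. N-1 of the horizon
applied = predicted[0]                       # only step 0 is sent to the simulator
# Steps 1 .. N-1 never influence the rollout, so RL receives no learning signal for them.
```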

Supervised Learning

This is probably the best approach but would also require the most work. We first learn a good policy for a single-step horizon; this is the teacher policy, trained with Reinforcement Learning. Then we create a second policy, for a multiple-step horizon, that mimics the teacher along that horizon. This student policy is trained with supervised learning. This approach doesn't really have any drawback except the implementation time. Maybe, because it learns from a learned policy, it could inherit some artefacts.
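
A minimal sketch of the teacher-student distillation step, assuming PyTorch; the networks, dimensions, and the `distillation_step` helper are hypothetical placeholders, not the repository's code.

```python
# Hedged sketch: networks, dimensions and helper below are placeholders.
import torch
import torch.nn as nn

N = 5            # prediction horizon length (number of steps)
ACTION_DIM = 12  # dimension of one single-step action z = [f, d, F, p]
OBS_DIM = 48

# Teacher: single-step policy already trained with RL, frozen during distillation.
teacher = nn.Sequential(nn.Linear(OBS_DIM, 256), nn.ELU(), nn.Linear(256, ACTION_DIM))
teacher.requires_grad_(False)

# Student: predicts the full horizon of N actions from the current observation.
student = nn.Sequential(nn.Linear(OBS_DIM, 256), nn.ELU(), nn.Linear(256, N * ACTION_DIM))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

def distillation_step(obs_traj: torch.Tensor) -> torch.Tensor:
    """obs_traj: (batch, N, OBS_DIM), observations along a rolled-out trajectory.

    The teacher is queried at every step of the trajectory; the student must
    reproduce the whole action sequence from the first observation only.
    """
    with torch.no_grad():
        target = teacher(obs_traj)                        # (batch, N, ACTION_DIM)
    pred = student(obs_traj[:, 0]).view(-1, N, ACTION_DIM)
    loss = nn.functional.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```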

Prediction mismatch penalized in the cost function

We learn a policy for a multiple-time-step horizon, with only the first predicted action applied to the system. However, we penalize the mismatch between an action predicted at a time step > 1 and the action actually applied to the system at that time in the future. Of course, we cannot penalize immediately: we need to simulate the system and query new actions before being able to penalize the mismatch. This means, for example, that the reward at time $t=7$ will penalize an action predicted by the policy at $t=4$. The implementation should be fairly simple, but the effectiveness is largely unknown.
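
A minimal sketch of how this penalty could be bookkept, assuming PyTorch; the buffer layout, the `mismatch_penalty` helper, and the weight are hypothetical, not the actual reward implementation.

```python
# Hedged sketch: buffer layout, helper name and weight are placeholders.
from collections import deque
import torch

N = 5  # prediction horizon length

# FIFO of the horizons predicted at the previous steps (newest first).
prediction_buffer: deque = deque(maxlen=N)

def mismatch_penalty(predicted_horizon: torch.Tensor,
                     applied_action: torch.Tensor,
                     weight: float = 0.1) -> torch.Tensor:
    """predicted_horizon: (N, action_dim), horizon predicted at the current step.
    applied_action: (action_dim,), action actually applied at the current step.

    Compares what past policies predicted for "now" with what is applied now;
    the result would be subtracted from the reward at the current step.
    """
    penalty = torch.zeros(())
    for k, past_pred in enumerate(prediction_buffer, start=1):
        # past_pred was produced k steps ago, so its entry k targets the current step.
        if k < past_pred.shape[0]:
            penalty = penalty + torch.norm(past_pred[k] - applied_action)
    prediction_buffer.appendleft(predicted_horizon.detach())
    return weight * penalty
```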

Centroidal model to query the RL policy for single step horizon along the prediction horizon

We learn a policy for a single-time-step horizon. Then we use the centroidal model to simulate the system forward, which lets us query the policy again to obtain the action at the next time step, and so on. Again, the implementation is fairly simple, and the effectiveness depends on the accuracy of the centroidal model. A main drawback is that the observation space is limited to the variables the centroidal model computes. In practice, this means the joint positions and velocities cannot be used.
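
A minimal sketch of this rollout, where `extract_obs` and `centroidal_step` are hypothetical stand-ins for the observation construction and the centroidal dynamics, not the repository's code.

```python
# Hedged sketch: extract_obs and centroidal_step are placeholders.
import torch

N = 5      # number of single-step queries along the horizon
DT = 0.02  # controller time step

def extract_obs(state: torch.Tensor) -> torch.Tensor:
    # Placeholder: build the observation from the centroidal state only
    # (base pose/velocity, feet positions); joint states are not available.
    return state

def centroidal_step(state: torch.Tensor, action: torch.Tensor, dt: float) -> torch.Tensor:
    # Placeholder: integrate the centroidal dynamics one step under the
    # predicted contact forces F and touch-down positions p.
    return state

def rollout_horizon(policy, state: torch.Tensor) -> list[torch.Tensor]:
    """Query the single-step policy N times, propagating a centroidal state."""
    actions = []
    for _ in range(N):
        obs = extract_obs(state)
        action = policy(obs)                        # single-step action z = [f, d, F, p]
        actions.append(action)
        state = centroidal_step(state, action, DT)  # approximate forward simulation
    return actions
```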