Batou1406 / dls_orbit_bat_private

Unified framework for robot learning built on NVIDIA Isaac Sim
https://isaac-orbit.github.io/orbit/

Climb Terrain Curriculum - Can't progress to harder terrain #20

Open Batou1406 opened 5 months ago

Batou1406 commented 5 months ago

Climb Terrain Curriculum - Can't progress to harder terrain

Context

The objective is to train a policy able to climb up and down stairs. For that, a terrain called STAIRS_TERRAINS_CFG has been prepared. It consists of successive pyramids (or inverted pyramids), called sub-terrains, with increasingly larger step sizes. The robot should learn how to traverse these sub-terrains. The command is mainly a forward speed; we are not interested in omnidirectional walking.

Parameters

Problem

The robot struggles to progress to harder sub-terrains and is often stuck at the border between sub-terrains. After careful debugging, it does not seem to be a terrain issue (bug), but rather a policy/configuration issue (tuning). The policy simply did not learn how to transition between sub-terrains. This makes some sense: if the robot was going downstairs, it then needs to go upstairs, which, for terrains with a big step size, is quite a radical change. Almost all the weight is on the front legs during a descent, and transitioning would require very large torques/forces and maybe some kind of manoeuvre that may not be beneficial in terms of the cost function. Please note that this is observed only in the pyramidal terrain. The inverted pyramid does not suffer as much from this problem, since the transition is easier there.


Moreover, another problem is how the curriculum is defined. In order to progress in the terrain there are two conditions:

- to move up, the robot must have walked at least 50% of the sub-terrain distance (i.e. reached the sub-terrain border);
- to avoid moving down, its average speed over the episode must stay sufficiently close to the commanded speed.

The problem comes from how progress is defined. Walking 50% of the terrain distance simply means that the robot has reached the sub-terrain border. Since it struggles to transition between terrains, this limits the progress. In other words, the robot has reached the terrain border, so it has successfully made its way through the terrain, but due to how the curriculum is computed, it does not make it to a harder terrain. In addition, the maximum forward speed is set to $0.5[\frac{m}{s}]$, which means it would reach the border in $10[s]$ and then be stuck for $10[s]$, given the total $20[s]$ duration of an episode. The average speed would then be ~50% of the commanded speed, so the robot may even regress even though it made it to the border. One may consider changing the episode duration or the maximal forward speed.

Finally, a last problem is that the robot may progress in the terrain even though it fell. It could make a large progress in the terrain just because it fell down the stairs. Advancing the difficulty in this case should be avoided.

Solution

Several solutions exist for these problems, but they may end up producing different behaviours.

1. Tuning the terrain configuration to make the sub-terrain borders traversable

One could put a flat border between terrains. This would ease the sub-terrain transition, but it would also decrease the problem difficulty, which may not be what we want.

2. Tune the cost function to make the robot traverse sub-terrain borders

However, this may be complex and not desirable.

3. Change the progress and regress conditions in the curriculum term

One could decrease the threshold on the distance walked, so that the robot progresses when it is close to the border, and not only once it has crossed it.

4. Change the commanded velocity

If the robot is told to go diagonally, it could make greater progress before reaching the border, thus slightly changing the problem.

5. Make robots that fall regress in the terrain

Straightforward and should be implemented.

6. Reduce the episode length

This would avoid the problem of regressing because the robot was stuck at the border. Moreover, this may be beneficial for the training wall-clock time. It would be similar to increasing the sub-terrain size.

Batou1406 commented 5 months ago

Modification in the Terrain Curriculum

  1. A flat border between terrains has been added -> easing the terrain-border traversability
  2. The threshold for progress has been decreased from 100% of the distance to the border to 80%.
  3. The episode length has been reduced from 20s to 12s, but will be set to 15s.
  4. The distance used to progress in the terrain was computed as the 2D distance (XY world plane). However, in this kind of terrain, with a very large height difference between start and finish, the 3D distance is more representative of the progress.
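For reference, a minimal sketch of the updated progress check in torch, assuming start and current base positions are available per environment (the function and argument names are illustrative, not the actual repo code):

```python
import torch

def terrain_progressed(start_pos: torch.Tensor,
                       cur_pos: torch.Tensor,
                       sub_terrain_size: float,
                       threshold: float = 0.8) -> torch.Tensor:
    """Return a boolean mask of environments that earned a harder terrain.

    start_pos, cur_pos: (num_envs, 3) world-frame base positions.
    sub_terrain_size:   edge length of one sub-terrain [m]; the robot
                        spawns at its center, so the border is size/2 away.
    threshold:          fraction of the distance-to-border counted as
                        success (0.8 after the change above).
    """
    # 3D displacement: on stairs the height difference between start and
    # current position is large, so the 3D norm reflects progress better
    # than the XY distance alone.
    distance = torch.norm(cur_pos - start_pos, dim=1)
    return distance > threshold * (sub_terrain_size / 2.0)
```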

New friction constraints

Previously, a friction cone was used to compute a penalty for violations that would result in slipping. However, I believe this wasn't as effective as intended. Instead, I'll try a new constraint that penalizes foot displacement (i.e. foot speed) while the foot is supposed to be in contact (i.e. when the model base variable 'c' is equal to 0).
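A minimal sketch of such a penalty, assuming foot velocities and the contact variable c are available as torch tensors (names are illustrative):

```python
import torch

def foot_slip_penalty(foot_vel: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    """Penalize tangential foot motion while the foot should be in stance.

    foot_vel: (num_envs, num_feet, 3) foot linear velocities [m/s].
    c:        (num_envs, num_feet) contact variable; per the convention
              above, c == 0 marks a foot that is supposed to be in contact.
    """
    stance = (c == 0).float()
    # Squared XY speed of each foot, counted only for stance feet.
    slip = torch.sum(foot_vel[..., :2] ** 2, dim=-1)
    return torch.sum(slip * stance, dim=1)
```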

New tracking reward

Previously, a reward was given for tracking a velocity in the XY plane. However, this was problematic since the changes of slope in the terrain are so big that, from the robot's perspective, it seemed it needed to walk into a wall. Instead, the reward is now given for the average velocity from the terrain origin; the velocity command is thus not really needed anymore.
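A minimal sketch of this reward, assuming the terrain origin and the elapsed episode time are available per environment (the exponential kernel and all names here are assumptions for illustration):

```python
import torch

def avg_velocity_reward(cur_pos: torch.Tensor,
                        origin: torch.Tensor,
                        elapsed_time: torch.Tensor,
                        target_speed: float,
                        std: float) -> torch.Tensor:
    """Reward the average speed of travel away from the terrain origin.

    cur_pos, origin: (num_envs, 3) world-frame positions.
    elapsed_time:    (num_envs,) seconds since the episode started.
    """
    # Average speed = distance covered from the origin / elapsed time.
    avg_speed = torch.norm(cur_pos - origin, dim=1) / elapsed_time.clamp(min=1e-3)
    # Exponential kernel peaking when the average speed hits the target.
    return torch.exp(-((avg_speed - target_speed) ** 2) / std**2)
```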

Batou1406 commented 5 months ago

FuM - Giulio's update

Add more terrain randomization

The way 'orbit' implemented the terrain curriculum is: each environment always trains on a terrain of exactly its current difficulty level.

This means that the robot won't see the easy terrain anymore as it progresses in the curriculum and, if it is stuck at difficulty 5, it will only see terrain of that difficulty.

Instead, I propose another way to progress in the curriculum: with probability p an environment trains on its hardest unlocked difficulty, and otherwise on a difficulty sampled at random among the levels unlocked so far (see the sketch below).

This would lead to greater randomization of the terrain while maintaining sufficient exploration of the higher-difficulty terrain (modulated by the parameter p).
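A minimal sketch of such a sampling rule in torch, assuming a per-environment tensor of unlocked levels (the names and the exact sampling scheme are my reading of the proposal above):

```python
import torch

def sample_difficulty(max_level: torch.Tensor, p: float) -> torch.Tensor:
    """Sample a terrain difficulty level per environment.

    max_level: (num_envs,) long tensor, highest level unlocked so far.
    p:         probability of training on the hardest unlocked level;
               otherwise a level is drawn uniformly from [0, max_level],
               so easy terrain stays in the mix.
    """
    num_envs = max_level.shape[0]
    # Uniform integer level in [0, max_level] for each environment.
    rand = torch.rand(num_envs, device=max_level.device)
    rand_level = (rand * (max_level + 1).float()).long()
    # With probability p, override with the hardest unlocked level.
    use_max = torch.rand(num_envs, device=max_level.device) < p
    return torch.where(use_max, max_level, torch.minimum(rand_level, max_level))
```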

Friction Constraints

We agreed that the foot displacement penalty (for legs in stance) is better than the friction cone violation penalty, since it doesn't rely on the friction coefficient µ. In addition, we did not yet agree on whether it is better to enforce the constraint or only penalize violations during training (at deployment time, we can certainly enforce the constraint).

However, actually enforcing the constraint is not straightforward and may be infeasible. For example, if the robot is going downward with its CoM above the two front legs, it can generate a large $F_{xy}$ without violating the friction cone, which would make the two hind legs slip.

Finally, it has been decided to implement the function that enforces the constraint and to train two policies to evaluate the difference.
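One simple per-foot enforcement is to rescale the tangential force so it stays inside the cone; a minimal sketch, assuming per-foot ground-reaction forces with z up (note that this per-foot projection does not address the multi-contact failure case described above):

```python
import torch

def enforce_friction_cone(forces: torch.Tensor, mu: float) -> torch.Tensor:
    """Project per-foot contact forces into the friction cone.

    forces: (num_envs, num_feet, 3) ground-reaction forces, z up.
    mu:     friction coefficient.
    The tangential component is rescaled so that ||F_xy|| <= mu * F_z.
    """
    f_z = forces[..., 2].clamp(min=0.0)
    f_xy = forces[..., :2]
    f_xy_norm = torch.norm(f_xy, dim=-1, keepdim=True).clamp(min=1e-6)
    # Scale factor is 1 inside the cone, < 1 outside.
    scale = torch.clamp(mu * f_z.unsqueeze(-1) / f_xy_norm, max=1.0)
    return torch.cat([f_xy * scale, forces[..., 2:]], dim=-1)
```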

Tracking reward: Velocity or position: how to do it?

Open question that needs an answer.

Batou1406 commented 5 months ago

The friction cone constraint has been implemented and tested over some values. As expected, it doesn't fully prevent the legs from slipping.

Batou1406 commented 5 months ago

Tracking reward: Velocity or position: how to do it?

We stick to the velocity command, but we try to allow a wider range of speeds for maximal reward (like a plateau), to allow some flexibility in the speed tracking. The size of the plateau should vary with the terrain difficulty.

Batou1406 commented 5 months ago

Tracking Reward: Velocity or Position: How to do it?

We've decided to stick to velocity tracking, since this is the easiest for the problem definition, and we did not find any good formulation with position tracking.

However, we decided to relax the constraint on speed tracking proportionally to the terrain difficulty and the required speed. For this I created a new reward function.

Originally, the velocity was tracked with an exponential kernel, which gives a reward of 1 if the robot's speed equals the desired speed and decays exponentially to 0, with standard deviation $std$, as they diverge: $$e^{-\frac{\|\vec v_{rob} - \vec v_{desired}\|^2}{std^2}}$$

New soft exponential kernel

This new kernel aims to relax the constraint on speed tracking and allow the robot to obtain the maximal reward over a larger range of speeds. Within some tolerance, the robot obtains the maximum reward, which should give it more freedom in speed tracking on challenging obstacles.

Tolerance

The new function aims to relax the constraint according to the terrain difficulty and the commanded speed. For this we define the parameter tolerance: $$tolerance = \alpha \cdot \|\vec v_{desired}\| \cdot difficulty$$ with $\alpha$ a tuning parameter.

Relaxing direction

Then, we aim to relax the constraint on the robot speed only in the direction of the desired speed. For that, we project the robot speed onto the desired speed: $$\vec v_{rob,xy} = (v_x, v_y) \to \vec v_{rob,x'y'} = (v_{x'}, v_{y'})$$ with $(x', y')$ the new axes, where $x'$ is parallel to $\vec v_{cmd}$ and $y'$ is perpendicular to $\vec v_{cmd}$.

With this new formulation, we can compute the speed tracking error as two terms:

- the forward speed error: $\|\vec v_{cmd}\| - \|\vec v_{rob}\| \cos\theta$
- the lateral speed error: $\|\vec v_{rob}\| \sin\theta$

with $\theta$ the angle between $\vec v_{cmd}$ and $\vec v_{rob}$.

Relaxing

With the tolerance parameter and the relaxing direction, one can then relax the constraint on the forward speed tracking error with a piecewise function; a natural choice, consistent with the plateau described above, is $$relaxed\ forward\ speed\ error = \max(0,\; |forward\ speed\ error| - tolerance)$$ so that errors inside the tolerance band are not penalized at all.

Bringing everything together

Finally, one can compute the exponential kernel as usual: $$e^{-\frac{(relaxed\ forward\ speed\ error)^2 + (lateral\ speed\ error)^2}{std^2}}$$

This function has the benefit of remaining continuous and differentiable on $\mathbb{R}^2$.
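Bringing the pieces together, a minimal torch sketch of the full soft kernel, assuming the max(0, |e| − tolerance) relaxation above (all names are illustrative):

```python
import torch

def soft_velocity_tracking_reward(v_rob: torch.Tensor,
                                  v_cmd: torch.Tensor,
                                  difficulty: torch.Tensor,
                                  alpha: float,
                                  std: float) -> torch.Tensor:
    """Soft exponential kernel for velocity tracking (sketch).

    v_rob, v_cmd: (num_envs, 2) robot / commanded XY velocities.
    difficulty:   (num_envs,) current terrain difficulty (float).
    """
    v_cmd_norm = torch.norm(v_cmd, dim=1).clamp(min=1e-6)
    # Tolerance grows with the commanded speed and the terrain difficulty.
    tolerance = alpha * v_cmd_norm * difficulty
    # Project the robot velocity onto the command direction (x') and its
    # perpendicular (y').
    dir_cmd = v_cmd / v_cmd_norm.unsqueeze(1)
    v_forward = torch.sum(v_rob * dir_cmd, dim=1)  # ||v_rob|| cos(theta)
    v_lateral = (v_rob[:, 1] * dir_cmd[:, 0]
                 - v_rob[:, 0] * dir_cmd[:, 1])    # ||v_rob|| sin(theta)
    # Relax only the forward error, inside the tolerance band.
    forward_error = torch.clamp((v_cmd_norm - v_forward).abs() - tolerance,
                                min=0.0)
    return torch.exp(-(forward_error**2 + v_lateral**2) / std**2)
```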

Visualization

[figures omitted]

Function visualization

[figure omitted]

Batou1406 commented 5 months ago

Update

Batou1406 commented 5 months ago

Task Results

I trained a new policy with fewer weights and the latest implementation → and it works very well! The weights are:

[attachments omitted: good_climb_few_w2, good_climb_few_w2_2, climbup1]

Curriculum Improvement

However, the curriculum kind of stops around level 4-5, and I believe it could do better! [figure omitted]

Maybe it is just not progressing sufficiently quickly in the terrain, given the episode length, for it to reach the success condition. This may make sense: with harder terrain and the soft kernel, more flexibility is given on the speed tracking, and the robot will indeed go a bit slower. Also, the terrains are quite big with respect to the 'standard' ones.

One option would be to make it progress to harder terrain as long as it doesn't fall, or to make it progress after a shorter distance traveled.

Batou1406 commented 4 months ago

There was a mistake in the way I sampled between a random terrain difficulty and the max difficulty. This has now been fixed.

Moreover, I changed the curriculum thresholds for the climb terrain. Now:

Increase difficulty if:
Decrease difficulty if: