Define explicitly the Observation space

Context

Some history is now provided to the observation.space (i.e. the input of the network); it is queried with the default function mdp.last_action. This means the action.space is always embedded in the observation.space.
Problem
Changing the type of network also changes the type of action.space. For example:
- The expert policy without optimizer outputs action='discrete' with time_horizon=1, which results in action.space.ndim=32.
- The student policy with optimizer outputs action='spline' with time_horizon=15, which results in action.space.ndim=104.

This in turn changes the observation.space, because mdp.last_action is part of the observation.space.
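A minimal sketch of why this is a problem, assuming the observation is built by concatenating state features with mdp.last_action (the helper name build_observation and the state-feature size 48 are illustrative, not from the codebase):

```python
import numpy as np

# Hypothetical sketch: if the observation concatenates state features with
# mdp.last_action, the observation size depends on action.space.ndim.
def build_observation(state: np.ndarray, last_action: np.ndarray) -> np.ndarray:
    return np.concatenate([state, last_action])

state = np.zeros(48)  # assumed state-feature size, purely illustrative

obs_expert = build_observation(state, np.zeros(32))    # discrete, ndim=32
obs_student = build_observation(state, np.zeros(104))  # spline, ndim=104

# The two policies end up with incompatible observation sizes.
print(obs_expert.shape, obs_student.shape)  # (80,) (152,)
```

This incompatibility is what prevents, for instance, feeding the expert policy's observations to the student network.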
Solution
Instead of blindly putting the network output into the observation.space, one could put the actual output applied to the system into the observation space (possibly with a history size > 1). However, the actual output lives in some specific frame (in this case the world frame) that may not be relevant for the network. A Normalisation $N(x)$ and a Transformation $T(x)$ are applied to the network output before it reaches the space relevant for the low-level controller. The inverse transformation and normalisation should therefore be applied to the actual output before it is fed back into the network's observation.space.
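The round trip can be sketched as follows; the affine normalisation, the rotation $R$, and the offset $t$ below are illustrative assumptions, not the actual $N$ and $T$ of the codebase:

```python
import numpy as np

# Hypothetical sketch of the pipeline: network output x -> N(x) -> T(.)
# -> command applied to the system, and the inverse mapping back.
mean, std = np.array([0.0, 0.0]), np.array([2.0, 4.0])  # assumed affine N
R = np.array([[0.0, -1.0], [1.0, 0.0]])                 # assumed world-frame rotation
t = np.array([1.0, -1.0])                               # assumed offset

def N(x):      # normalisation applied to the raw network output
    return x * std + mean

def N_inv(y):  # inverse normalisation
    return (y - mean) / std

def T(y):      # transform into the world frame for the low-level controller
    return R @ y + t

def T_inv(z):  # inverse transform (R is orthogonal, so R^-1 = R.T)
    return R.T @ (z - t)

x = np.array([0.3, -0.7])          # raw network output
applied = T(N(x))                  # actual command applied to the system
recovered = N_inv(T_inv(applied))  # value fed back into observation.space

print(np.allclose(recovered, x))   # True: the round trip recovers the output
```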
[x] Define inverse Transformation $T^{-1}(x)$
[x] Define inverse Normalisation $N^{-1}(x)$
[x] Add $N^{-1}(T^{-1}(action_i))$ to the observation.space, with $action_i$ the last discrete action applied to the system.
[ ] (Optional) Implement a rolling buffer for a variable action history length > 1
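The optional rolling buffer could be sketched with a fixed-length deque; the names and the values HISTORY and ACTION_DIM below are illustrative:

```python
from collections import deque

import numpy as np

# Hypothetical sketch of a rolling action-history buffer, pre-filled with
# zeros so the observation size is fixed from the first step.
HISTORY = 3     # illustrative history length
ACTION_DIM = 4  # illustrative action dimension

buf = deque([np.zeros(ACTION_DIM)] * HISTORY, maxlen=HISTORY)

def push_action(a: np.ndarray) -> np.ndarray:
    """Append the latest (inverse-mapped) action and return the flat history."""
    buf.append(a)  # oldest entry is dropped automatically by maxlen
    return np.concatenate(buf)  # shape (HISTORY * ACTION_DIM,)

hist = push_action(np.ones(ACTION_DIM))
print(hist.shape)  # (12,)
```

The deque's maxlen takes care of evicting the oldest action, so the flattened history always has a constant size regardless of how many steps have elapsed.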