carlosferrazza / humanoid-bench


How to convert qpos to actions? #21

Closed Xianqi-Zhang closed 1 month ago

Xianqi-Zhang commented 3 months ago

Hi, thanks for sharing this work. I am very sorry to bother you so many times, but I have been following this work for a long time.

My work requires explicitly estimating a fine-tuning residual from the current pose, so I need to convert qpos into an action. Your previous reply said that the robot in the env is position-controlled, so I tried selecting the body-related variables in qpos as actions, but it does not seem right.

To achieve this conversion (qpos -> action), I first execute some actions and record data, and then execute the actions generated from qpos, as follows:

    # * Dims related to body parts in self._env.named.data.qpos.
    action = torch.zeros((61))
    action[:16] = ob['proprio'][:16]
    action[40:45] = ob['proprio'][40:45]

I also tried adding normalize_action()/unnormalize_action(), but neither seems right, because the two simulation results are very different:

    action = torch.zeros((61))
    action[:16] = ob['proprio'][:16]
    action[40:45] = ob['proprio'][40:45]
    # * action = env.task.task.unnormalize_action(action)
    action = env.task.task.normalize_action(action)

Test settings:

    parser.add_argument("--policy_type", default=None)
    parser.add_argument('--obs_wrapper', default='True')
    parser.add_argument('--sensors', default='image')

So the test env uses ObservationWrapper, and the action is unnormalized before being executed:

https://github.com/carlosferrazza/humanoid-bench/blob/376aec79a85ea5df855d1e0e72b00c750baafd7b/humanoid_bench/wrappers.py#L564-L565
https://github.com/carlosferrazza/humanoid-bench/blob/376aec79a85ea5df855d1e0e72b00c750baafd7b/humanoid_bench/tasks.py#L59-L61

I think the qpos-derived action needs to be normalized (the opposite of unnormalize), but the visualized result of unnormalize_action looks better than that of normalize_action. This confuses me.
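For reference, here is what I understand the two maps to be (a minimal sketch, assuming the standard affine rescaling between the normalized action space [-1, 1] and the actuator ctrlrange; `low`/`high` are placeholders for the per-actuator bounds, not the exact names used in tasks.py):

    import numpy as np

    # Assumed affine maps between [-1, 1] and the control range [low, high].
    def unnormalize(action, low, high):
        # [-1, 1] -> [low, high]; applied by the env right before stepping.
        return low + (action + 1.0) * 0.5 * (high - low)

    def normalize(ctrl, low, high):
        # [low, high] -> [-1, 1]; the inverse map, for qpos-derived targets.
        return 2.0 * (ctrl - low) / (high - low) - 1.0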

A frame of my test sequence (Frame 05), with screenshots for: src action, directly-qpos, normalized-qpos, and unnormalized-qpos.

I am unsure about the following:

  1. How should qpos be converted to actions? Which one is more reasonable (directly-qpos, normalized-qpos, or unnormalized-qpos)? Or is there any other advice?
  2. Is the difference between executing directly-qpos and the original action due to gravity and some randomness in the simulator? (It seems that when an action is executed, the simulation does not run until the robot and objects are stationary; it only simulates a fixed number of steps. When performing the same action repeatedly, the visualization usually shows a gradual fall; a minimal check is sketched below.)
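Minimal repeat-the-same-action check (a sketch, assuming the standard Gymnasium step API; the zero action is only a placeholder):

    import numpy as np

    # Repeating one action does not let the system settle: each env.step
    # only advances a fixed number of simulator substeps.
    obs, info = env.reset()
    action = np.zeros(env.action_space.shape)
    for _ in range(100):
        obs, reward, terminated, truncated, info = env.step(action)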

Thanks for any reply.

Best regards.

carlosferrazza commented 1 month ago

Hi! I think you are getting confused by the order of joints and actuators in the xml. Try the following code in the test_env.py file:

    import numpy as np

    # Build a map from joint name to the corresponding qpos index
    # (a free joint occupies 7 qpos entries: 3 position + 4 quaternion).
    action_raw = np.zeros(env.action_space.shape)
    joint_to_action = {}
    offset = 0
    for i in range(env.model.njnt):
        print(f"joint {i}: {env.model.joint(i).name}")
        jnt_name = env.model.joint(i).name
        if jnt_name.startswith("free"):
            joint_to_action[jnt_name] = range(i + offset, i + offset + 7)
            offset += 6
        else:
            joint_to_action[jnt_name] = i + offset

    # Read each actuated joint's current qpos as the raw position target;
    # the hand actuators (lh*/rh*) are set to zero.
    for i in range(env.model.nu):
        print(f"actuator {i}: {env.model.actuator(i).name}")
        act_name = env.model.actuator(i).name
        if act_name.startswith("lh") or act_name.startswith("rh"):
            action_raw[i] = 0.0
        else:
            action_raw[i] = env.data.qpos[joint_to_action[act_name]]

Then, you just need to normalize:

    action = env.task.normalize_action(action_raw)

Note that I am setting the hand joints to zero above.
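Putting it together, a hold-pose loop could look like this (a sketch reusing `joint_to_action` from above and assuming the standard Gymnasium step API):

    # Sketch: recompute the qpos-derived target each step and hold the pose.
    for _ in range(100):
        action_raw = np.zeros(env.action_space.shape)
        for i in range(env.model.nu):
            act_name = env.model.actuator(i).name
            if act_name.startswith("lh") or act_name.startswith("rh"):
                action_raw[i] = 0.0
            else:
                action_raw[i] = env.data.qpos[joint_to_action[act_name]]
        action = env.task.normalize_action(action_raw)
        obs, reward, terminated, truncated, info = env.step(action)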

Xianqi-Zhang commented 1 month ago

Hi, thank you for your reply. I have tested your code and found the difference between the joint order and the actuator order (i.e., the right-arm part).

action_src -> env -> qpos -> action_rec (recovered from qpos)

With your code, action_rec is almost the same as action_src, e.g.,

    action_rec
    [ 0.006953  0.002198 -0.031055 -0.078634 -0.321824  0.017591  0.017056
     -0.032843 -0.076613 -0.335486  0.001403  0.00309  -0.802331 -0.550045
     -0.353914 -0.828963 -0.002263  0.804218  0.545445 -0.350416 -0.816377
      0.5       0.176471  0.       -1.        0.        0.       -0.714287
      0.       -0.714287 -1.        0.       -0.714287 -1.        0.
     -0.714287 -1.       -1.        0.       -0.714287 -1.        0.5
      0.176471  0.       -1.        0.        0.       -0.714287  0.
     -0.714287 -1.        0.       -0.714287 -1.        0.       -0.714287
     -1.       -1.        0.       -0.714287 -1.      ]

    action_src
    [ 0.00618   0.002033 -0.031263 -0.078371 -0.321913  0.017153  0.016218
     -0.031071 -0.086236 -0.326936  0.001373  0.003056 -0.802221 -0.550039
     -0.354205 -0.828961 -0.00232   0.804327  0.545481 -0.350839 -0.816365
      0.5       0.176471  0.       -1.        0.        0.       -0.714287
      0.       -0.714287 -1.        0.       -0.714287 -1.        0.
     -0.714287 -1.       -1.        0.       -0.714287 -1.        0.5
      0.176471  0.       -1.        0.        0.       -0.714287  0.
     -0.714287 -1.        0.       -0.714287 -1.        0.       -0.714287
     -1.       -1.        0.       -0.714287 -1.      ]

But for randomly selected actions, they are different:

    action_src = env.action_space.sample()
    action_src[21:] = 0  # * Set hand positions to 0.

    action_rec
    [ 0.001173  0.051781 -0.027568 -0.031047 -0.427927  0.218858 -0.009215
     -0.03604  -0.111994 -0.286872  0.061891 -0.00056  -0.791568 -0.553081
     -0.333942 -0.84509  -0.003077  0.780766  0.539807 -0.333298 -0.775451
      0.5       0.176471  0.       -1.        0.        0.       -0.714287
      0.       -0.714287 -1.        0.       -0.714287 -1.        0.
     -0.714287 -1.       -1.        0.       -0.714287 -1.        0.5
      0.176471  0.       -1.        0.        0.       -0.714287  0.
     -0.714287 -1.        0.       -0.714287 -1.        0.       -0.714287
     -1.       -1.        0.       -0.714287 -1.      ]

    action_rec_wo_normalize
    [ 0.000504  0.022266 -0.383155  0.859141 -0.47241   0.094109 -0.003962
     -0.407172  0.765647 -0.374376  0.145444 -0.001606  0.019545 -0.015109
      0.035492 -0.016778 -0.00883  -0.038179 -0.023056  0.036735  0.043112
      0.        0.        0.        0.        0.        0.        0.
      0.        0.        0.        0.        0.        0.        0.
      0.        0.        0.        0.        0.        0.        0.
      0.        0.        0.        0.        0.        0.        0.
      0.        0.        0.        0.        0.        0.        0.
      0.        0.        0.        0.        0.      ]

    action_src
    [-0.771802 -0.42334   0.136083  0.187762 -0.37183   0.73146   0.620496
     -0.219878 -0.832228  0.335425  0.427085  0.967805 -0.573363  0.306545
      0.437086 -0.923811 -0.961498 -0.128054 -0.146651  0.32453  -0.109606
      0.        0.        0.        0.        0.        0.        0.
      0.        0.        0.        0.        0.        0.        0.
      0.        0.        0.        0.        0.        0.        0.
      0.        0.        0.        0.        0.        0.        0.
      0.        0.        0.        0.        0.        0.        0.
      0.        0.        0.        0.        0.      ]

I also tested randomly selected actions and normalized them before execution; they are still different:

    action_src = env.action_space.sample()
    action_src[21:] = 0
    action_src = env.task.normalize_action(action_src)

    action_rec
    [ 0.107697 -0.016432  0.005941 -0.224225 -0.207612  0.193516 -0.042482
      0.018831 -0.270997 -0.214824 -0.01888   0.014548 -0.793247 -0.556336
     -0.335144 -0.85398  -0.008045  0.819778  0.555294 -0.333    -0.802578
      0.5       0.176471  0.       -1.        0.        0.       -0.714287
      0.       -0.714287 -1.        0.       -0.714287 -1.        0.
     -0.714287 -1.       -1.        0.       -0.714287 -1.        0.5
      0.176471  0.       -1.        0.        0.       -0.714287  0.
     -0.714287 -1.        0.       -0.714287 -1.        0.       -0.714287
     -1.       -1.        0.       -0.714287 -1.      ]

    action_src
    [ 2.051246 -0.176114  0.136142 -1.52508   0.544188  1.317612 -0.113427
      0.443373 -1.526157 -0.269055 -0.146196  0.122162 -0.744831 -0.780355
      0.128492 -1.786527 -0.065511  1.003311  0.624341 -0.389015 -0.638936
      0.5       0.176471  0.       -1.        0.        0.       -0.714287
      0.       -0.714287 -1.        0.       -0.714287 -1.        0.
     -0.714287 -1.       -1.        0.       -0.714287 -1.        0.5
      0.176471  0.       -1.        0.        0.       -0.714287  0.
     -0.714287 -1.        0.       -0.714287 -1.        0.       -0.714287
     -1.       -1.        0.       -0.714287 -1.      ]

Two questions:

  1. Is this related to the ctrlrange of the actuators in the xml? (See the inspection sketch after these questions.)
  2. When executing the same actions generated with your code, the robot does not stand still and gradually falls down (its knees gradually bend and its body falls backwards). Is this due to gravity and some actuators not being able to reach their target positions?
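For question 1, a quick way to inspect each actuator's ctrlrange (a sketch, using the same accessor style as the code above; if a target lies outside ctrlrange, MuJoCo clips the control, so an action recovered from qpos cannot match an out-of-range sample):

    # Print each actuator's control range to compare with the sampled targets.
    for i in range(env.model.nu):
        act = env.model.actuator(i)
        lo, hi = act.ctrlrange
        print(f"{act.name}: ctrlrange=[{lo:.3f}, {hi:.3f}]")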

Thanks for any reply. Best regards.

carlosferrazza commented 1 month ago

The code I posted above is the correct way to command the body joints to hold their current qpos. The initial keyframe is not passively stable, which is why the robot eventually falls down.

Xianqi-Zhang commented 1 month ago

Thank you.