Jialn closed this issue 5 years ago.
Some highlighted experiments:
Tested a smaller batch size (4096 -> 256), which makes the training curve smoother and consumes less time (24h vs. 7.5h for 2M env steps), so we choose a batch size of 256.
Re-sampling the action once for all of actor, Q, and alpha training (the current ALF SAC implementation) vs. re-sampling 3 times, once for each of them (similar to the tf-agents SAC implementation): re-sampling for each of them seems slightly better given the same env steps but consumes more time (7.5h vs. 9.5h), so it is hard to say which is better.
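To make the two sampling schemes concrete, here is a minimal sketch. The loss functions below are toy stand-ins, not the real ALF/tf-agents losses; only the sampling structure (one shared sample vs. one sample per loss) mirrors the comparison above.

```python
import random

random.seed(0)

# Toy placeholder losses -- NOT the real SAC losses, just stand-ins so
# the two sampling schemes can be compared side by side.
def actor_loss(s, a):
    return (a - s) ** 2

def critic_loss(s, a):
    return (a + s) ** 2

def alpha_loss(s, a):
    return abs(a)

def sample_action(state):
    # Stand-in for drawing an action from the policy distribution.
    return random.gauss(state, 1.0)

def losses_shared_sample(state):
    # ALF-style: one sampled action reused by the actor, Q, and alpha losses.
    a = sample_action(state)
    return actor_loss(state, a), critic_loss(state, a), alpha_loss(state, a)

def losses_per_network_sample(state):
    # tf-agents-style: an independent sample for each loss (3 samples total).
    return (actor_loss(state, sample_action(state)),
            critic_loss(state, sample_action(state)),
            alpha_loss(state, sample_action(state)))

print(losses_shared_sample(0.5))
print(losses_per_network_sample(0.5))
```

The per-network variant costs roughly 3x the sampling (and forward-pass) work per update, which matches the 7.5h vs. 9.5h timing above.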
From the behavior, I can see that in the alf-sac case the agent gets balanced very fast but does not walk forward. It seems the training was trapped in a "local minimum". So I increased the learning rate by 4x and 10x (5e-4 -> 2e-3, 5e-3): a lr of 2e-3 reaches a final return similar to the tf-agents one, and learning proceeds much faster.
I don't know why, but it seems alf-sac has a balance-first preference and thus requires a higher learning rate than tf-agents in the icub-walking task.
Actually, similar behavior has been found in the alf-sac experiment on the pybullet humanoid task. The agent gets balanced very fast; although it can eventually walk, it does so with small shuffling steps, unlike PPO. I guess there might be some unnoticed difference in the entropy regularization implementation.
Added experiments on replay buffer size. A smaller buffer limits "off-policyness": it lowers the probability of sampling early-stage "un-useful" transitions that would never occur under a well-trained policy, which speeds up training. Also, the 1M replay buffer we are using now requires too much memory (> 90G). So I've tested 100K and 10K replay buffer sizes:
So we can see: training speed: 10K > 100K > 1M; final performance: 10K < 100K ~= 1M.
A smaller memory size does benefit the training progress. The distribution of state-action pairs shifts gradually with the improving policy as we abandon old samples. Although the Q network is meant to cover all (s, a), in practice it only needs to fit the distribution induced by the "on-going, well-trained" policy and the environment dynamics. But the replay buffer still has to keep enough room for the "less-bad" state-action pairs near the distribution of the "good" ones. Those "less-bad" pairs might serve as a low-value bound that prevents the policy from choosing actions beyond it, which might not be well generalized by the Q network (leading to poorly generalized states in consequence).
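The eviction behavior that limits "off-policyness" can be sketched with a bounded FIFO buffer; this is a toy illustration (a `deque` with `maxlen`), not the actual ALF replay buffer.

```python
from collections import deque

# A minimal sketch of a bounded FIFO replay buffer. With a small capacity,
# early-stage transitions are evicted as training goes on, so sampling
# concentrates on data generated by recent, better policies.
small_buffer = deque(maxlen=3)   # toy capacity; the experiments use 10K-1M
for transition in range(10):     # pretend each int is one stored transition
    small_buffer.append(transition)

print(list(small_buffer))        # -> [7, 8, 9]: only recent data remains
```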
> I don't know why, but it seems alf-sac has a balance-first preference and thus requires a higher learning rate than tf-agents in the icub-walking task.
> Actually, similar behavior has been found in the alf-sac experiment on the pybullet humanoid task. The agent gets balanced very fast; although it can eventually walk, it does so with small shuffling steps, unlike PPO. I guess there might be some unnoticed difference in the entropy regularization implementation.
As far as I know, there are 2 differences between the SAC implemented in ALF and in tf-agents:
(1) the action is sampled only once for all networks in ALF, but re-sampled for each network in tf-agents;
(2) ALF computes losses for all steps in the sampled trajectories/transitions. With mini_batch_length=2 (i.e. training on sampled transitions (time_steps, next_time_steps)): the critic loss in ALF is half of that in tf-agents (in ALF the critic loss for next_time_steps is 0, and tf.reduce_mean is applied to the loss), and ALF uses both time_steps and next_time_steps to compute the actor and alpha losses, while tf-agents uses only time_steps.
So, in ALF the critic effectively learns at 0.5x, and the actor and alpha at 2x.
I tried to make them the same, but they still behave differently.
The major difference is that tf-agents does not mask the last step when computing the losses; ALF does the right thing.
I have tested with:
- re-sampling the action for computing the actor, critic, and alpha losses in ALF;
- using only time_steps for computing the actor and alpha losses in ALF (not using next_time_steps in the sampled transitions);
- masking the last steps when computing the losses in tf-agents;
and they give consistent performance.
> So, in ALF the critic effectively learns at 0.5x, and the actor and alpha at 2x.
> I tried to make them the same, but they still behave differently.
I think because of reduce_mean(), the actor and alpha are still learned at 1x, even though they use both time_steps and next_time_steps.
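The scaling arithmetic being debated can be written out explicitly. The per-step loss values below are made up for illustration; only the averaging structure reflects what the thread describes.

```python
# Toy arithmetic for the loss-scaling discussion (made-up per-step losses,
# not the actual ALF/tf-agents code). With mini_batch_length=2, ALF's
# critic loss is non-zero only at time_steps, while the actor loss is
# computed at both steps.
per_step_critic = [1.0, 0.0]   # critic loss at next_time_steps is 0 in ALF
per_step_actor = [1.0, 1.0]    # actor loss computed at both steps in ALF

def mean(xs):
    return sum(xs) / len(xs)

print(mean(per_step_critic))   # 0.5 -> critic effectively learns at 0.5x
print(mean(per_step_actor))    # 1.0 -> after reduce_mean, actor stays at 1x
```

Averaging over the zero-loss step halves the critic gradient, but averaging two equal actor losses leaves the actor gradient unchanged, which is the point made above.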
> The major difference is that tf-agents does not mask the last step when computing the losses; ALF does the right thing.
So this is a bug of tf-agents? Did you test how big the effect is?
> I have tested with:
> - re-sampling the action for computing the actor, critic, and alpha losses in ALF;
> - using only time_steps for computing the actor and alpha losses in ALF (not using next_time_steps in the sampled transitions);
> - masking the last steps when computing the losses in tf-agents;
> and they give consistent performance.
Did you mean that with these changes, alf and tf-agents get the same result? What if you only do point 3?
> Added experiments on replay buffer size. A smaller buffer limits "off-policyness": it lowers the probability of sampling early-stage "un-useful" transitions that would never occur under a well-trained policy, which speeds up training. Also, the 1M replay buffer we are using now requires too much memory (> 90G). So I've tested 100K and 10K replay buffer sizes:
90GB memory seems outrageous. @witwolf, do you have any idea why so much memory is used?
> 90GB memory seems outrageous. @witwolf, do you have any idea why so much memory is used?
Sorry, my mistake: the "used" column of "free -m" included other tasks that I forgot were running. The memory usage recorded by TensorBoard is 20GB for the 1M replay buffer size, 2.8GB for 100K, and 1.7GB for 10K:
Masking the last step can have a big effect in early-stage training, since the average episode length is very short (~10 steps) because the robot cannot keep balanced early on.
That could help explain why tfa-sac is not so eager to keep balanced. Without the mask, a state about to fall down leads to a next state (the initial state) with a relatively higher Q value, so such states are not a very low-value zone for tfa-sac. That happens to encourage the agent to leave the initially balanced state and walk away.
> So this is a bug of tf-agents? Did you test how big the effect is?
sac, ddpg, and td3 as implemented in tf-agents don't mask the last steps and work fine.
I have not tested the effect, and the tests can only explain why the performance differs between SAC in ALF and SAC in tf-agents on this task.
> Did you mean that with these changes, alf and tf-agents get the same result?
Yes, they get the same result.
> What if you only do point 3?
They get almost the same result (loss curves). The tests show that point 1 has little effect on the training progress, and point 2 can be compensated for by adjusting the learning rates.
> - re-sampling the action for computing the actor, critic, and alpha losses in ALF;
> - using only time_steps for computing the actor and alpha losses in ALF (not using next_time_steps in the sampled transitions);
> - masking the last steps when computing the losses in tf-agents;
@witwolf Can you document these differences and their effects in ALF code?
> 90GB memory seems outrageous. @witwolf, do you have any idea why so much memory is used?
> Sorry, my mistake: the "used" column of "free -m" included other tasks that I forgot were running. The memory usage recorded by TensorBoard is 20GB for the 1M replay buffer size, 2.8GB for 100K, and 1.7GB for 10K.
20GB for the 1M replay buffer size is still too much. This means 20KB for each step, while my estimate is about 1KB/step: the observation is 170-dimensional and the action is 25-dimensional. There are 1 observation, 2 actions, and 1 action distribution in the experience, so the size is about (170 + 25\*4) \* 4 = 1080 bytes.
Oh, the replay buffer capacity is actually max_length \* num_parallel_environments. So if max_length=1M with 12 parallel environments, the memory should be 1080 \* 12 \* 1M ≈ 13GB. Still less than 20GB.
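The back-of-the-envelope estimate above can be reproduced in a few lines (assuming float32 storage, as in the calculation):

```python
# Reproducing the memory estimate from the thread.
obs_dim, act_dim = 170, 25
floats_per_step = obs_dim + 4 * act_dim   # 1 obs + 2 actions + dist params
bytes_per_step = 4 * floats_per_step      # float32 = 4 bytes each
print(bytes_per_step)                     # -> 1080

num_envs, max_length = 12, 1_000_000      # buffer capacity per environment
total_gb = bytes_per_step * num_envs * max_length / 1e9
print(round(total_gb, 2))                 # -> 12.96, i.e. ~13GB
```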
Sorry, I just remembered that the 20GB memory usage above is from the experiment alf_sac_1_4096_batch_size, which was copied from an earlier experiment using 16 parallel envs. I haven't re-run it with 12 parallel environments (to save some CO2) and forgot about that...
All other experiments use 12 parallel envs and 1M max_length and take about 14.6GB-14.7GB, which is 13GB more than the 1.7GB of the 10K replay buffer case (whose buffer memory can be ignored).
So this is very close to the calculation. The process without a replay buffer takes 1.6GB~1.7GB. Seems more reasonable.
Changes:
The difference is the learning rate. I don't know why, but it seems alf-sac requires a higher learning rate than tf-agents in the icub-walking task.