Jialn closed this issue 5 years ago.
Some highlighted experiments:
Tested a smaller batch size (4096 -> 256), which makes the training curve smoother and consumes less time (24h vs. 7.5h for 2M env steps), so we choose a batch size of 256.
Re-sampling the action once for all of actor, Q, and alpha training (the current ALF SAC implementation) vs. re-sampling 3 times, once for each of them (similar to the tf-agents SAC implementation): re-sampling for each of them seems slightly better given the same env steps but consumes more time (7.5h vs. 9.5h), so it is hard to say which is better.
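To make the two sampling schemes concrete, here is a minimal sketch. The loss functions below are toy stand-ins, not the real ALF/tf-agents losses; only the sampling structure (one shared sample vs. one sample per loss) mirrors the comparison above.

```python
import random

random.seed(0)

# Toy placeholder losses -- NOT the real SAC losses, just stand-ins so
# the two sampling schemes can be compared side by side.
def actor_loss(s, a):
    return (a - s) ** 2

def critic_loss(s, a):
    return (a + s) ** 2

def alpha_loss(s, a):
    return abs(a)

def sample_action(state):
    # Stand-in for drawing an action from the policy distribution.
    return random.gauss(state, 1.0)

def losses_shared_sample(state):
    # ALF-style: one sampled action reused by the actor, Q, and alpha losses.
    a = sample_action(state)
    return actor_loss(state, a), critic_loss(state, a), alpha_loss(state, a)

def losses_per_network_sample(state):
    # tf-agents-style: an independent sample for each loss (3 samples total).
    return (actor_loss(state, sample_action(state)),
            critic_loss(state, sample_action(state)),
            alpha_loss(state, sample_action(state)))

print(losses_shared_sample(0.5))
print(losses_per_network_sample(0.5))
```

The per-network variant costs roughly 3x the sampling (and forward-pass) work per update, which matches the 7.5h vs. 9.5h timing above.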
From the behavior, I can see that in the alf-sac case the agent gets balanced very fast but does not walk forward. It seems the training was trapped in a "local minimum". So I increased the learning rate by 4x and 10x (5e-4 -> 2e-3, 5e-3): a lr of 2e-3 reaches a final return similar to the tf-agents one, and learning proceeds much faster.
I don't know why, but it seems alf-sac has a balance-first preference and thus requires a higher learning rate than tf-agents in the icub-walking task.
Actually, similar behavior has been found in the alf-sac experiment on the pybullet humanoid task. The agent gets balanced very fast; although it can eventually walk, it does so with small shuffling steps, unlike PPO. I guess there might be some unnoticed difference in the entropy regularization implementation.
Added experiments on replay buffer size. A smaller buffer limits "off-policyness": it lowers the probability of sampling early-stage "un-useful" transitions that would never occur under a well-trained policy, which speeds up training. Also, the 1M replay buffer we are using now requires too much memory (> 90G). So I've tested 100K and 10K replay buffer sizes:
So we can see: training speed: 10K > 100K > 1M; final performance: 10K < 100K ~= 1M.
A smaller memory size does benefit the training progress. The distribution of state-action pairs shifts gradually with the improving policy as we abandon old samples. Although the Q network is meant to cover all (s, a), in practice it only needs to fit the distribution induced by the "on-going, well-trained" policy and the environment dynamics. But the replay buffer still has to keep enough room for the "less-bad" state-action pairs near the distribution of the "good" ones. Those "less-bad" pairs might serve as a low-value bound that prevents the policy from choosing actions beyond it, which might not be well generalized by the Q network (leading to poorly generalized states in consequence).
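The eviction behavior that limits "off-policyness" can be sketched with a bounded FIFO buffer; this is a toy illustration (a `deque` with `maxlen`), not the actual ALF replay buffer.

```python
from collections import deque

# A minimal sketch of a bounded FIFO replay buffer. With a small capacity,
# early-stage transitions are evicted as training goes on, so sampling
# concentrates on data generated by recent, better policies.
small_buffer = deque(maxlen=3)   # toy capacity; the experiments use 10K-1M
for transition in range(10):     # pretend each int is one stored transition
    small_buffer.append(transition)

print(list(small_buffer))        # -> [7, 8, 9]: only recent data remains
```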
> I don't know why, but it seems alf-sac has a balance-first preference and thus requires a higher learning rate than tf-agents in the icub-walking task.
> Actually, similar behavior has been found in the alf-sac experiment on the pybullet humanoid task. The agent gets balanced very fast; although it can eventually walk, it does so with small shuffling steps, unlike PPO. I guess there might be some unnoticed difference in the entropy regularization implementation.
As far as I know, there are 2 differences between the SAC implemented in ALF and in tf-agents:
(1) the action is sampled only once for all networks in ALF, but re-sampled for each network in tf-agents;
(2) ALF computes losses for all steps in the sampled trajectories/transitions. With mini_batch_length=2 (i.e. training on sampled transitions (time_steps, next_time_steps)): the critic loss in ALF is half of that in tf-agents (in ALF the critic loss for next_time_steps is 0, and tf.reduce_mean is applied to the loss), and ALF uses both time_steps and next_time_steps to compute the actor and alpha losses, while tf-agents uses only time_steps.
So, in ALF the critic effectively learns at 0.5x, and the actor and alpha at 2x.
I tried to make them the same, but they still behave differently.
The major difference is that tf-agents does not mask the last step when computing the losses; ALF does the right thing.
I have tested with:
- re-sampling the action for computing the actor, critic, and alpha losses in ALF;
- using only time_steps for computing the actor and alpha losses in ALF (not using next_time_steps in the sampled transitions);
- masking the last steps when computing the losses in tf-agents;
and they give consistent performance.
> So, in ALF the critic effectively learns at 0.5x, and the actor and alpha at 2x.
> I tried to make them the same, but they still behave differently.
I think because of reduce_mean(), the actor and alpha are still learned at 1x, even though they use both time_steps and next_time_steps.
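The scaling arithmetic being debated can be written out explicitly. The per-step loss values below are made up for illustration; only the averaging structure reflects what the thread describes.

```python
# Toy arithmetic for the loss-scaling discussion (made-up per-step losses,
# not the actual ALF/tf-agents code). With mini_batch_length=2, ALF's
# critic loss is non-zero only at time_steps, while the actor loss is
# computed at both steps.
per_step_critic = [1.0, 0.0]   # critic loss at next_time_steps is 0 in ALF
per_step_actor = [1.0, 1.0]    # actor loss computed at both steps in ALF

def mean(xs):
    return sum(xs) / len(xs)

print(mean(per_step_critic))   # 0.5 -> critic effectively learns at 0.5x
print(mean(per_step_actor))    # 1.0 -> after reduce_mean, actor stays at 1x
```

Averaging over the zero-loss step halves the critic gradient, but averaging two equal actor losses leaves the actor gradient unchanged, which is the point made above.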
> The major difference is that tf-agents does not mask the last step when computing the losses; ALF does the right thing.
So this is a bug of tf-agents? Did you test how big the effect is?
> I have tested with:
> - re-sampling the action for computing the actor, critic, and alpha losses in ALF;
> - using only time_steps for computing the actor and alpha losses in ALF (not using next_time_steps in the sampled transitions);
> - masking the last steps when computing the losses in tf-agents;
> and they give consistent performance.
Did you mean that with these changes, alf and tf-agents get the same result? What if you only do point 3?
> Added experiments on replay buffer size. A smaller buffer limits "off-policyness": it lowers the probability of sampling early-stage "un-useful" transitions that would never occur under a well-trained policy, which speeds up training. Also, the 1M replay buffer we are using now requires too much memory (> 90G). So I've tested 100K and 10K replay buffer sizes:
90GB memory seems outrageous. @witwolf, do you have any idea why so much memory is used?
> 90GB memory seems outrageous. @witwolf, do you have any idea why so much memory is used?
Sorry, my mistake: the "used" column of "free -m" included other tasks that I forgot were running. The memory usage recorded by TensorBoard is 20GB for the 1M replay buffer size, 2.8GB for 100K, and 1.7GB for 10K:
Masking the last step can have a big effect in early-stage training, since the average episode length is very short (~10 steps) because the robot cannot keep balanced early on.
That could help explain why tfa-sac is not so eager to keep balanced. Without the mask, a state about to fall down leads to a next state (the initial state) with a relatively higher Q value, so such states are not a very low-value zone for tfa-sac. That happens to encourage the agent to leave the initially balanced state and walk away.
> So this is a bug of tf-agents? Did you test how big the effect is?
sac, ddpg, and td3 as implemented in tf-agents don't mask the last steps and work fine.
I have not tested the effect, and the tests can only explain why the performance differs between SAC in ALF and SAC in tf-agents on this task.
> Did you mean that with these changes, alf and tf-agents get the same result?
Yes, they get the same result.
> What if you only do point 3?
They get almost the same result (loss curves). The tests show that point 1 has little effect on the training progress, and point 2 can be compensated for by adjusting the learning rates.
> - re-sampling the action for computing the actor, critic, and alpha losses in ALF;
> - using only time_steps for computing the actor and alpha losses in ALF (not using next_time_steps in the sampled transitions);
> - masking the last steps when computing the losses in tf-agents;
@witwolf Can you document these differences and their effects in ALF code?
> 90GB memory seems outrageous. @witwolf, do you have any idea why so much memory is used?
> Sorry, my mistake: the "used" column of "free -m" included other tasks that I forgot were running. The memory usage recorded by TensorBoard is 20GB for the 1M replay buffer size, 2.8GB for 100K, and 1.7GB for 10K.
20GB for the 1M replay buffer size is still too much. This means 20KB for each step, while my estimate is about 1KB/step: the observation is 170-dimensional and the action is 25-dimensional. There are 1 observation, 2 actions, and 1 action distribution in the experience, so the size is about (170 + 25\*4) \* 4 = 1080 bytes.
Oh, the replay buffer capacity is actually max_length \* num_parallel_environments. So if max_length=1M with 12 parallel environments, the memory should be 1080 \* 12 \* 1M ≈ 13GB. Still less than 20GB.
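The back-of-the-envelope estimate above can be reproduced in a few lines (assuming float32 storage, as in the calculation):

```python
# Reproducing the memory estimate from the thread.
obs_dim, act_dim = 170, 25
floats_per_step = obs_dim + 4 * act_dim   # 1 obs + 2 actions + dist params
bytes_per_step = 4 * floats_per_step      # float32 = 4 bytes each
print(bytes_per_step)                     # -> 1080

num_envs, max_length = 12, 1_000_000      # buffer capacity per environment
total_gb = bytes_per_step * num_envs * max_length / 1e9
print(round(total_gb, 2))                 # -> 12.96, i.e. ~13GB
```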
Sorry, I just remembered that the 20GB memory usage above is from the experiment alf_sac_1_4096_batch_size, which was copied from an earlier experiment using 16 parallel envs. I haven't re-run it with 12 parallel environments (to save some CO2) and forgot about that...
All other experiments use 12 parallel envs and 1M max_length and take about 14.6GB-14.7GB, which is 13GB more than the 1.7GB of the 10K replay buffer case (whose buffer memory can be ignored).
So this is very close to the calculation. The process without a replay buffer takes 1.6GB~1.7GB. Seems more reasonable.
Changes:
The difference is the learning rate. I don't know why, but it seems alf-sac requires a higher learning rate than tf-agents in the icub-walking task.