cselab / smarties

Lightweight and scalable framework for Reinforcement Learning
MIT License

questions #2

Open codingliuyg opened 4 years ago

codingliuyg commented 4 years ago

Hello, is there a Python implementation of 'Remember and Forget for Experience Replay' (and its supplementary material)? I had trouble with the gradient calculation. Is it right for me to compute the gradients one by one? Looking forward to your reply, thanks a lot. image

novatig commented 4 years ago

Hello,

I haven't yet implemented ReF-ER in pytorch/tensorflow. I have a small, private for now, pytorch repo of simple RL algorithms, which I used to teach a workshop. I will add ReF-ER and share it after I defend my thesis, in a couple of months. In the meantime...

I had trouble with the gradient calculation. Is it right for me to compute the gradient one by one?

Do you mean computing the gradient for one sample of the mini-batch at a time? No, you do not. The default approach in pytorch/tensorflow of performing all operations with multi-dimensional tensors works fine for ReF-ER. In fact, ReF-ER just applies the techniques of PPO to off-policy RL, and PPO has been re-implemented many times in both pytorch and tensorflow.

With the disclaimer that I have never implemented it myself, my first guess at how I would implement ReF-ER in pytorch/tensorflow would be:
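something like the following untested pytorch sketch, just to illustrate the idea (all names are placeholders, and the sign convention of the per-sample loss depends on which algorithm you plug it into):

import torch

# Untested ReF-ER-style sketch (not the smarties implementation).
# logp_new: log pi_w(a_t|s_t) under the current policy, shape (batch,)
# logp_old: log mu_t(a_t|s_t) stored when the action was generated, shape (batch,)
# rl_loss:  per-sample off-policy loss (e.g. Q-learning or DPG loss), shape (batch,)
# kl:       per-sample D_KL(mu_t || pi_w), shape (batch,)
def refer_loss(logp_new, logp_old, rl_loss, kl, c_max, beta):
    rho = torch.exp(logp_new - logp_old)            # importance weight rho_t
    near = (rho < c_max) & (rho > 1.0 / c_max)      # near-policy mask
    # Skip (rather than clip) the RL gradient of far-policy samples,
    # and always keep the KL penalty toward the behavior policies.
    gated = torch.where(near, rl_loss, torch.zeros_like(rl_loss))
    return (beta * gated + (1.0 - beta) * kl).mean()

The point is that far-policy samples contribute nothing to the RL gradient, but they still contribute to the penalty term that pulls the policy back toward the replayed behaviors.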

I know that implementations of PPO use torch.clamp or tf.clip_by_value but that only works for the off-policy policy gradient. Let me know how it goes!

codingliuyg commented 4 years ago

Sorry to reply so late; I have been working on the code these days. Thank you so much for your advice. After following it, the code works fine. I have another two quick questions:

1. Is it right to compute the mean of the ρt over axis=1 when the action dimension is not 1, e.g. for the env 'Humanoid-v2' where the action dimension is 17?

2. Computing n_far (the number of far-policy samples) at every step is time-consuming, about 0.146 s per step, because I have to go through the whole Experience Replay to check whether ρt > c_max or not. Do you have any suggestions?

Looking forward to your reply. Thanks again!

novatig commented 4 years ago

Hello,

I did not understand point 1. What would be axis=1? What would be axis=0? Why would you compute the mean of the importance weights?

Regarding point 2, I assume this is about updating the penalization coefficient. In the paper, I wrote that I store the most recently computed importance ratio for each experience in the RM. Each time an experience is sampled for a mini-batch gradient update, the associated importance weight is updated.

I do something similar to be able to use Retrace while training on individual steps rather than on full episodes.
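In pseudocode, the bookkeeping is just something like this (a sketch only; the real thing in smarties is C++ and multithreaded, and the names here are illustrative):

import numpy as np

class ReplayMemory:
    def __init__(self, capacity):
        self.capacity = capacity
        self.rho = np.ones(capacity)  # most recently computed rho_t for each stored step
        # ... plus states, actions, rewards, behavior means/stdevs, etc.

    def sample(self, batch_size):
        # assumes the buffer is full; fetch the transitions with these indices
        idx = np.random.randint(self.capacity, size=batch_size)
        return idx

    def refresh(self, idx, new_rho):
        # called after every gradient step with the rho_t just computed
        # for the sampled mini-batch
        self.rho[idx] = new_rho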

codingliuyg commented 4 years ago

Hello,

For point 1: what I mean is that my importance ratio has shape (128, 17) when the env is 'Humanoid-v2' and the mini-batch size is 128. Should I then convert the importance ratio's shape to (128, 1) by computing the mean of the importance weights? My understanding is that one sample corresponds to one importance ratio. My code is:

self.a_new_noise_policy = self.norm_dist # πw(a|s)
self.a_old_noise_policy = tf.distributions.Normal(self.u_mean, self.u_sigma) # μt(at|st)
self.ratio = tf.reduce_mean(self.a_new_noise_policy.prob(self.action) / (self.a_old_noise_policy.prob(self.action) + 1e-5), axis=1)

Is there anything wrong with the way I calculated it? Or how should I implement the formula ρt = πw(at|st) / μt(at|st)?

For point 2: I still don't quite understand your handling of it. I also store the most recently computed importance ratio for each experience in the RM, but I have to compare ρi with c_max and 1/c_max 300000 times when the RM size is 300000. Thank you!

novatig commented 4 years ago

1) What version of tf are you using? I managed to run part of your commands on 1.14. There, tf.distributions.Normal(means, stdevs) inherits the shape from the input tensors. Be careful: as far as I understand, you are dealing with a batch of 1D probability distributions, not multivariate distributions. In RL we usually assume diagonal covariances, i.e. independently sampled action-vector components. Therefore the probability of an action vector is the product of the probabilities of its components, not the mean. Also, be careful about adding 1e-5 to the denominator: it may work for 1D action problems, but it will certainly not work for high-dimensional action spaces.
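For example, with your variable names, something along these lines should give one ratio per sample (a sketch only, done in log-space so that no 1e-5 is needed):

# One importance weight per mini-batch sample: product over action components,
# computed as a sum of log-probabilities for numerical stability.
logp_new = tf.reduce_sum(self.a_new_noise_policy.log_prob(self.action), axis=1)
logp_old = tf.reduce_sum(self.a_old_noise_policy.log_prob(self.action), axis=1)
self.ratio = tf.exp(logp_new - logp_old)  # shape (batch_size,)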

2) Sorry, I did not think it through. When I update the stored importance weights, I also figure out whether the sample was "far policy" before the update and whether it is after the update ( https://github.com/cselab/smarties/blob/master/source/ReplayMemory/Episode.h#L175 ). I use that to update the counter of far-policy experiences (nFarPolicySteps += isFarPolicy - wasFarPolicy). Because smarties is C++, I use atomics and all this is computationally negligible.
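Continuing the earlier sketch, the refresh step could look like this (illustration only, not the actual code behind that link; self.n_far starts at 0 in the constructor):

# Replaces the refresh() method in the earlier sketch.
def refresh(self, idx, new_rho, c_max):
    old_rho = self.rho[idx]
    was_far = (old_rho > c_max) | (old_rho < 1.0 / c_max)
    is_far = (new_rho > c_max) | (new_rho < 1.0 / c_max)
    # incremental update: no pass over the whole replay memory is needed
    self.n_far += int(np.sum(is_far)) - int(np.sum(was_far))
    self.rho[idx] = new_rho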

codingliuyg commented 4 years ago

Hello novatig, I still have a few questions I'm not clear about.

1. As shown in figure 1, the parameters (beta, c_max, far-policy rate, lr of the policy net) appear to vary normally, but as shown in figure 2, the cumulative reward rises slowly. Do you have any suggestions for this? Did you set the env unwrapped? Fig 1 image Fig 2 image

2. Why is the condition N_start rather than N (the maximum size of the replay buffer), and why should I remove an episode rather than a step? image

3. For DDPG, I don't need V_w(st) here, right ({st, rt, at, μt, V_w(st)})? For example, one sample would just be {st, rt, at, μt}; training doesn't need a full episode, does it? image

4. For DDPG, the noise std is 0.2; is it fixed or does it need training? I keep it at 0.2 at the moment.

Looking forward to your reply. Thank you!

novatig commented 4 years ago

Hi, sorry I missed this.

1) I cannot replicate your results so I really can't say. I don't even know what problem it is. Try using smarties on your problem and see what returns you observe.

2) Thanks! That's a typo, it should be n_obs > N. In my code experiences are stored as episodes (in order to easily compute GAE or Retrace, or use RNNs, for example), so that's what I do.

3) Yes, you do not need episodes nor V for DDPG.

4) Yes, it's fixed.

codingliuyg commented 4 years ago

Thank you very much for your reply!

1. My immediate question: in order to keep a 10% far-policy ratio, my beta value stays between 0.1 and 0.2. My understanding is that this means 80% of the update is spent maintaining the far-policy ratio and only 20% on actor/critic learning, so the reward rises especially slowly. Can you show me your beta trend for DDPG? My env is 'Humanoid-v2'. image image

2. For DDPG, is the std of the noise the std of the final distribution? I don't have to predict a standard deviation of the distribution like in PPO, do I?

3. I used your set of initialization parameters for DDPG, with β initialized to 1 (or should it start at 0.0001?). Is there anything wrong?

{
  "learner": "DPG",
  "batchSize": 128,
  "clipImpWeight": 4,
  "encoderLayerSizes": [128],
  "epsAnneal": 5e-7,
  "explNoise": 0.2,
  "gamma": 0.995,
  "learnrate": 0.00001,
  "maxTotObsNum": 262144,
  "minTotObsNum": 131072,
  "nnLayerSizes": [128],
  "targetDelay": 0.001
}

4. I'm having some problems compiling your code on my Mac, so I can't see what the correct result looks like. Can you show me the general evolution of the important parameters (such as the far-policy ratio, beta, c_max, and so on)? Thank you.

novatig commented 4 years ago

1) Hi, I ran it and got beta stabilizing around 0.8, so there might be something wrong with your code: betadpg.pdf retdpg.pdf. C_max is like yours, and the fraction of far-policy samples is always between 0.09 and 0.11.

2) I don't understand. The stdev of actions is fixed to 0.2.

3) Yes, looks good.

4) Makefile is more fragile than CMake, try the new instructions (basically now they are the same for mac and linux).

codingliuyg commented 4 years ago

Hello,

1. Can you show me the structure of your DDPG policy network? Is it the same as the one below?

image

2. Is my calculation of the ratio ρ correct? And do I compare it directly with c_max?

self.ratio = (tf.reduce_prod(self.a_new_noise_policy.prob(self.action),axis=1))/(tf.reduce_prod(self.a_old_noise_policy.prob(self.action), axis=1))

self.kl = tf.reduce_sum(tf.distributions.kl_divergence(self.a_old_noise_policy, self.a_new_noise_policy), axis=1)

image

novatig commented 4 years ago

Hi, sorry for the delay.

  1. Looks fine.

  2. Again, I have not personally implemented ReF-ER in tf yet, but your lines look correct. The probability of an action vector drawn from a multivariate normal with diagonal covariance is the product of the probabilities of each action component. The KL divergence in the same setting is the sum of the 1D KL divergences. Check that with learning rate zero all the ratios are always one and the KL is always 0.
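For instance, something like this on a single mini-batch (a sketch, run inside your agent class; it assumes your existing sess and feed_dict):

import numpy as np

# With learning rate 0 the current policy pi_w never moves away from the
# behavior policy mu_t, so every ratio must be 1 and every KL must be 0.
ratio_vals, kl_vals = sess.run([self.ratio, self.kl], feed_dict=feed_dict)
assert np.allclose(ratio_vals, 1.0)
assert np.allclose(kl_vals, 0.0)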