Ericonaldo opened this issue 3 years ago
I find that even with a larger batch size, HER still does not work. Do you know why?
@Ericonaldo Hi, actually, using MPI is effectively a larger batch size. Could you tell me what batch size (the "larger batch size") you used when training the push task, please?
Hi, I've tried 4 processes and 2 processes; both work, but a single process with a batch size of 2048 does not.
@Ericonaldo Hi - my guess is that it is because of the diversity of samples. With a single process, before the agent updates the network it only collects 2 * 50 = 100 episodes per epoch. The agent then samples `batch_size` episodes from the replay buffer and samples one transition from each sampled episode for training. In this case, even after 50 epochs the agent has only collected 5000 unique episodes (50 * 100). So although you use `batch_size=2048`, the diversity of samples is still limited when `num_process=1`, and you will sample many repeated episodes during training. When you use `num_process=2`, the agent can sample transitions from twice as many collected episodes. But I'm not sure if this is the real reason, and I welcome further discussion.
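To make the sampling scheme above concrete, here is a minimal sketch (hypothetical function and array names, not the repo's actual sampler) of drawing `batch_size` episodes and one transition from each:

```python
import numpy as np

def sample_transitions(buffer_obs, buffer_actions, num_stored_episodes, batch_size):
    # buffer_obs: [max_episodes, T, obs_dim], buffer_actions: [max_episodes, T, act_dim]
    # (hypothetical buffer layout, used only for this illustration)
    episode_len = buffer_actions.shape[1]
    # draw batch_size episode indices with replacement: if only a few unique
    # episodes are stored, the same episodes get drawn over and over
    episode_idx = np.random.randint(0, num_stored_episodes, size=batch_size)
    # draw one timestep per sampled episode
    t_idx = np.random.randint(0, episode_len, size=batch_size)
    return buffer_obs[episode_idx, t_idx], buffer_actions[episode_idx, t_idx]
```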
If this is true, we should be able to make it work by scaling the number of collected episodes by K times. However, that does not seem to work either.
@Ericonaldo Hmm - that's a good point. An interesting finding is here: https://github.com/TianhongDai/hindsight-experience-replay/blob/master/mpi_utils/mpi_utils.py#L21-L22 . I followed the setting of OpenAI: they use `sum` instead of `avg` to gather the gradients from the MPI workers. I will try the `avg` operation to see if it affects the performance, and will update here later.
Great, and many thanks. I did this because I found my own implementation of HER can only reach a success rate of 70-80%, and I am trying to figure out what really matters in training.
@Ericonaldo Yes - implementing HER correctly is quite tricky...
@Ericonaldo I found that the `SUM` operator does influence the performance: https://github.com/TianhongDai/hindsight-experience-replay/blob/master/mpi_utils/mpi_utils.py#L21-L22
Here, instead of using `SUM` alone, I average the gradient over the number of MPI workers:
```python
comm.Allreduce(flat_grads, global_grads, op=MPI.SUM)
# average the summed gradient over the number of MPI workers
global_grads /= comm.Get_size()
```
Then I plot the training curve using 2 MPI workers, and when the gradient is averaged, the performance drops. In other words, if we do not average the gradient, the network update effectively becomes `x' = x - (lr * num_mpi) * avg_grad` (assuming a simple SGD optimizer), i.e. the learning rate is increased. I'm not sure if this is the main reason, but we can keep running more experiments to verify it.
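To see the learning-rate argument numerically, here is a small self-contained sketch (not from the repo) showing that summing the per-worker gradients gives the same update as using the averaged gradient with the learning rate multiplied by the number of workers:

```python
import numpy as np

lr = 0.001
num_mpi = 2
x = np.array([1.0, -2.0])                              # current parameters
grads = [np.array([0.5, 0.3]), np.array([0.5, 0.3])]   # per-worker gradients

avg_grad = np.mean(grads, axis=0)
sum_grad = np.sum(grads, axis=0)

x_sum = x - lr * sum_grad                  # MPI.SUM without dividing by comm size
x_big_lr = x - (lr * num_mpi) * avg_grad   # averaged gradient, scaled learning rate

print(np.allclose(x_sum, x_big_lr))        # True: summing == scaling lr by num_mpi
```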
This seems like an important reason, but when I run with a single process it still shows no sign of learning... (at least with the averaged gradient and 2 processes it does learn, albeit slowly).
@Ericonaldo Yes - I agree, we need to carry out more experiments to verify this. We can use this thread to continue the discussion.
I think the learning rates of both the policy network and the value network are important hyper-parameters for these goal-conditioned tasks; after fine-tuning some values, I found that even a single process can achieve good results.
@Ericonaldo Thanks! This is a great finding.
Hi Tianhong, thanks for sharing the code. I've tried to run your code following the guidance in the readme, but surprisingly I find that running it does not work at all. Do you happen to know why?