TianhongDai / hindsight-experience-replay

This is the PyTorch implementation of Hindsight Experience Replay (HER) - experiments on all Fetch robotic environments.

Why does a single process on Push not work? #19

Open · Ericonaldo opened this issue 3 years ago

Ericonaldo commented 3 years ago

Hi, Tianhong, thanks for sharing the code. I've tried to run your code based on the guidance in the README:

mpirun -np 8 python -u train.py --env-name='FetchPush-v1' 2>&1 | tee push.log

But surprisingly, I find that running

mpirun -np 1 python -u train.py --env-name='FetchPush-v1' 2>&1 | tee push.log

does not work at all.

Do you happen to know the reason why it does not work?

Ericonaldo commented 3 years ago

I find that even with a larger batch size, HER still does not work. Do you know why?

TianhongDai commented 3 years ago

@Ericonaldo Hi, actually, using multiple MPI workers is effectively equivalent to using a large batch size. Could I know what batch size (the larger one) you used when training the push task, please?

Ericonaldo commented 3 years ago

Hi, I've tried 4 processes and 2 processes; both work, but a single process with a batch size of 2048 does not.

TianhongDai commented 3 years ago

@Ericonaldo Hi - my guess is that it comes down to the diversity of samples. With a single process, before the agent updates the network it only collects 2 × 50 = 100 episodes per epoch. The agent then samples batch_size episodes from the replay buffer and takes one transition from each sampled episode for training. In this case, even after 50 epochs, the agent has only collected 5,000 unique episodes (50 × 100). So although you use batch_size=2048, the diversity of samples is still limited when num_process=1, and you end up sampling many repeated episodes during training. With num_process=2, the agent can sample transitions from twice as many collected episodes. But I'm not sure whether this is the real reason; further discussion is welcome.
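As a rough illustration of the counting above, a minimal back-of-envelope sketch (illustrative only, not the repo's actual code), assuming the default schedule of 50 cycles per epoch and 2 rollouts per cycle per MPI worker:

cycles_per_epoch = 50
rollouts_per_cycle_per_worker = 2   # assumed default of the repo
epochs = 50
batch_size = 2048

for num_workers in (1, 2, 8):
    episodes_per_epoch = num_workers * cycles_per_epoch * rollouts_per_cycle_per_worker
    unique_episodes = epochs * episodes_per_epoch
    print(f"workers={num_workers}: {episodes_per_epoch} episodes/epoch, "
          f"{unique_episodes} unique episodes after {epochs} epochs, "
          f"each update draws {batch_size} transitions from this pool")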

Ericonaldo commented 3 years ago

If this is true, we should be able to succeed by scaling the number of collected episodes by K times, right? However, that does not seem to work either.

TianhongDai commented 3 years ago

@Ericonaldo Hmm - that's a good point. There is an interesting detail here: https://github.com/TianhongDai/hindsight-experience-replay/blob/master/mpi_utils/mpi_utils.py#L21-L22 . I follow OpenAI's setting: they use sum instead of avg to gather the gradients from the MPI workers. I will try the avg operation to see whether it affects the performance and will update here later.

Ericonaldo commented 3 years ago

Great, many thanks. I ran these experiments because my own implementation of HER can only reach a success rate of 70-80%, and I am trying to figure out what really matters in the training.

TianhongDai commented 3 years ago

@Ericonaldo Yes - the HER implementation is quite tricky...

TianhongDai commented 3 years ago

@Ericonaldo I found that the SUM operator does influence the performance: https://github.com/TianhongDai/hindsight-experience-replay/blob/master/mpi_utils/mpi_utils.py#L21-L22 Here, instead of using the plain SUM, I average the gradient over the number of MPI workers:

comm.Allreduce(flat_grads, global_grads, op=MPI.SUM)
# average the gradient.
global_grads /= comm.Get_size()

Then I plotted the training curves using 2 MPI workers: when the gradient is averaged, the performance drops. In other words, if we don't average the gradient, the network update becomes something like x' = x - (lr * num_mpi) * avg_grad (assuming a simple SGD optimizer), so the effective learning rate is increased. I'm not sure if this is the main reason, but we can keep doing more experiments to verify it.
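A minimal, self-contained sketch of the two reduction choices being compared (illustrative only, not the repo's exact code; it assumes the gradients of all parameters have already been flattened into a single NumPy array):

import numpy as np
from mpi4py import MPI

def reduce_grads(flat_grads, average=False):
    # sum the flattened gradients across all MPI workers
    comm = MPI.COMM_WORLD
    global_grads = np.zeros_like(flat_grads)
    comm.Allreduce(flat_grads, global_grads, op=MPI.SUM)
    if average:
        # divide by the worker count -> plain averaged gradient
        global_grads /= comm.Get_size()
    # without averaging, a simple SGD step effectively becomes
    # x' = x - (lr * num_workers) * avg_grad
    return global_grads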

[plot: success-rate training curves with 2 MPI workers, summed vs. averaged gradients]

Ericonaldo commented 3 years ago

This seems like an important reason, but when I run with a single process, there is just no evidence of learning at all... (at least the averaged gradient with 2 processes learns, albeit slowly).

TianhongDai commented 3 years ago

@Ericonaldo Yes - I agree, we need to carry out more experiments to verify this. We can use this issue to continue the discussion.

Ericonaldo commented 2 years ago

I think the learning rates for both the policy network and the value network are important hyper-parameters for these goal-conditioned tasks. After fine-tuning them, I found that even a single process can achieve good results.
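For reference, a small arithmetic sketch of how the effective step size discussed earlier changes with the number of workers (the numbers below are illustrative assumptions, not the values that were actually tuned):

lr = 0.001          # assumed default learning rate
num_workers = 8     # a multi-worker setup known to learn well

# with SUM-reduced gradients each update effectively applies
#   x' = x - (lr * num_workers) * avg_grad,
# so a single worker at the same lr takes a much smaller effective step;
# retuning lr for the process count is one way to compensate
print("effective step, 8 workers:", lr * num_workers)
print("effective step, 1 worker :", lr)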

TianhongDai commented 2 years ago

> I think the learning rates for both the policy network and the value network are important hyper-parameters for these goal-conditioned tasks. After fine-tuning them, I found that even a single process can achieve good results.

@Ericonaldo Thanks! This is a great finding.