alirezakazemipour / DDPG-HER

Implementation of the Deep Deterministic Policy Gradient and Hindsight Experience Replay.

Request Suggestions For Hand Environments #3

Closed cm107 closed 2 years ago

cm107 commented 3 years ago

First of all, thank you very much for writing a PyTorch implementation of DDPG+HER. I found that this implementation works very well for all of the Fetch environments available in gym. Example: FetchPickAndPlace-v1

However, using the same approach for the Hand environments doesn't seem to work as well. Example: HandManipulateEgg-v0

It's understandable that the performance wouldn't be as good, since the Hand environments seem more difficult than the Fetch environments. I was hoping that I could increase the success rate by increasing the number of epochs, but the problem at hand doesn't seem to be that simple.

Does anyone have any suggestions for how to improve the current repository so that it achieves a higher success rate for the Hand environments?

alirezakazemipour commented 3 years ago

Hello @cm107,

Thanks for sharing your thoughts. At the time I was working on this project, I was focused solely on solving the FetchPickAndPlace-v1 environment. For example, this line is specific to FetchPickAndPlace-v1: it prevents the task from being solved trivially when the box and the red-dotted target location happen to spawn on top of each other. So I have not inspected the Hand environments.

Despite your kind words about the code, it's a bit sloppy; there are still many commented-out lines in it. If you ever end up solving the Hand environments, a PR would be highly appreciated so that the current repository can include your work too.

My final thought about the Hand environments is that they might need a larger number of workers or smaller learning rates. What's your opinion on that?

cm107 commented 3 years ago

@alirezakazemipour

> At the time I was working on this project, I was focused solely on solving the FetchPickAndPlace-v1 environment.

I see. As a matter of fact, I have tested all of the robotics environments with this project's implementation of DDPG+HER. The implementation works very well with not just FetchPickAndPlace-v1, but with FetchPush-v1 and FetchReach-v1 as well. FetchSlide-v1 had about a 0.5~0.7 success rate after 200 epochs, but that's just because the task was more difficult than the others.

> For example, this line is specific to FetchPickAndPlace-v1: it prevents the task from being solved trivially when the box and the red-dotted target location happen to spawn on top of each other. So I have not inspected the Hand environments.

For the Hand environments, it looks like the goal vector is (x, y, z, qw, qx, qy, qz), where x, y, z correspond to the position and qw, qx, qy, qz to the orientation quaternion. Indeed, taking the norm of this whole vector is not appropriate, because position and rotation should be thresholded separately. That being said, I don't see that line of code as a particular problem for the Hand environments; it just ensures that the agent doesn't train on trivial scenarios where the achieved_goal spawns too close to the desired_goal. Even if it isn't applied in exactly the right way for the Hand environments, it still does a sufficient job of filtering out trivial cases.
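For example, splitting the goal and thresholding each part separately could look roughly like this (the helper name and the 0.01 m / 0.1 rad thresholds are just illustrative assumptions, not values taken from this repo or from gym):

```python
import numpy as np

def hand_goal_distances(achieved_goal, desired_goal):
    """Split a 7D Hand goal (x, y, z, qw, qx, qy, qz) into a position
    distance (meters) and a rotation distance (radians)."""
    pos_a, quat_a = achieved_goal[..., :3], achieved_goal[..., 3:]
    pos_d, quat_d = desired_goal[..., :3], desired_goal[..., 3:]

    # Euclidean distance between the two positions.
    pos_dist = np.linalg.norm(pos_a - pos_d, axis=-1)

    # Angle between the two orientations: 2 * arccos(|<q_a, q_d>|).
    # abs() handles the q / -q ambiguity; clip() guards against values
    # slightly outside [0, 1] due to floating-point error.
    dot = np.clip(np.abs(np.sum(quat_a * quat_d, axis=-1)), 0.0, 1.0)
    rot_dist = 2.0 * np.arccos(dot)
    return pos_dist, rot_dist

# Example with dummy unit-quaternion goals:
achieved = np.array([0.00, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
desired  = np.array([0.02, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0])
pos_dist, rot_dist = hand_goal_distances(achieved, desired)

# Hypothetical per-component check instead of a single 0.05 norm test:
is_trivial = pos_dist <= 0.01 and rot_dist <= 0.1
```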

Despite your kind opinion about the code, it's a bit sloppy since there are still many commented lines in it and it is highly appreciated if you make a PR so if you ever ended up solving Hand environments then, the current repository would include that too.

Right now I'm doing my testing in a private repo, and so far I have only made changes to the class interfaces. (This is just for my own convenience, so the changes are trivial so far.) If I make any progress with the Hand environments I'll be sure to fork it and make a pull request. 👍

And my final thought about Hand environments is that they might need a large number of workers or smaller Learning Rates, what's your opinion about that?

I'm still new to RL, so I don't have much insight to offer yet, but I'll give your suggestions a try. By "number of workers", are you referring to this minibatch (defined by MAX_EPISODES)?

I'll try lowering these learning rates.

Do you think that changing parameters related to the q-function would help?

This is just a guess on my part, but do you think adjusting k_future would help? This directly relates to the proportion of goals in the replay buffer that are replaced with future goals, right?

cm107 commented 3 years ago

> For the Hand environments, it looks like the goal vector is (x, y, z, qw, qx, qy, qz), where x, y, z correspond to the position and qw, qx, qy, qz to the orientation quaternion.

Now that I think about it, this may be the key to improving the model's performance on the Hand environments. I've noticed that the trained hand model is usually able to carry the cube/egg/pen to the correct position if the initial angle is already correct, but it has a hard time rotating the object. It may help to use a dynamic threshold that encourages the model to focus on learning to rotate the object at the beginning of training: in other words, make sure the goal distance isn't too large, but also make sure the goal rotation isn't too small (a rough sketch of that idea is below). It might not work, but I guess it's worth a try. I might try it next week, after I try training with smaller learning rates.
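Something along these lines, as a sketch only: the thresholds and the rejection-sampling approach itself are assumptions, and resampling could get slow if the position condition is made too strict.

```python
import numpy as np
import gym

POS_MAX = 0.10  # hypothetical: keep the position part of the goal easy
ROT_MIN = 0.50  # hypothetical: require a non-trivial rotation, in radians

def split_distances(achieved_goal, desired_goal):
    # Position distance plus the rotation angle between the two quaternions.
    pos_dist = np.linalg.norm(achieved_goal[:3] - desired_goal[:3])
    dot = np.clip(abs(np.dot(achieved_goal[3:], desired_goal[3:])), 0.0, 1.0)
    return pos_dist, 2.0 * np.arccos(dot)

env = gym.make("HandManipulateEgg-v0")
env_dict = env.reset()
pos_dist, rot_dist = split_distances(env_dict["achieved_goal"],
                                     env_dict["desired_goal"])

# Resample the episode until the goal emphasizes rotation: position close
# to the spawn, rotation far enough that it is actually worth learning.
while pos_dist > POS_MAX or rot_dist < ROT_MIN:
    env_dict = env.reset()
    pos_dist, rot_dist = split_distances(env_dict["achieved_goal"],
                                         env_dict["desired_goal"])

state = env_dict["observation"]
achieved_goal = env_dict["achieved_goal"]
desired_goal = env_dict["desired_goal"]
```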

alirezakazemipour commented 3 years ago

@cm107

> I see. As a matter of fact, I have tested all of the robotics environments with this project's implementation of DDPG+HER. The implementation works very well with not just FetchPickAndPlace-v1, but with FetchPush-v1 and FetchReach-v1 as well. FetchSlide-v1 had about a 0.5~0.7 success rate after 200 epochs, but that's just because the task was more difficult than the others.

Glad to hear that! :heart_eyes: It would be great if you added that information to the README. :star_struck:

> For the Hand environments, it looks like the goal vector is (x, y, z, qw, qx, qy, qz), where x, y, z correspond to the position and qw, qx, qy, qz to the orientation quaternion. Indeed, taking the norm of this whole vector is not appropriate, because position and rotation should be thresholded separately. That being said, I don't see that line of code as a particular problem for the Hand environments; it just ensures that the agent doesn't train on trivial scenarios where the achieved_goal spawns too close to the desired_goal. Even if it isn't applied in exactly the right way for the Hand environments, it still does a sufficient job of filtering out trivial cases.

The 0.05 threshold I used in this part of the code, where trivial solutions by random spawning are prevented:

    # Keep resetting until the spawned achieved_goal is more than 0.05 away
    # from the desired_goal, so the episode is not already solved at spawn time.
    while np.linalg.norm(achieved_goal - desired_goal) <= 0.05:
        env_dict = env.reset()
        state = env_dict["observation"]
        achieved_goal = env_dict["achieved_goal"]
        desired_goal = env_dict["desired_goal"]

is based on this line. For the Hand environments, I think you should take a look here to find the analogous lines.

> Right now I'm doing my testing in a private repo, and so far I have only made changes to the class interfaces. (This is just for my own convenience, so the changes are trivial so far.) If I make any progress with the Hand environments I'll be sure to fork it and make a pull request. 👍

That's great! :star_struck:

> I'm still new to RL, so I don't have much insight to offer yet, but I'll give your suggestions a try. By "number of workers", are you referring to this minibatch (defined by MAX_EPISODES)?

No. When you execute the command `mpirun -np $(nproc) python3 -u main.py`, multiple parallel workers start interacting with their own copies of the environment. The number of workers is given by `$(nproc)`, which equals the number of your CPU cores. You may find it useful to increase it beyond the number of CPU cores (the original paper uses 19 workers, while my machine has only 8); however, that oversubscribes the machine and puts it under more load than is recommended, so be careful!

> Do you think that changing parameters related to the q-function would help?

> This is just a guess on my part, but do you think adjusting k_future would help? This directly relates to the proportion of goals in the replay buffer that are replaced with future goals, right?

I think tau and gamma already have good values (they are the ones used in the paper), though you may find a smaller tau helpful.
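For context, tau is the coefficient of DDPG's soft target-network update; here is a minimal PyTorch-style sketch (the function name and the 0.05 default are illustrative, not taken from this repo):

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.05):
    # Polyak averaging: a smaller tau makes the target network track the
    # online network more slowly, which usually means more stable
    # (if slower) learning.
    for t_param, param in zip(target_net.parameters(), online_net.parameters()):
        t_param.data.mul_(1.0 - tau).add_(tau * param.data)
```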
For choosing an appropriate k_future, there is a thorough analysis in the paper, so I recommend taking a look at it; it shows that increasing k_future does not necessarily improve performance.
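On your earlier question about what k_future controls: under the common "future" sampling convention from the HER paper and OpenAI Baselines (which this implementation is assumed to follow, though I haven't re-verified the exact line), k_future sets the ratio of relabeled goals to original goals, which maps to a relabeling probability like this:

```python
def future_probability(k_future):
    # Fraction of sampled transitions whose goal is replaced by a goal
    # achieved later in the same episode, under the "future" HER strategy.
    return 1.0 - 1.0 / (1.0 + k_future)

print(future_probability(4))  # 0.8   -> 80% of sampled goals get relabeled
print(future_probability(8))  # ~0.89 -> diminishing returns from larger k
```

So increasing k_future mainly shifts the sampling mix further toward relabeled goals, which is consistent with the paper's observation that bigger is not always better.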