AndrejOrsula / drl_grasping

Deep Reinforcement Learning for Robotic Grasping from Octrees
https://arxiv.org/pdf/2208.00818
BSD 3-Clause "New" or "Revised" License

Training Error & Docker Tips #108

Closed lucarei closed 1 year ago

lucarei commented 1 year ago

Hi Andrej.

I am testing the training phase through "ex_train.bash" with Panda and ENV="Grasp-OctreeWithIntensity-Gazebo-v0". My PC configuration is: i9-9940X, RTX 2080 TI, 64 GB RAM.

There are two problems:

  1. The serious one: the simulation stops after 10K steps for no apparent reason (I have experienced it 3 times, always with the same behavior).
  2. During these 10K steps, the recorded scores for grasping and lifting (ep_rew_mean_grasp and ep_rew_mean_lift in TensorBoard) are always zero (worse results than in Docker).

Here are the results in TensorBoard:

[TensorBoard screenshots]

Here is the error in the terminal; it seems to be related to octrees, but I cannot tell anything more from it:

[terminal screenshot]

Have you ever seen this behavior? Could you suggest a solution?

AndrejOrsula commented 1 year ago

Hello,

  1. The serious one: the simulation stops after 10K steps for no apparent reason (I have experienced it 3 times, always with the same behavior).

I have never experienced an issue where the simulation would just stop. It sounds like a deadlock of some sort. The 10k steps sound oddly specific, though. Unfortunately, several things happen at the 10k-step mark under the default/your configuration, so you could shift their values apart to see which one is responsible (see the sketch below):

  1. save_freq: a checkpoint of the agent is saved (every 10k steps by default).
  2. eval_freq: the policy is periodically evaluated (every 10k steps by default).
  3. learning_starts: gradient updates only begin once 10k steps have been collected.
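For illustration, here is a minimal stable-baselines3 sketch of where these three parameters live. It uses a stand-in Gym environment rather than the repository's actual training script, and the shifted values are only an example of separating the three events:

```python
# Minimal sketch (illustrative only, not drl_grasping's train script):
# the three events that coincide at 10k steps, shifted apart so the
# culprit can be identified.
import gym
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import CheckpointCallback, EvalCallback

env = gym.make("Pendulum-v1")  # stand-in for Grasp-OctreeWithIntensity-Gazebo-v0

callbacks = [
    CheckpointCallback(save_freq=20_000, save_path="./checkpoints"),  # was 10k
    EvalCallback(gym.make("Pendulum-v1"), eval_freq=20_000),          # was 10k
]

model = SAC("MlpPolicy", env, learning_starts=2_000)  # was 10k
model.learn(total_timesteps=50_000, callback=callbacks)
```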

On the other hand, I have experienced the simulation crashing (not stopping) due to physics errors in cases when an object has excessive velocity (the rate of occurrence might increase with step_size). It is compensated by placing collision planes around the experiment, which greatly reduces/eliminates these crashes.

  2. During these 10K steps, the recorded scores for grasping and lifting (ep_rew_mean_grasp and ep_rew_mean_lift in TensorBoard) are always zero (worse results than in Docker).

As learning does not begin until 10k steps have been collected, there shouldn't really be any improvement before then. However, if there is absolutely no reward at all, then maybe something is wrong with the simulation. You would need to visualize it and see if you can spot the origin of the issue (e.g. collisions not working for whatever reason).
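If it helps, a rough sanity-check loop over the environment can reveal whether any reward is ever emitted, independent of TensorBoard. This is only a sketch; it assumes the drl_grasping environments get registered with Gym when the package is imported:

```python
# Rough reward sanity check; the `import drl_grasping` line assumes the
# package registers its Gazebo environments with Gym on import.
import gym
import drl_grasping  # assumed to register the environments

env = gym.make("Grasp-OctreeWithIntensity-Gazebo-v0")
obs = env.reset()
for step in range(1000):
    obs, reward, done, info = env.step(env.action_space.sample())
    if reward != 0.0:
        print(f"step {step}: reward={reward}")  # any non-zero reward at all?
    if done:
        obs = env.reset()
env.close()
```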

lucarei commented 1 year ago

Thank you for your support.

I modified the parameters you cited: save_freq=20K, eval_freq=20K, and learning_starts decreased to 2K. Fortunately, the training has now made it past the 10K-step wall.

Now the remaining problem concerns the reward (ep_rew_mean_grasp and ep_rew_mean_lift always equal to zero) and the success rate (also zero), even though I decreased the value of learning_starts.

As learning does not begin until 10k steps have been collected, there shouldn't really be any improvement before then. However, if there is absolutely no reward at all, then maybe something is wrong with the simulation.

The environment in Gazebo (also seen through the sensors in RViz) seems to be fine. I can correctly see the robot (with all link frames), the texture, and the objects to be grasped; the randomization (of camera, terrain, and objects) also seems to work. There is just one issue: sometimes a spawned object falls through the terrain, passing through the ground like a ghost. I assume this is unrelated to my reward problem, since I experienced the same phenomenon in your latest Docker image.

It seems unexpected that the rewards for grasp and lift stay at zero over 15-18K steps (the length of my tests). I am worried that something in my installation is corrupted. My idea is to test a pretrained agent, but I read that this feature is disabled.

Note: Submodule pretrained_agents is currently incompatible with drl_grasping version 2.0.0.

Is it possible to enable it? Can I do it myself, or do you plan to activate this feature again?

AndrejOrsula commented 1 year ago

My idea is to test a pretrained agent, but I read that this feature is disabled. Is it possible to enable it? Can I do it myself, or do you plan to activate this feature again?

As mentioned in one of my previous comments, I have no plans to continue developing this repository. It would take a considerable amount of time just to update to Gazebo Garden, and I unfortunately do not have time for that due to other responsibilities.

If you want to test one of the pretrained agents, then I recommend using the prebuilt Docker images for versions 1.0.0 or 1.1.0. Both of these contain pretrained agents that should work.
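Inside those images, evaluating a pretrained agent essentially comes down to loading the saved model with stable-baselines3. A hypothetical sketch; the algorithm (TQC) and the checkpoint path are assumptions on my part, so check the pretrained_agents submodule inside the image for the actual files:

```python
# Hypothetical sketch of rolling out a pretrained agent; the choice of TQC
# and the checkpoint path are assumptions, not verified against the image.
import gym
from sb3_contrib import TQC

env = gym.make("Grasp-OctreeWithIntensity-Gazebo-v0")
model = TQC.load("pretrained_agents/model.zip", env=env)  # path is illustrative

obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    total_reward += reward
print(f"episode return: {total_reward}")
```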

Continuing on the same subject, I cannot guarantee that a local installation is still functional due to updates of various dependencies. Therefore, an idea would be to just use Docker if that is possible for your use case. You could do all your development inside Docker (e.g. via the VS Code Remote - Containers extension) and then commit the changes to the image (docker commit). You could also clone the repository and mount the source code as a volume, so that your changes always persist. This could also reduce some development time on your side if you want to use UR robots (#107), as you could start from Docker image 1.0.0/1.1.0, which already supports this robot.
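If you prefer scripting the container setup rather than typing docker run by hand, the volume-mount workflow can also be expressed with the Docker SDK for Python (pip install docker). The image tag and both paths below are assumptions; adjust them to your setup:

```python
# Equivalent of `docker run -v <host>:<container> ...` via the Docker SDK;
# the image tag and both paths are assumptions - adjust to your setup.
import docker

client = docker.from_env()
container = client.containers.run(
    "andrejorsula/drl_grasping:1.1.0",  # assumed prebuilt image tag
    command="bash",
    volumes={
        "/home/user/drl_grasping": {    # host checkout of the repository
            "bind": "/root/drl_grasping",
            "mode": "rw",               # changes persist on the host
        }
    },
    tty=True,
    stdin_open=True,
    detach=True,
)
print(container.name)  # attach later with: docker exec -it <name> bash
```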

lucarei commented 1 year ago

It might be the best idea. I have already tested your project in Docker (and it works perfectly), but I don't know what the development possibilities are on that platform. For example, I need to develop the following things:

  1. Substitute the UR5 with a UR10e (sorry, I made a mistake in discussion #107, but I will leave it there for other developers);
  2. Sim2Real implementation;

Are these points feasible in Docker (with version 1.1.0, obviously)?

AndrejOrsula commented 1 year ago

It should be possible, but I have not tried it.

  1. Substitute the UR5 with a UR10e

You could take a look at ur5_rg2_ign and change the description and meshes to match the UR10e robot. Then just mount/commit your new repository inside the container.

  2. Sim2Real implementation

It should be the same as if you had installed it locally. With docker run --device ... (or docker run --privileged ... for simplicity, at the cost of security), the container should be able to communicate with the real robot.

lucarei commented 1 year ago

I am following this path, just to be sure that everything works properly! Thanks so much for your patience.