AndrejOrsula / drl_grasping

Deep Reinforcement Learning for Robotic Grasping from Octrees
https://arxiv.org/pdf/2208.00818
BSD 3-Clause "New" or "Revised" License

Training Error & Docker Tips #108

Closed lucarei closed 1 year ago

lucarei commented 1 year ago

Hi Andrej.

I am testing the training phase through "ex_train.bash" with Panda and ENV="Grasp-OctreeWithIntensity-Gazebo-v0". My PC configuration is: i9-9940X, RTX 2080 TI, 64 GB RAM.

There are two problems:

  1. The serious one: the simulation stops after 10K steps for no apparent reason (I have experienced it 3 times, always with the same behavior).
  2. During these 10K steps, the recorded scores for grasping and lifting (ep_rew_mean_grasp and ep_rew_mean_lift in TensorBoard) are always zero (worse results than in Docker).

Here are the results in TensorBoard:

[TensorBoard screenshots]

Here is the error in the terminal; it seems to be related to octrees, but I cannot tell anything more from it:

[terminal screenshot]

Have you ever seen this behavior? Could you suggest a solution?

AndrejOrsula commented 1 year ago

Hello,

  1. The serious one: the simulation stops after 10K steps for no apparent reason (I have experienced it 3 times, always with the same behavior).

I have never experienced an issue where the simulation would just stop. It sounds like a deadlock of some sort. The 10k steps sound oddly specific, though. Unfortunately, several things happen at the 10k-step mark under the default/your configuration, so you could shift their values apart to see which one is responsible (see the sketch below):

  1. save_freq: a checkpoint of the agent is saved (every 10k steps by default).
  2. eval_freq: the policy is periodically evaluated (every 10k steps by default).
  3. learning_starts: gradient updates only begin once 10k steps have been collected.
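For illustration, here is a minimal stable-baselines3 sketch of where these three parameters live. It uses a stand-in Gym environment rather than the repository's actual training script, and the shifted values are only an example of separating the three events:

```python
# Minimal sketch (illustrative only, not drl_grasping's train script):
# the three events that coincide at 10k steps, shifted apart so the
# culprit can be identified.
import gym
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import CheckpointCallback, EvalCallback

env = gym.make("Pendulum-v1")  # stand-in for Grasp-OctreeWithIntensity-Gazebo-v0

callbacks = [
    CheckpointCallback(save_freq=20_000, save_path="./checkpoints"),  # was 10k
    EvalCallback(gym.make("Pendulum-v1"), eval_freq=20_000),          # was 10k
]

model = SAC("MlpPolicy", env, learning_starts=2_000)  # was 10k
model.learn(total_timesteps=50_000, callback=callbacks)
```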

On the other hand, I have experienced the simulation crashing (not stopping) due to physics errors in cases when an object has excessive velocity (the rate of occurrence might increase with step_size). It is compensated by placing collision planes around the experiment, which greatly reduces/eliminates these crashes.

  2. During these 10K steps, the recorded scores for grasping and lifting (ep_rew_mean_grasp and ep_rew_mean_lift in TensorBoard) are always zero (worse results than in Docker).

As learning does not begin until 10k steps have been collected, there shouldn't really be any improvement before then. However, if there is absolutely no reward at all, then maybe something is wrong with the simulation. You would need to visualize it and see if you can spot the origin of the issue (e.g. collisions not working for whatever reason).
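If it helps, a rough sanity-check loop over the environment can reveal whether any reward is ever emitted, independent of TensorBoard. This is only a sketch; it assumes the drl_grasping environments get registered with Gym when the package is imported:

```python
# Rough reward sanity check; the `import drl_grasping` line assumes the
# package registers its Gazebo environments with Gym on import.
import gym
import drl_grasping  # assumed to register the environments

env = gym.make("Grasp-OctreeWithIntensity-Gazebo-v0")
obs = env.reset()
for step in range(1000):
    obs, reward, done, info = env.step(env.action_space.sample())
    if reward != 0.0:
        print(f"step {step}: reward={reward}")  # any non-zero reward at all?
    if done:
        obs = env.reset()
env.close()
```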

lucarei commented 1 year ago

Thank you for your support.

I modified the parameters you cited: save_freq=20K, eval_freq=20K, and learning_starts decreased to 2K. Fortunately, the training has now made it past the 10K-step wall.

Now the remaining problem concerns the reward (ep_rew_mean_grasp and ep_rew_mean_lift always equal to zero) and the success rate (also zero), even though I decreased the value of learning_starts.

As learning does not begin until 10k steps have been collected, there shouldn't really be any improvement before then. However, if there is absolutely no reward at all, then maybe something is wrong with the simulation.

The environment in Gazebo (also seen through the sensors in RViz) seems to be fine. I can correctly see the robot (with all link frames), the texture, and the objects to be grasped; the randomization (of camera, terrain, and objects) also seems to work. There is just one issue: sometimes a spawned object falls through the terrain, passing through the ground like a ghost. I assume this is unrelated to my reward problem, since I experienced the same phenomenon in your latest Docker image.

It seems unexpected that the rewards for grasp and lift stay at zero over 15-18K steps (the length of my tests). I am worried that something in my installation is corrupted. My idea is to test a pretrained agent, but I read that this feature is disabled.

Note: Submodule pretrained_agents is currently incompatible with drl_grasping version 2.0.0.

Is it possible to enable it? Can I do it myself, or do you plan to activate this feature again?

AndrejOrsula commented 1 year ago

My idea is to test a pretrained agent, but I read that this feature is disabled. Is it possible to enable it? Can I do it myself, or do you plan to activate this feature again?

As mentioned in one of my previous comments, I have no plans to continue developing this repository. It would take a considerable amount of time just to update to Gazebo Garden, and I unfortunately do not have time for that due to other responsibilities.

If you want to test one of the pretrained agents, then I recommend using the prebuilt Docker images for versions 1.0.0 or 1.1.0. Both of these contain pretrained agents that should work.
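Inside those images, evaluating a pretrained agent essentially comes down to loading the saved model with stable-baselines3. A hypothetical sketch; the algorithm (TQC) and the checkpoint path are assumptions on my part, so check the pretrained_agents submodule inside the image for the actual files:

```python
# Hypothetical sketch of rolling out a pretrained agent; the choice of TQC
# and the checkpoint path are assumptions, not verified against the image.
import gym
from sb3_contrib import TQC

env = gym.make("Grasp-OctreeWithIntensity-Gazebo-v0")
model = TQC.load("pretrained_agents/model.zip", env=env)  # path is illustrative

obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    total_reward += reward
print(f"episode return: {total_reward}")
```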

Continuing on the same subject, I cannot guarantee that a local installation is still functional due to updates of various dependencies. Therefore, an idea would be to just use Docker if that is possible for your use case. You could do all your development inside Docker (e.g. via the VS Code Remote - Containers extension) and then commit the changes to the image (docker commit). You could also clone the repository and mount the source code as a volume, so that your changes always persist. This could also reduce some development time on your side if you want to use UR robots (#107), as you could start from Docker image 1.0.0/1.1.0, which already supports this robot.
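If you prefer scripting the container setup rather than typing docker run by hand, the volume-mount workflow can also be expressed with the Docker SDK for Python (pip install docker). The image tag and both paths below are assumptions; adjust them to your setup:

```python
# Equivalent of `docker run -v <host>:<container> ...` via the Docker SDK;
# the image tag and both paths are assumptions - adjust to your setup.
import docker

client = docker.from_env()
container = client.containers.run(
    "andrejorsula/drl_grasping:1.1.0",  # assumed prebuilt image tag
    command="bash",
    volumes={
        "/home/user/drl_grasping": {    # host checkout of the repository
            "bind": "/root/drl_grasping",
            "mode": "rw",               # changes persist on the host
        }
    },
    tty=True,
    stdin_open=True,
    detach=True,
)
print(container.name)  # attach later with: docker exec -it <name> bash
```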

lucarei commented 1 year ago

It might be the best idea. I have already tested your project in Docker (and it works perfectly), but I don't know what the development possibilities are on that platform. For example, I need to develop the following things:

  1. Substitute the UR5 with a UR10e (sorry, I made a mistake in discussion #107, but I will leave it there for other developers);
  2. Sim2Real implementation;

Are these points feasible in Docker (with version 1.1.0, obviously)?

AndrejOrsula commented 1 year ago

It should be possible, but I have not tried it.

  1. Substitute the UR5 with a UR10e

You could take a look at ur5_rg2_ign and change the description and meshes to match the UR10e robot. Then just mount/commit your new repository inside the container.

  2. Sim2Real implementation

It should be the same as if you had installed it locally. With docker run --device ... (or docker run --privileged ... for simplicity, at the cost of security), the container should be able to communicate with the real robot.

lucarei commented 1 year ago

I am following this path, just to be sure that everything works properly! Thanks so much for your patience.