Ericonaldo / visual_wholebody

Train a loco-manipulation dog with RL
https://wholebody-b1.github.io/

Potential issue and solution for poor high-level performance #11

Open · Ericonaldo opened this issue 1 month ago

Ericonaldo commented 1 month ago

Hi all,

Recently, @tenhearts reported a problem in high-level training: the object may sometimes penetrate the table (see her screenshot below). This can cause many failed grasps, and she said she finally got great performance after fixing this problem (though her fix may not be the right solution).

[screenshot: object penetrating the table]

At first I thought it was a bug in setting the table heights, so I checked for a while. Eventually, I found that it is not related to the code but to the collision checking in Isaac Gym. It is important to note that the precision of collision checking may depend on the hardware and even on the number of parallel environments, which I guess is why people get different results on different machines. However, it is stable on a particular machine and easy to reproduce.

According to the official documentation, we can change the contact_offset parameter (in the config file) to prevent similar problems. Special thanks to @CreeperLin for helping me debug this.
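For reference, contact_offset maps to Isaac Gym's PhysX simulation parameters. Below is a minimal sketch using the raw gymapi API; in this codebase the same field is exposed through the config file, so normally you only need to change it there:

```python
from isaacgym import gymapi

# contact_offset is the distance at which PhysX starts generating contacts;
# a larger value detects collisions earlier and reduces penetration, at the
# cost of producing more contact pairs.
sim_params = gymapi.SimParams()
sim_params.physx.contact_offset = 0.04  # default in this codebase: 0.02
sim_params.physx.rest_offset = 0.0      # resting separation; usually left at 0
```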

Here are some examples I tested on my 4060 PC:

contact_offset=0.02, the default in this codebase: there is a very obvious penetration problem, and the affected environment keeps resetting due to the 'fall down' termination condition:

https://github.com/user-attachments/assets/069b7c2c-126f-4053-9a59-0683ee11ac97

After changing to contact_offset=0.04, the problem is solved:

https://github.com/user-attachments/assets/7d16c10a-fd69-4824-b679-934df0015461

Could you try training your own high-level model after fixing this issue? I'd like to help if there are any further problems.

hatimwen commented 1 month ago

Thanks! I'll try it and share my results here later.

sinaqahremani commented 1 month ago

I have started training with this fix; I'll let you know the result tomorrow.

sinaqahremani commented 1 month ago

Before trying this fix, I ran the algorithm on an HP PC with an i9 CPU, 32 GB of memory, and an RTX 6000 GPU, with the seed fixed to 101. Now I have run it on the same system, with the same seed, using contact_offset=0.04 and contact_offset=0.03. Here are the success rates:

[screenshot: success rates with contact_offset=0.02, 0.03, and 0.04]

As you can see, in my case the algorithm is getting worse.

hatimwen commented 1 month ago

Hi,

I've encountered an issue with the updated code when setting contact_offset=0.04. After approximately 20k training steps, the following errors consistently occur:

/buildAgent/work/45f70df4210b2e3e/source/gpunarrowphase/src/PxgNarrowphaseCore.cpp (1466) : internal error : Contact buffer overflow detected, please increase its size in the scene desc!

/buildAgent/work/45f70df4210b2e3e/source/gpunarrowphase/src/PxgNphaseImplementationContext.cpp (710) : internal error : Contact buffer overflow detected, please increase its size in the scene desc!

/buildAgent/work/45f70df4210b2e3e/source/physx/src/NpScene.cpp (3189) : internal error : PhysX Internal CUDA error. Simulation can not continue!

/buildAgent/work/45f70df4210b2e3e/source/gpunarrowphase/src/PxgNarrowphaseCore.cpp (9908) : internal error : GPU compressContactStage1 fail to launch kernel stage 1!!

/buildAgent/work/45f70df4210b2e3e/source/gpunarrowphase/src/PxgNarrowphaseCore.cpp (9945) : internal error : GPU compressContactStage2 fail to launch kernel stage 1!!

[Error] [carb.gym.plugin] Gym cuda error: an illegal memory access was encountered: ../../../source/plugins/carb/gym/impl/Gym/GymPhysX.cpp: 4084
[Error] [carb.gym.plugin] Gym cuda error: an illegal memory access was encountered: ../../../source/plugins/carb/gym/impl/Gym/GymPhysX.cpp: 4092
[Error] [carb.gym.plugin] Gym cuda error: an illegal memory access was encountered: ../../../source/plugins/carb/gym/impl/Gym/GymPhysX.cpp: 3362
[Error] [carb.gym.plugin] Gym cuda error: an illegal memory access was encountered: ../../../source/plugins/carb/gym/impl/Gym/GymPhysX.cpp: 3417

Do you have any suggestions on how to resolve this?

Additionally, I ran some performance comparisons between contact_offset settings of 0.02 and 0.04. You can find the details here: wandb log. Setting contact_offset to 0.04 does show some improvements, but the issue mentioned above is a blocker.

Thanks in advance for your help!

Ericonaldo commented 1 month ago

Can you try to increase max_gpu_contact_pairs?
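e.g., something like this in the sim params (a rough sketch using the raw gymapi API; the value is just an example to start from):

```python
from isaacgym import gymapi

# A larger contact_offset produces more contact pairs, which can overflow the
# GPU contact buffer and trigger the "Contact buffer overflow" errors above.
# Raising the limit gives PhysX more room (example value shown).
sim_params = gymapi.SimParams()
sim_params.physx.max_gpu_contact_pairs = 2 * 1024 * 1024  # e.g. 2M, up from the 1M default
```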

hatimwen commented 1 month ago

Can you try to increase max_gpu_contact_pairs?

ok, I will try it.

sinaqahremani commented 1 month ago

Any thoughts on my case? @Ericonaldo

Ericonaldo commented 1 month ago

@sinaqahremani Can you try adding the --debug flag during training to visualize the scene and look for potential problems?

sinaqahremani commented 1 month ago

@Ericonaldo When I use --debug it won't upload data to Wandb!

Ericonaldo commented 1 month ago

You do not have to upload; just visualize and check if there is anything weird.

sinaqahremani commented 4 weeks ago

@Ericonaldo I ran it in debug mode and nothing weird happened. However, the success rates are about zero, and they don't change.

Ericonaldo commented 4 weeks ago

Actually, the try3 line looks great. Have you tried continuing that training and checking the results?

sinaqahremani commented 4 weeks ago

You mean continuing beyond 60000 steps? Actually, I thought I needed to reach what you got within 60000 steps.

Ericonaldo commented 4 weeks ago

I am not sure about your case for now (you can use my latest uploaded low-level model if you have not tried that), but once I find out something I will let you know. I also encourage you to experiment, for example by setting a static table height to see if the learning goes well. If you find out something, please also let me know :-0

hatimwen commented 2 weeks ago

Can you try to increase max_gpu_contact_pairs?

ok, I will try it.


Hi @Ericonaldo ,

Progress Update:

I increased max_gpu_contact_pairs from 1M to 2M, which allowed the experiment with contact_offset=0.04 to run smoothly.

However, after 100k training steps, the final success rates remain low (~4%). This is still significantly below your reported results.

You can find my training logs here.

Screenshot: [success rates after 100k training steps]

Any suggestions?

Ericonaldo commented 2 weeks ago

I don't have any ideas right now. Maybe you can try setting a static table height and check whether it works well (e.g., set the table height to 0 and learn to grasp from the floor only). And always visualize the scene you built to ensure there are no bugs.
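For example, something along these lines in the task config (the field names below are only illustrative, not the actual ones in this codebase; check where the table height range is randomized):

```python
# Illustrative sketch only: disable table-height randomization so grasping is
# decoupled from any table-setup issue. The real config fields in this repo
# will have different names.
class TableCfg:
    randomize_height = False
    height_range = [0.25, 0.25]  # degenerate range -> static 0.25 m table
    # extreme sanity check: height_range = [0.0, 0.0] to grasp from the floor
```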

sinaqahremani commented 2 weeks ago

@Ericonaldo would it be possible to make the teacher policy you are using publicly available?

Ericonaldo commented 2 weeks ago

I cannot open-source the one shown in the paper, as the related low-level policy is currently used to support our robots; that's why I encourage you to train your own model. Although I could train a new one using the public low-level policy (it could be done in days), it is meaningless to simply evaluate my model...

tenhearts commented 2 weeks ago

I still have no idea why performance differs across devices. But at least on the same GPU (RTX 3090) the author used, I can reach the same results after fixing this issue (also with a fixed table height). I trained for 10k iterations here, so the success rate increased even further. wandb

hatimwen commented 2 weeks ago

I don't have any ideas right now. Maybe you can try setting a static table height and check whether it works well (e.g., set the table height to 0 and learn to grasp from the floor only). And always visualize the scene you built to ensure there are no bugs.


Hi @Ericonaldo ,

I just trained with a fixed table height (0.25 m), and the results have significantly improved:

[screenshot: per-category success rates with fixed table height]

From the screenshot, it's clear that most categories perform well, except for Ball, which has a very low success rate (~1.77%), and Bowl, which also has a relatively low success rate (55.13%).

Btw, from my perspective, ensuring the reproducibility of the algorithm is crucial for better understanding and following this exciting work. That’s why I’ve been working alongside others to reproduce the results reported in the paper.

Still, thank you very much for making the code available and for your assistance!

Ericonaldo commented 2 weeks ago

@hatimwen I never regarded reproducing the results as unimportant, but it consistently works on my side, as I showed in my wandb, and I have also made efforts to find possible problems. Currently, however, there are always more important things on my schedule, and I cannot test the algorithm on different devices to make sure it works the same everywhere. From my perspective, I have demonstrated the usefulness of this method, and since the results differ across devices (most probably due to Isaac Gym), you can tune it on your own by referring to the codebase if you really need the algorithm to work on your side.

hatimwen commented 2 weeks ago

@hatimwen I never regarded reproducing the results as unimportant, but it consistently works on my side, as I showed in my wandb, and I have also made efforts to find possible problems. Currently, however, there are always more important things on my schedule, and I cannot test the algorithm on different devices to make sure it works the same everywhere. From my perspective, I have demonstrated the usefulness of this method, and since the results differ across devices (most probably due to Isaac Gym), you can tune it on your own by referring to the codebase if you really need the algorithm to work on your side.

Hi @Ericonaldo ,

I genuinely appreciate your efforts and just wanted to clarify why this is important to me. There was no other intent, and I apologize if my previous words were unclear.

The method is impressive, and I agree the issue most likely comes from the environment rather than the algorithm.

Thanks a lot for your suggestions and help in this matter!