Ericonaldo / visual_wholebody

Train a loco-manipulation dog with RL
https://wholebody-b1.github.io/

Problem of reproducing high-level policy results #4

Open hatimwen opened 1 month ago

hatimwen commented 1 month ago

Hi,

Thank you for sharing this exciting work.

As mentioned in Issue https://github.com/Ericonaldo/visual_wholebody/issues/3#issuecomment-2156458140, it is challenging to reproduce the results of the high-level policy, with the final success rates being near zero.

I suspect the issue might stem from the grasping component. If the task counted merely reaching the target object as a success, the success rate during training could be nearly 100%, indicating that getting close to the object is easy. However, replacing all objects with a single cube (side length 0.045 m) still results in a near-zero success rate. This suggests that even for simple shapes, the trained low-level policy struggles to grasp them effectively.
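To make the distinction concrete, here is a minimal sketch of the two success criteria I have in mind (hypothetical tensor names and thresholds, not the actual success/termination code in this repo):

```python
import torch

def reach_success(ee_pos, obj_pos, reach_thresh=0.05):
    # "Reach" success: the end-effector is within a small distance of the object.
    # ee_pos, obj_pos: (num_envs, 3) tensors of world-frame positions.
    return torch.norm(ee_pos - obj_pos, dim=-1) < reach_thresh

def grasp_success(obj_pos, obj_init_pos, lift_thresh=0.10):
    # "Grasp" success: the object has been lifted above its initial height,
    # which only happens if the gripper is actually holding it.
    return (obj_pos[:, 2] - obj_init_pos[:, 2]) > lift_thresh
```

If the logged metric behaves like the first check, a near-100% success rate during training would say nothing about whether grasping actually works.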

As @Ericonaldo mentioned in https://github.com/Ericonaldo/visual_wholebody/issues/3#issuecomment-2157052061, it was possible to achieve good results with your previous low-level model. However, my trained low-level model appears to perform well initially but fails during high-level training, which is quite confusing.

Btw, do you have any plans to release the pre-trained model weights?

If you have any progress on reproducing the results, please let me know. Thanks!

Chu4nQ1n commented 1 month ago

Hello,

I have also successfully trained the low-level model and obtained good results. However, it faces challenges during the high-level training phase, similar to what has been mentioned above. It appears that while the low-level policy performs well, it struggles when integrated into the high-level policy.

If you have any updates or progress in addressing this issue, or if there are plans to release the pre-trained model weights, I would greatly appreciate it if you could let me know.

Thank you for your attention and support!

hatimwen commented 1 month ago

Hi @Ericonaldo, I just ran the b1z1-float setting, and similarly, the success rates are still near 0. I notice this setting does not require a pretrained low-level policy, so there may be something wrong with the high-level part. Any comments?

Ericonaldo commented 3 weeks ago

Hi, sorry for the late response. Please try the latest commit; I have some learning results that can be referred to here. I tested many epochs and found that, in my run on a single 4090, the best model comes from iterations 37000-38000. Try it yourself, and if you find any problem please let me know. Regarding the floating base, it should not be so high.

hatimwen commented 3 weeks ago

> Hi, sorry for the late response. Please try the latest commit; I have some learning results that can be referred to here. I tested many epochs and found that, in my run on a single 4090, the best model comes from iterations 37000-38000. Try it yourself, and if you find any problem please let me know. Regarding the floating base, it should not be so high.

Hi @Ericonaldo, thanks a lot for your response. I just checked the wandb logs you provided and noticed a significant variance in the final performance. 😢

It's hard to pinpoint the source of this variance. One possibility is that it arises from the low-level model, where different iterations lead to substantial differences in subsequent high-level training. Another possibility is that the variance is inherent to the high-level training itself. For instance, if we consistently selected one low-level model, such as the one at iteration 37000 which showed the best results in the wandb logs, and trained the high-level model multiple times, what would the variance in the results be? These are just some of my thoughts, and I'm currently re-running the code you provided.
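To be concrete, the variance check I have in mind is something like the sketch below. The --seed and --checkpoint flags are my own assumptions (I haven't verified that train_multistate.py actually exposes them), so treat this only as an illustration:

```python
import subprocess

# Hypothetical path to one fixed low-level checkpoint (e.g. the iteration-37000 model).
LOW_LEVEL_CKPT = "logs/low_level/model_37000.pt"

# Train the high-level policy several times with the same low-level model but
# different seeds, so any spread in the final success rate can only come from
# the high-level training itself.
for seed in (0, 1, 2):
    subprocess.run(
        [
            "python", "train_multistate.py",
            "--task", "B1Z1PickMulti",
            "--timesteps", "60000",
            "--headless",
            "--seed", str(seed),             # assumed flag, not verified
            "--checkpoint", LOW_LEVEL_CKPT,  # assumed flag, not verified
            "--wandb_name", f"teacher_37000_seed{seed}",
        ],
        check=True,
    )
```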

Regarding the floating base setting, my reproduced results are below 0.5%. Is this normal, or could it be due to the variance in high-level training, as mentioned in my second guess?

Btw, how do you judge a good low-level model? Based on mean rewards, different iterations of models seem to perform similarly (around 21-22). From your experience, is the low-level model with the highest mean rewards the one that most benefits further high-level training? And what is the range of mean rewards for the low-level models that you find beneficial in your experiments?

Thanks again!

Ericonaldo commented 3 weeks ago

Hi, basically, we trained a set of low-level policies and deployed them on our real robot to find the best one, and we trained and tuned the high-level model based on that particular low-level policy. That may explain why only a few of them work better. One important note: always take the behavior in the real world as the criterion for choosing a model, as rewards may lie and agents may cheat.

hatimwen commented 3 weeks ago

I see.

I just checked the reproduction results using the latest code. Unfortunately, the results are still poor. I also tried training the high-level policy with different iterations of low-level models, but none of them achieved the impressive reward and success rate plots shown in your logs.

Could you please provide the pre-trained checkpoints for both the low-level and high-level models? Alternatively, could you share the wandb logs for the low-level training phase so I can compare them with mine?

Thanks again.

Ericonaldo commented 3 weeks ago

Hi, see the logs here. I uploaded one low-level model here for your reference.

hatimwen commented 2 weeks ago

Hi,

Using the provided pre-trained low-level checkpoint, I've obtained some results for the high-level model.

This one achieves better performance than the high-level models trained with my own low-level models. However, there is still a gap: none of the categories exceeds a 20% success rate after 60k steps. It seems the training process for the high-level model still has significant variance.

zgdjcls commented 2 weeks ago

Hi, I'm facing the same issue here: after using the provided low-level checkpoints, my high-level success rate couldn't reach 10% after 60k steps. This is my training command:

`python train_multistate.py --rl_device "cuda:7" --sim_device "cuda:7" --timesteps 60000 --headless --task B1Z1PickMulti --experiment_dir logs --wandb --wandb_project "b1-pick-multi-teacher" --wandb_name "teacher_baseline_37000" --roboinfo --observe_gait_commands --small_value_set_zero --rand_control --stop_pick`

I didn't change the code except for changing the type of some index variables, since the interpreter reported them as a bug. Is there anything I need to modify to achieve better performance (such as the reward coefficients or input parameters)? @hatimwen Could you please share how you trained the high-level policy to reach a 20% success rate?

hatimwen commented 2 weeks ago

> Is there anything I need to modify to achieve better performance (such as the reward coefficients or input parameters)? @hatimwen Could you please share how you trained the high-level policy to reach a 20% success rate?

Hi @zgdjcls, I didn't change anything. For convenience, we could discuss it on WeChat. My WeChat ID is wht2020zrj.

zgdjcls commented 2 weeks ago

> Hi @zgdjcls, I didn't change anything. For convenience, we could discuss it on WeChat. My WeChat ID is wht2020zrj.

Just sent a request to you.

hatimwen commented 2 weeks ago

For reference, here's my wandb curve.

[image: wandb success-rate curves]

Ericonaldo commented 2 weeks ago

Hi, I reran the high-level part and the results remain the same. Seems pretty stable to me as I set a fixed seed.

zgdjcls commented 2 weeks ago

> Hi, I reran the high-level part and the results remain the same. Seems pretty stable to me as I set a fixed seed.

Could you please share the weights and log files?

hatimwen commented 2 weeks ago

> Hi, I reran the high-level part and the results remain the same. Seems pretty stable to me as I set a fixed seed.

Hi,

I also tried rerunning the code multiple times, and the results stayed below 20%. I’ve installed the packages using the same versions as yours, according to the provided wandb. I still can’t figure out the reason for the discrepancy. 🤔
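One more thought: even with a fixed seed, I wonder whether the run is fully deterministic across machines. The usual seeding looks something like the sketch below (a generic PyTorch example, not necessarily what this repo does), and even then some CUDA kernels are non-deterministic, so results can drift across different GPU models.

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    # Fix the common RNG sources so a run is repeatable on the same machine.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Optional: trade speed for per-GPU determinism; results may still differ
    # across GPU architectures even with these settings.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```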

Ericonaldo commented 2 weeks ago

> Hi, see the logs here. I uploaded one low-level model here for your reference.

@zgdjcls It's already shared

Ericonaldo commented 2 weeks ago

> Hi,
>
> I also tried rerunning the code multiple times, and the results stayed below 20%. I've installed the packages using the same versions as yours, according to the provided wandb. I still can't figure out the reason for the discrepancy. 🤔

Yeah, it's weird... You mentioned that you use a 3090? My high-level part is also trained on a 3090 server.

hatimwen commented 2 weeks ago

> Yeah, it's weird... You mentioned that you use a 3090? My high-level part is also trained on a 3090 server.

Yes, I use a 3090 server.

zgdjcls commented 2 weeks ago

> @zgdjcls It's already shared

Could you please share the high-level weights? We ran the high-level part and there is a huge difference between your results and ours. We would like to use your high-success-rate teacher model to further train the student model.

sinaqahremani commented 3 days ago

Hi. I faced the same issues mentioned by everyone here. My high-level results have a huge gap compared with what you shared. I am using the shared low-level model without any changes to the code. My GPU is a Quadro RTX 6000. Does the GPU model have a large impact on the results?