This is the repository of the winning solution to the MyoChallenge 2023 (object manipulation track) by team Lattice. Our team is named after Lattice, the exploration method for high-dimensional environments introduced in our NeurIPS 2023 paper Latent Exploration for Reinforcement Learning.
Our team comprised:
Alberto Chiappa and Alexander Mathis were also part of the winning team of the MyoChallenge 2022. That work is already published; follow the link to this project to find the articles. Our 2023 winning solution is not yet published (stay tuned for more).
Here is a sample of what our policy can do:
We outperformed the other top solutions in both score and effort:
We strongly recommend using Docker for maximum reproducibility of the results. We provide the utility scripts `docker/build.sh` and `docker/test.sh` to create a Docker image including all the necessary libraries and training/evaluation scripts. Simply run the script `docker/build.sh` to create a Docker image called `myochallengeeval_mani_agent`. The image is fairly large (~20 GB) because it is built on top of an image provided by Nvidia to run the IsaacGym library.
Once the image is created, run the script `docker/test.sh` to execute the script `src/test_submission.py` inside a container created from the image `myochallengeeval_mani_agent:latest`. The script `src/test_submission.py` runs 1000 test episodes in the environment `myoChallengeRelocateP2-v0` with `seed=0`. The performance should exactly match the one we obtained, namely 0.817 (817 episodes solved out of 1000).
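For reference, the evaluation protocol looks roughly like the sketch below; it assumes the old Gym API used by MyoSuite at the time of the challenge and a stable-baselines3-style recurrent policy, and the `solved` info key is an assumption (the exact code is in `src/test_submission.py`).

```python
# Minimal sketch of the evaluation protocol, not the exact code of src/test_submission.py.
# Assumes the old Gym API (4-tuple step) used by MyoSuite during the challenge and a
# stable-baselines3-style recurrent policy with predict(state=..., episode_start=...).
import gym
import numpy as np
import myosuite  # noqa: F401  -- registers the MyoChallenge environments

def evaluate(model, num_episodes=1000, seed=0):
    env = gym.make("myoChallengeRelocateP2-v0")
    env.seed(seed)
    solved = 0
    for _ in range(num_episodes):
        obs = env.reset()
        lstm_states = None
        episode_start = np.ones((1,), dtype=bool)
        done = False
        info = {}
        while not done:
            action, lstm_states = model.predict(
                obs, state=lstm_states, episode_start=episode_start, deterministic=True
            )
            obs, reward, done, info = env.step(action)
            episode_start = np.zeros((1,), dtype=bool)
        solved += int(info.get("solved", False))  # "solved" key is an assumption
    return solved / num_episodes  # should be ~0.817 for curriculum step 10 with seed 0
```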
By default, the script `src/test_submission.py` tests the last step of the curriculum (from the folder `output/trained_agents/curriculum_step_10`). To test a different pretrained agent, please change the value of the variables `EXPERIMENT_PATH` and `CHECKPOINT_NUM` in the script `src/test_submission.py`. Make sure the checkpoint number corresponds to that of the curriculum step you want to load. Only curriculum steps 8, 9 and 10 have been trained on the full Relocate task, so we expect the earlier checkpoints to perform poorly in the full environment.
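For example, to evaluate curriculum step 9 instead, point the two variables at the corresponding folder (the checkpoint number below is a placeholder; use the number of the checkpoint file actually saved in that folder):

```python
# In src/test_submission.py: select a different curriculum step to evaluate.
EXPERIMENT_PATH = "output/trained_agents/curriculum_step_9"
CHECKPOINT_NUM = 1_000_000  # placeholder; must match the checkpoint saved in that folder
```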
The script `docker/train.sh` can be used to run a training experiment. We set it up so that training starts from checkpoint 9 of the training curriculum. In its current state, the training will not reproduce the experiment which led to checkpoint 10, as the script does not load the arguments from `output/trained_agents/curriculum_step_10/args.json`. In fact, for the challenge we used a cluster which accepts parameters in a different format, and we did not adapt this part of the code to run in the Docker container of the submission. Furthermore, for the challenge trainings we used 250 parallel environments, which requires substantial RAM, unlikely to be available on a standard workstation.
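For scale, a vectorized setup with many parallel environment copies looks roughly like the sketch below (the `SubprocVecEnv` wrapper and the worker count are illustrative; memory usage grows roughly linearly with the number of copies):

```python
# Sketch: parallel environment copies for on-policy training. We used 250 workers on the
# cluster; on a workstation you will likely need far fewer to fit in RAM.
import gym
import myosuite  # noqa: F401  -- registers the MyoChallenge environments
from stable_baselines3.common.vec_env import SubprocVecEnv

def make_env():
    return gym.make("myoChallengeRelocateP2-v0")

if __name__ == "__main__":
    num_envs = 250  # reduce to fit the memory of your machine
    vec_env = SubprocVecEnv([make_env for _ in range(num_envs)])
    obs = vec_env.reset()
    print(obs.shape)  # (num_envs, observation_dim)
    vec_env.close()
```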
The first key component in our model is the recurrent unit in both the actor and critic networks of our on-policy algorithm. The first layer of each network is an LSTM, which was crucial to deal with the partial observability of the environment. Especially in Phase II, the shape, mass and friction of the object change every episode and are not part of the observation. Recurrence enables the policy to store memory of past observations, which can be aggregated to infer such unobservable quantities.
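As a minimal sketch of this architecture, here is how a recurrent PPO agent with an LSTM first layer could be instantiated with `sb3_contrib` (the hyperparameters are illustrative; the ones we actually used are stored in the `model_config.json` of each curriculum step):

```python
# Sketch: recurrent actor-critic (LSTM first layer for both actor and critic) trained with
# on-policy PPO. Hyperparameters are illustrative, not the ones used for the challenge.
import gym
import myosuite  # noqa: F401  -- registers the MyoChallenge environments
from sb3_contrib import RecurrentPPO

env = gym.make("myoChallengeRelocateP2-v0")
model = RecurrentPPO(
    "MlpLstmPolicy",
    env,
    policy_kwargs=dict(lstm_hidden_size=256),  # LSTM processes observations before the MLP
    n_steps=128,
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
```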
The second key component we used was Lattice, an exploration strategy recently developed by our team. By injecting noise in the latent space, Lattice can encourage correlations across actuators that are beneficial for task performance and energy efficiency, especially for high-dimensional musculoskeletal models with redundant actuators. Given the complexity of the task, Lattice allowed us to efficiently explore the state space. Since we found little evidence that state-dependent perturbations are beneficial with PPO, we used an unpublished version of Lattice which does not make the noise dependent on the current state. However, it still uses the linear transformation implemented by the last layer of the policy network to induce correlated noise across muscles.
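Conceptually, the state-independent variant samples Gaussian noise in the latent space feeding the last linear layer of the policy, so the resulting action noise is correlated through that layer's weight matrix. A toy illustration of the idea (not the actual Lattice implementation):

```python
# Toy illustration of latent-space exploration: independent noise on the latent features
# becomes correlated noise across actuators after the final linear layer. This is not the
# actual Lattice code, just the underlying idea.
import torch
import torch.nn as nn

latent_dim, action_dim = 256, 63  # illustrative sizes for a high-dimensional muscle model
last_layer = nn.Linear(latent_dim, action_dim)

def sample_action(latent, latent_std=0.5):
    eps = torch.randn_like(latent) * latent_std  # isotropic noise in latent space
    return last_layer(latent + eps)              # action noise is W @ eps: correlated

latent = torch.randn(1, latent_dim)
action = sample_action(latent)
# The covariance of the action perturbation is latent_std**2 * W @ W.T,
# i.e. correlated across muscles through the policy's own output layer.
```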
Third, we used a training curriculum that gradually increased the difficulty of the task. For both phase 1 and phase 2, we used the same curriculum steps:
Directly transferring the policy of phase 1 to phase 2 was not possible due to the introduction of complex objects and targets. Therefore, we repeated the same curriculum steps with longer training for phase 2, but encouraged more diverse and efficient exploration by using Lattice. We include all the details about the hyperparameters used for the different steps of the curriculum in the files `output/trained_agents/curriculum_step_<step_number>/args.json`. The environment configuration and the model configuration are also stored separately in `output/trained_agents/curriculum_step_<step_number>/env_config.json` and `output/trained_agents/curriculum_step_<step_number>/model_config.json`, respectively. Please note that throughout the curriculum we also made small modifications to the environment, which break the compatibility of the solutions up to step 6 with the final environment of phase 2. To allow the reader to evaluate steps 1-6 in the environments where they were trained, and potentially reproduce all the training steps, we include the version of `relocate.py` and `main_challenge_manipulation_phase2.py` used for training in the corresponding folder.
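To inspect or reuse the configuration of a given curriculum step, something like the following sketch works (the folder layout is the one described above; the helper function name is ours and the JSON keys differ between steps):

```python
# Sketch: load the saved configuration files of a curriculum step. The folder layout
# matches output/trained_agents/curriculum_step_<step_number>/; keys vary between steps.
import json
from pathlib import Path

def load_step_config(step_number, root="output/trained_agents"):
    folder = Path(root) / f"curriculum_step_{step_number}"
    return {
        name: json.loads((folder / f"{name}.json").read_text())
        for name in ("args", "env_config", "model_config")
    }

configs = load_step_config(10)
print(sorted(configs["model_config"].keys()))
```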
The final insight we tried to incorporate was enlarging the hyperparameter space to obtain a more robust policy. We observed that the policy had almost converged but was still struggling with objects at the extremes of the range (e.g., small objects). To this end, we made the task harder by increasing the ranges of the object shape, friction and mass. Since part of the reward still consisted in grasping the object and bringing it on top of the box, the policy could keep maximizing task performance while learning to grasp objects at the "extremes" of the hyperparameter space. Furthermore, we hypothesized that the submissions would be tested on out-of-distribution objects and targets. Indeed, while our best-performing policies obtained scores above 80% in our local tests, they scored just above 30% upon submission.
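As an illustration of the range widening, the idea amounts to sampling object properties from enlarged intervals each episode (the keys and numbers below are placeholders; the actual ranges are in the `env_config.json` of the later curriculum steps):

```python
# Illustration of widening the object randomization ranges; keys and values are
# placeholders, not the ranges used for the challenge.
import random

widened_ranges = {
    "obj_size_scale": (0.7, 1.3),   # placeholder interval
    "obj_mass_scale": (0.5, 1.5),   # placeholder interval
    "obj_friction":   (0.4, 1.2),   # placeholder interval
}

def sample_object_properties(ranges):
    # Draw one value per property at the start of each episode.
    return {key: random.uniform(low, high) for key, (low, high) in ranges.items()}

print(sample_object_properties(widened_ranges))
```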
For the very final submission that scored 0.343, we used our final robust policy that can be found here.
Further details about the curriculum steps and architecture can be found in the appendix.
We observed that unnecessary movement of the agent took place in the following cases:
To reduce the global effort of the policy, we applied the following post-training modifications to the policy:
If you want to read more about our solution, check out our NeurIPS work!
If you use our code or ideas, please cite:
@article{chiappa2023latent,
title={Latent exploration for reinforcement learning},
author={Chiappa, Alberto Silvio and Vargas, Alessandro Marin and Huang, Ann Zixiang and Mathis, Alexander},
journal={arXiv preprint arXiv:2305.20065},
year={2023}
}