URLS provides a set of unsupervised reinforcement learning algorithms and experiments for researching the applicability of unsupervised reinforcement learning to a variety of paradigms.
The codebase is based upon URLB and ExORL; further details are provided in their respective papers.
URLS is intended as a successor to URLB, allowing for a larger number of experiments and RL paradigms.
Install MuJoCo if it is not already installed:
* Download the MuJoCo binaries and unzip them into `~/.mujoco/`.
* Append the path to the MuJoCo libraries to the `LD_LIBRARY_PATH` environment variable.
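For example, assuming the binaries were unzipped into a hypothetical `~/.mujoco/mujoco210` directory (adjust the folder name to your MuJoCo version):

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/.mujoco/mujoco210/bin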
Install the following libraries:
sudo apt update
sudo apt install libosmesa6-dev libgl1-mesa-glx libglfw3 unzip
Install dependencies:
conda env create -f conda_env.yml
conda activate urls-env
We provide the following workflows:
Pre-training: learn from the agent's intrinsic reward on a specific domain
python pretrain.py agent=UNSUPERVISED_AGENT domain=DOMAIN
Fine-tuning: learn with the pre-trained agent on a specific task; the task-specific reward is now used for the agent
python finetune.py pretrained_agent=UNSUPERVISED_AGENT task=TASK snapshot_ts=TS obs_type=OBS_TYPE
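For example, a pre-training plus fine-tuning run might look like the following; the task naming, snapshot timestamp and observation type are illustrative values chosen for the example, not prescribed ones:

python pretrain.py agent=icm domain=walker
python finetune.py pretrained_agent=icm task=walker_run snapshot_ts=100000 obs_type=states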
Pre-training: learn from the agent's intrinsic reward on a specific domain
python pretrain.py agent=UNSUPERVISED_AGENT domain=DOMAIN
Sampling: sample demonstrations from the agent's replay buffer on a specific task
python sampling.py agent=UNSUPERVISED_AGENT task=TASK samples=SAMPLES snapshot_ts=TS obs_type=OBS_TYPE
Offline learning: learn a policy using the offline data collected on the specific task
python train_offline.py agent=OFFLINE_AGENT expl_agent=UNSUPERVISED_AGENT task=TASK
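For example, with similarly illustrative argument values, sampling demonstrations from a pre-trained ProtoRL agent and then learning offline with behavior cloning might look like:

python pretrain.py agent=proto domain=walker
python sampling.py agent=proto task=walker_run samples=10000 snapshot_ts=100000 obs_type=states
python train_offline.py agent=bc expl_agent=proto task=walker_run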
Pre-training: learn from the agent's intrinsic reward on a specific domain
python pretrain.py agent=UNSUPERVISED_AGENT domain=DOMAIN
Sampling: sample demonstrations from the agent's replay buffer, with constraints and images
python sampling.py agent=UNSUPERVISED_AGENT task=TASK samples=SAMPLES snapshot_ts=TS obs_type=OBS_TYPE
Trajectories to images: create an image dataset from the sampled trajectories
python data_to_images.py --env=DOMAIN
Train VAE: train a variational autoencoder on the image dataset
python train_encoder.py --env=DOMAIN
Train MPC: train the LS3 safe model predictive controller on a specific domain
python train_mpc.py --env=DOMAIN
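For example, the safe-MPC workflow on the SimplePointBot domain could be run as follows; the argument values are again illustrative and the exact task and observation-type formats are assumptions:

python pretrain.py agent=icm domain=SimplePointBot
python sampling.py agent=icm task=SimplePointBot_goal samples=10000 snapshot_ts=100000 obs_type=pixels
python data_to_images.py --env=SimplePointBot
python train_encoder.py --env=SimplePointBot
python train_mpc.py --env=SimplePointBot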
The following unsupervised reinforcement learning agents are available. Replace UNSUPERVISED_AGENT with the agent's command from the table below; for example, to use DIAYN, set UNSUPERVISED_AGENT=diayn.
Agent | Command | Type | Implementation Author(s) | Paper | Intrinsic Reward |
---|---|---|---|---|---|
ICM | `icm` | Knowledge | Denis | paper | $\Vert g(\mathbf{z}_{t+1} \mid \mathbf{z}_{t}, \mathbf{a}_{t}) - \mathbf{z}_{t+1} \Vert^{2}$ |
Disagreement | `disagreement` | Knowledge | Catherine | paper | $\mathrm{Var}\{ g_{i}(\mathbf{z}_{t+1} \mid \mathbf{z}_{t}, \mathbf{a}_{t}) \}$ |
RND | `rnd` | Knowledge | Kevin | paper | $\Vert g(\mathbf{z}_{t}, \mathbf{a}_{t}) - \tilde{g}(\mathbf{z}_{t}, \mathbf{a}_{t}) \Vert^{2}_{2}$ |
APT(ICM) | `icm_apt` | Data | Hao, Kimin | paper | $\sum_{j \in \mathrm{random}} \log \Vert \mathbf{z}_{t} - \mathbf{z}_{j} \Vert$ |
APT(Ind) | `ind_apt` | Data | Hao, Kimin | paper | $\sum_{j \in \mathrm{random}} \log \Vert \mathbf{z}_{t} - \mathbf{z}_{j} \Vert$ |
ProtoRL | `proto` | Data | Denis | paper | $\sum_{j \in \mathrm{random}} \log \Vert \mathbf{z}_{t} - \mathbf{z}_{j} \Vert$ |
DIAYN | `diayn` | Competence | Misha | paper | $\log q(\mathbf{w} \mid \mathbf{z}) + \mathrm{const}$ |
APS | `aps` | Competence | Hao, Kimin | paper | $r_{t}^{\mathrm{APT}}(\mathbf{z}) + \log q(\mathbf{z} \mid \mathbf{w})$ |
SMM | `smm` | Competence | Albert | paper | $\log p^{*}(\mathbf{z}) - \log q_{\mathbf{w}}(\mathbf{z}) - \log p(\mathbf{w}) + \log d(\mathbf{w} \mid \mathbf{z})$ |
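To make the particle-based reward shared by APT and ProtoRL concrete, here is a minimal NumPy sketch (not the repository's implementation) of the estimate $\sum_{j \in \mathrm{random}} \log \Vert \mathbf{z}_{t} - \mathbf{z}_{j} \Vert$ computed over a randomly sampled batch of embeddings; the batch size and the stability constant are assumptions made only for this example.

```python
import numpy as np

def particle_entropy_reward(z_t: np.ndarray, z_batch: np.ndarray, eps: float = 1e-6) -> float:
    """APT/ProtoRL-style bonus: sum_j log ||z_t - z_j|| over a random batch of embeddings.

    z_t:     embedding of the current observation, shape (d,)
    z_batch: embeddings randomly sampled from the replay buffer, shape (n, d)
    eps:     small constant for numerical stability (an assumption, not from the repo)
    """
    dists = np.linalg.norm(z_batch - z_t, axis=1)  # ||z_t - z_j|| for every j in the batch
    return float(np.sum(np.log(dists + eps)))      # sum_j log ||z_t - z_j||

# Usage: reward of a 16-dimensional embedding against 128 randomly sampled embeddings.
rng = np.random.default_rng(0)
print(particle_entropy_reward(rng.normal(size=16), rng.normal(size=(128, 16))))
```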
The following five RL procedures are available to learn a policy offline from the unsupervised data. Replace OFFLINE_AGENT with the procedure's command from the table below; for example, to use behavior cloning, set OFFLINE_AGENT=bc.
Offline RL Procedure | Command | Paper |
---|---|---|
Behavior Cloning | `bc` | paper |
CQL | `cql` | paper |
CRR | `crr` | paper |
TD3+BC | `td3_bc` | paper |
TD3 | `td3` | paper |
The following environments with specific domains and tasks are provided. We also provide a wrapper to convert Gym environments to DMC extended time-step types, based on DeepMind's acme wrapper; a minimal sketch of the idea is shown after the table.
Environment Type | Domain | Task |
---|---|---|
DeepMind Control | `walker` | `stand`, `walk`, `run`, `flip` |
DeepMind Control | `quadruped` | `walk`, `run`, `stand`, `jump` |
DeepMind Control | `jaco` | `reach_top_left`, `reach_top_right`, `reach_bottom_left`, `reach_bottom_right` |
DeepMind Control | `cheetah` | `run`, `run_backward` |
Gym Box2D | `BipedalWalker-v3` | `walk` |
Gym Box2D | `CarRacing-v1` | `race` |
Gym Classic Control | `MountainCarContinuous-v0` | `goal` |
Safe Control | `SimplePointBot` | `goal` |
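As a rough illustration of the Gym-to-DMC conversion mentioned before the table, here is a minimal dm_env adapter sketch. It is not the repository's wrapper (which produces the extended time-step types) and it assumes the classic Gym API, where `reset()` returns only an observation and `step()` returns a 4-tuple.

```python
import dm_env
import gym
import numpy as np
from dm_env import specs

class GymToDMEnv(dm_env.Environment):
    """Illustrative Gym -> dm_env adapter (a sketch, not the repository's wrapper)."""

    def __init__(self, env: gym.Env):
        self._env = env

    def reset(self) -> dm_env.TimeStep:
        obs = self._env.reset()  # classic Gym API: reset() returns only the observation
        return dm_env.restart(np.asarray(obs, dtype=np.float32))

    def step(self, action) -> dm_env.TimeStep:
        obs, reward, done, _ = self._env.step(action)  # classic 4-tuple step API
        obs = np.asarray(obs, dtype=np.float32)
        if done:
            return dm_env.termination(reward=reward, observation=obs)
        return dm_env.transition(reward=reward, observation=obs)

    def observation_spec(self):
        space = self._env.observation_space
        return specs.Array(shape=space.shape, dtype=np.float32, name='observation')

    def action_spec(self):
        space = self._env.action_space
        return specs.BoundedArray(shape=space.shape, dtype=np.float32,
                                  minimum=space.low, maximum=space.high, name='action')

# Usage: wrap one of the Box2D domains from the table above.
env = GymToDMEnv(gym.make('BipedalWalker-v3'))
first_step = env.reset()
```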
The majority of URLS, including the ExORL- and URLB-based code, is licensed under the MIT license; however, portions of the project are available under separate license terms: the DeepMind code is licensed under the Apache 2.0 license.