LucasAlegre / morl-baselines

Implementations of Multi-Objective Reinforcement Learning algorithms.
https://lucasalegre.github.io/morl-baselines
MIT License

Performance report issue tracker #43

Open · ffelten opened this issue 1 year ago

This issue exists to coordinate who is running what, and to give a more or less live view of the performance results being uploaded to openrlbenchmark.

See all runs under the openrlbenchmark entity on Weights & Biases.

How to help?

Put your name next to an algo/env combination and post status updates on the runs as you make them.

Run the benchmark script with the following command:

```
python benchmark/launch_experiment.py --algo <ALGO> --env-id <ENV_ID> --num-timesteps 1000000 --gamma 0.99 --ref-point ... --auto-tag True --wandb-entity openrlbenchmark --seed <0 to 9> --init-hyperparams ... --train-hyperparams ...
```
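For concreteness, a filled-in example is sketched below. The env id and reference point are placeholders chosen for illustration (they are not prescribed by this tracker), and the loop covers the requested seeds 0 to 9:

```
# Hypothetical example: deep-sea-treasure-v0 and the ref point are
# illustrative placeholders, not values mandated by this issue.
# Launches one run per seed, 0 through 9.
for seed in {0..9}; do
  python benchmark/launch_experiment.py \
    --algo envelope \
    --env-id deep-sea-treasure-v0 \
    --num-timesteps 1000000 \
    --gamma 0.99 \
    --ref-point 0.0 -50.0 \
    --auto-tag True \
    --wandb-entity openrlbenchmark \
    --seed $seed
done
```

If you cannot write to the openrlbenchmark entity, point `--wandb-entity` at your own.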

Deterministic envs

For all deterministic environments, we push the learning rate to 1.0 and raise the exploration rate, since fast exploration is all that matters in these cases; a sketch of passing such overrides follows below. Our deterministic envs:
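As a hedged sketch of what such an override could look like: the hyperparameter names `learning_rate` and `initial_epsilon` are assumptions about the tabular algorithms' constructors, and the env id and ref point are again placeholders, so verify against the algorithm's signature before copying.

```
# Hypothetical deterministic-env override: force the learning rate to 1.0
# and start exploration high. Hyperparameter names are assumptions, not
# verified against each algorithm's constructor.
python benchmark/launch_experiment.py --algo mpmoql --env-id deep-sea-treasure-v0 \
  --num-timesteps 1000000 --gamma 0.99 --ref-point 0.0 -50.0 \
  --auto-tag True --wandb-entity openrlbenchmark --seed 0 \
  --init-hyperparams "learning_rate:1.0" "initial_epsilon:1.0"
```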

Multi-policy

✅ CAPQL

✅ GPI-LS continuous

`--algo gpi_ls_continuous`

✅ GPI-PD continuous

`--algo gpi_pd_continuous`

✅ GPI-LS discrete

`--algo gpi_ls_discrete`

✅ GPI-PD discrete

`--algo gpi_pd_discrete`

✅ Envelope

`--algo envelope`

✅ PGMORL

`--algo pgmorl`

PCN

`--algo pcn`

✅ PQL (deterministic envs)

`--algo pql`

✅ GPI-LS tabular

`--algo gpi-ls --init-hyperparams "use_gpi_policy:True"`

✅ MPMOQL

`--algo mpmoql`

✅ OLS

`--algo ols --init-hyperparams "weight_selection_algo:'ols'" "epsilon_ols:0.0"`
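The `--init-hyperparams` and `--train-hyperparams` overrides used above are passed as `key:value` strings (e.g. `"epsilon_ols:0.0"`, `"use_gpi_policy:True"`). As a rough illustration of that pattern only, not the actual parsing code in benchmark/launch_experiment.py, such strings could be turned into keyword arguments like this:

```python
# Illustrative parser for "key:value" override strings. This is NOT the
# actual implementation in benchmark/launch_experiment.py; it only shows
# the pattern the CLI examples above rely on.
import ast
from typing import Any, Dict, List


def parse_overrides(pairs: List[str]) -> Dict[str, Any]:
    """Turn ["epsilon_ols:0.0", "use_gpi_policy:True"] into kwargs."""
    kwargs: Dict[str, Any] = {}
    for pair in pairs:
        key, raw = pair.split(":", 1)  # split only on the first colon
        kwargs[key] = ast.literal_eval(raw)  # numbers, bools, quoted strings
    return kwargs


print(parse_overrides(["weight_selection_algo:'ols'", "epsilon_ols:0.0"]))
# {'weight_selection_algo': 'ols', 'epsilon_ols': 0.0}
```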

Single-policy

MOQL

EUPG