HumanCompatibleAI / imitation

Clean PyTorch implementations of imitation and reward learning algorithms
https://imitation.readthedocs.io/
MIT License

PC Benchmarks #832

Open ernestum opened 11 months ago

ernestum commented 11 months ago

This PR contains the changes necessary to run benchmarks for the Preference Learning algorithm. It is also a place for planning and coordination notes on running the benchmarks.

Right now I think this is the best approach:

- Start with slurm-template.sh and fill it in manually. Call the result tune_on_slurm.sh; don't use slurm-launch.py.
- Make the env and the algo parameters, just as in run_benchmark_on_slurm.sh.
- Add a tune_all_on_slurm.sh analogous to run_all_benchmarks_on_slurm.sh.
- Follow this tutorial and this one (note: the way the head node address is determined there does not seem to work!).

ernestum commented 10 months ago

After reading through the paper, I am using the following hyperparameter search space:

| parameter | search space |
|---|---|
| active_selection | True / False |
| active_selection_oversampling | 2 to 10 |
| comparison_queue_size | None, or 1 to total_comparisons |
| exploration_frac | 0.0 to 0.5 |
| fragment_length | 1 to trajectory length |
| gatherer_kwargs.temperature | 0 to 2 |
| gatherer_kwargs.discount_factor | 0.95 to 1 |
| gatherer_kwargs.sample | True / False |
| initial_comparison_frac | 0.01 to 1 |
| num_iterations | 1 to 50 |
| preference_model_kwargs.noise_prob | 0 to 0.1 |
| preference_model_kwargs.discount_factor | 0.95 to 1 |
| query_schedule | 'constant', 'hyperbolic', 'inverse_quadratic' |
| total_comparisons | 1k (750 were enough in the paper) |
| total_timesteps | 1e7, except 1e6 for Pendulum |
| trajectory_generator_kwargs.exploration_frac | 0 to 0.1 |
| trajectory_generator_kwargs.switch_prob | 0.1 to 1 |
| trajectory_generator_kwargs.random_prob | 0.1 to 0.9 |
| transition_oversampling | 0.9 to 2 |
| policy | pick a known good config from the zoo |
| reward | use the reward_ensemble named config when active_selection is true, otherwise the default (note: the default is just 32x32 while the paper uses 64x64 networks) |
| reward_trainer_kwargs.epochs | 1 to 10 |
| rl | pick a known good config from the zoo |
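
For concreteness, here is a minimal sketch of that search space written for Ray Tune, assuming Ray Tune is what the SLURM tuning scripts will drive (the linked Ray tutorials point that way). The keys simply mirror the table above; wiring them into the actual preference-comparisons training config, and the env-dependent bounds (fragment_length, total_timesteps), are assumptions noted in comments rather than settled choices.

```python
# Sketch only: assumes Ray Tune drives the sweep; parameter names mirror the
# table above and still need to be mapped onto the training script's config.
import random

from ray import tune

TOTAL_COMPARISONS = 1_000  # 750 were enough in the paper

search_space = {
    "active_selection": tune.choice([True, False]),
    "active_selection_oversampling": tune.randint(2, 11),  # 2 to 10 inclusive
    # None, or an integer between 1 and total_comparisons
    "comparison_queue_size": tune.sample_from(
        lambda spec: None
        if random.random() < 0.5
        else random.randint(1, TOTAL_COMPARISONS)
    ),
    "exploration_frac": tune.uniform(0.0, 0.5),
    # Upper bound is the episode length of the env (e.g. 200 for Pendulum).
    "fragment_length": tune.randint(1, 200 + 1),
    "gatherer_kwargs": {
        "temperature": tune.uniform(0.0, 2.0),
        "discount_factor": tune.uniform(0.95, 1.0),
        "sample": tune.choice([True, False]),
    },
    "initial_comparison_frac": tune.uniform(0.01, 1.0),
    "num_iterations": tune.randint(1, 51),
    "preference_model_kwargs": {
        "noise_prob": tune.uniform(0.0, 0.1),
        "discount_factor": tune.uniform(0.95, 1.0),
    },
    "query_schedule": tune.choice(["constant", "hyperbolic", "inverse_quadratic"]),
    "total_comparisons": TOTAL_COMPARISONS,
    # Fixed per env rather than tuned: 1e7 steps, 1e6 for Pendulum.
    "total_timesteps": int(1e7),
    "trajectory_generator_kwargs": {
        "exploration_frac": tune.uniform(0.0, 0.1),
        "switch_prob": tune.uniform(0.1, 1.0),
        "random_prob": tune.uniform(0.1, 0.9),
    },
    "transition_oversampling": tune.uniform(0.9, 2.0),
    "reward_trainer_kwargs": {"epochs": tune.randint(1, 11)},
    # policy / rl: taken from a known good zoo config, not tuned here.
    # reward: reward_ensemble named config when active_selection is True,
    # otherwise the default (32x32; the paper used 64x64 networks).
}
```

A `Tuner(..., param_space=search_space)` would then sample trial configs from this dict.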

I am considering fixing active_selection=True and always using the reward ensemble, because that combination turned out best in the paper.
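
If we decide to do that, the sketch above would just pin those choices instead of searching over them, roughly:

```python
# Pin the choices that worked best in the paper instead of searching over them.
search_space["active_selection"] = True
# The reward net would then always be the reward_ensemble named config rather
# than the 32x32 default.
```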