HumanCompatibleAI / imitation

Clean PyTorch implementations of imitation and reward learning algorithms
https://imitation.readthedocs.io/
MIT License

PC Benchmarks #832

Open ernestum opened 11 months ago

ernestum commented 11 months ago

This PR contains the changes necessary to run benchmarks for the Preference Learning algorithm. It is also a place for planning and coordination notes on running the benchmarks.

Right now I think this is the best approach:

- Start with slurm-template.sh and fill it in manually. Call the result tune_on_slurm.sh; don't use slurm-launch.py.
- Make the env and the algo parameters, just as in run_benchmark_on_slurm.sh.
- Add a tune_all_on_slurm.sh analogous to run_all_benchmarks_on_slurm.sh.
- Follow this tutorial and this one (note: the way the head node address is determined there does not seem to work!).

ernestum commented 10 months ago

After reading through the paper, I am using the following hyperparameter search space:

| parameter | search space |
|---|---|
| active_selection | True / False |
| active_selection_oversampling | 2 to 10 |
| comparison_queue_size | None, or 1 to total_comparisons |
| exploration_frac | 0.0 to 0.5 |
| fragment_length | 1 to trajectory length |
| gatherer_kwargs.temperature | 0 to 2 |
| gatherer_kwargs.discount_factor | 0.95 to 1 |
| gatherer_kwargs.sample | True / False |
| initial_comparison_frac | 0.01 to 1 |
| num_iterations | 1 to 50 |
| preference_model_kwargs.noise_prob | 0 to 0.1 |
| preference_model_kwargs.discount_factor | 0.95 to 1 |
| query_schedule | 'constant', 'hyperbolic', 'inverse_quadratic' |
| total_comparisons | 1k (750 were enough in the paper) |
| total_timesteps | 1e7, except 1e6 for Pendulum |
| trajectory_generator_kwargs.exploration_frac | 0 to 0.1 |
| trajectory_generator_kwargs.switch_prob | 0.1 to 1 |
| trajectory_generator_kwargs.random_prob | 0.1 to 0.9 |
| transition_oversampling | 0.9 to 2 |
| policy | pick a known good config from the zoo |
| reward | use the reward_ensemble named config when active_selection is true, otherwise the default (note: the default is just 32x32 while the paper uses 64x64 networks) |
| reward_trainer_kwargs.epochs | 1 to 10 |
| rl | pick a known good config from the zoo |
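
For concreteness, here is a minimal sketch of that search space written for Ray Tune, assuming Ray Tune is what the SLURM tuning scripts will drive (the linked Ray tutorials point that way). The keys simply mirror the table above; wiring them into the actual preference-comparisons training config, and the env-dependent bounds (fragment_length, total_timesteps), are assumptions noted in comments rather than settled choices.

```python
# Sketch only: assumes Ray Tune drives the sweep; parameter names mirror the
# table above and still need to be mapped onto the training script's config.
import random

from ray import tune

TOTAL_COMPARISONS = 1_000  # 750 were enough in the paper

search_space = {
    "active_selection": tune.choice([True, False]),
    "active_selection_oversampling": tune.randint(2, 11),  # 2 to 10 inclusive
    # None, or an integer between 1 and total_comparisons
    "comparison_queue_size": tune.sample_from(
        lambda spec: None
        if random.random() < 0.5
        else random.randint(1, TOTAL_COMPARISONS)
    ),
    "exploration_frac": tune.uniform(0.0, 0.5),
    # Upper bound is the episode length of the env (e.g. 200 for Pendulum).
    "fragment_length": tune.randint(1, 200 + 1),
    "gatherer_kwargs": {
        "temperature": tune.uniform(0.0, 2.0),
        "discount_factor": tune.uniform(0.95, 1.0),
        "sample": tune.choice([True, False]),
    },
    "initial_comparison_frac": tune.uniform(0.01, 1.0),
    "num_iterations": tune.randint(1, 51),
    "preference_model_kwargs": {
        "noise_prob": tune.uniform(0.0, 0.1),
        "discount_factor": tune.uniform(0.95, 1.0),
    },
    "query_schedule": tune.choice(["constant", "hyperbolic", "inverse_quadratic"]),
    "total_comparisons": TOTAL_COMPARISONS,
    # Fixed per env rather than tuned: 1e7 steps, 1e6 for Pendulum.
    "total_timesteps": int(1e7),
    "trajectory_generator_kwargs": {
        "exploration_frac": tune.uniform(0.0, 0.1),
        "switch_prob": tune.uniform(0.1, 1.0),
        "random_prob": tune.uniform(0.1, 0.9),
    },
    "transition_oversampling": tune.uniform(0.9, 2.0),
    "reward_trainer_kwargs": {"epochs": tune.randint(1, 11)},
    # policy / rl: taken from a known good zoo config, not tuned here.
    # reward: reward_ensemble named config when active_selection is True,
    # otherwise the default (32x32; the paper used 64x64 networks).
}
```

A `Tuner(..., param_space=search_space)` would then sample trial configs from this dict.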

I am considering fixing active_selection=True and always using the reward ensemble, because that combination turned out best in the paper.
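
If we decide to do that, the sketch above would just pin those choices instead of searching over them, roughly:

```python
# Pin the choices that worked best in the paper instead of searching over them.
search_space["active_selection"] = True
# The reward net would then always be the reward_ensemble named config rather
# than the 32x32 default.
```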