dimitri-rusin opened this issue 6 months ago
Maybe switch from reporting the variance to a 95% confidence interval.
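If we do, a minimal sketch of the computation (the helper name and the normal approximation are my assumptions, not existing plotting code):

```python
import numpy as np

def mean_and_ci95(values):
    """Return the mean and a 95% confidence interval (normal approximation)."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    sem = values.std(ddof=1) / np.sqrt(len(values))  # standard error of the mean
    half_width = 1.96 * sem                          # z-score for 95% coverage
    return mean, (mean - half_width, mean + half_width)
```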
Planned experiment grid:

- Algorithms: tabular Q-learning, PPO (with continuous lambda; let Nguyen know the PPO setting before running)
- Dimensions: 50, 500, 1,000
- Timesteps: 1,000,000
- Gamma: 0.9900, 0.9950, 0.9998
- Starting point: uniformly sampled solution, or a good solution (with Rust evaluation when starting from the good solution)
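For bookkeeping, the full grid can be enumerated like this (a sketch; the names are placeholders, not our actual run script):

```python
import itertools

# Placeholder names for the experiment grid described above.
algorithms = ["tabular_q_learning", "ppo_continuous_lambda"]
dimensions = [50, 500, 1000]
gammas = [0.9900, 0.9950, 0.9998]
starts = ["uniform_random", "good_solution"]
timesteps = 1_000_000

for algo, dim, gamma, start in itertools.product(algorithms, dimensions, gammas, starts):
    print(f"{algo}: dimension={dim}, gamma={gamma}, start={start}, timesteps={timesteps:,}")
```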
Hypotheses:
Step 1: Q-learning on different dimensions
Step 2: Q-learning starting from a good solution
Step 3: Q-learning on different gamma values
Step 4: PPO on different dimensions
Step 5: PPO starting from a good solution
Step 6: PPO on different gamma values
Add a description of the experiment to the Flask visualization.
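A minimal sketch of how that could look (the route, the app object, and the description text are assumptions, not the existing visualization code):

```python
from flask import Flask, render_template_string

app = Flask(__name__)

# Assumed placeholder text; the real description would summarize the grid above.
EXPERIMENT_DESCRIPTION = """\
Tabular Q-learning and PPO (continuous lambda) on dimensions 50/500/1,000,
1,000,000 timesteps, gamma in {0.9900, 0.9950, 0.9998}, starting from a
uniformly sampled solution or a good solution.
"""

@app.route("/experiment")
def experiment_description():
    # Serve the description as preformatted text next to the plots.
    return render_template_string("<pre>{{ desc }}</pre>", desc=EXPERIMENT_DESCRIPTION)
```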
We have some results for PPO on an instance of dimensionality 80. It looks like PPO does not improve over time, but it starts off pretty close to the theory-derived policy.
All of the policies I checked from roughly 40,000 timesteps onward assign lambda = 1 to every fitness value.
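For reference, this is roughly how the learned policy can be probed (a sketch; the one-dimensional observation layout and the `model`/`dimension` names are assumptions about our setup):

```python
import numpy as np

# Query the trained PPO policy deterministically at each fitness value
# to see which lambda it assigns.
dimension = 80  # instance size in the run discussed above
for fitness in range(dimension + 1):
    obs = np.array([fitness], dtype=np.float32)  # assumed observation: current fitness
    action, _state = model.predict(obs, deterministic=True)
    print(f"fitness={fitness}: lambda={action}")
```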
This run used the following configuration (`vec_env` is the vectorized environment built elsewhere):

```python
import torch
from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy",
    vec_env,  # vectorized environment, constructed elsewhere
    policy_kwargs={"net_arch": [256, 256], "activation_fn": torch.nn.ReLU},
    learning_rate=0.0003,  # Adam step size
    n_steps=2048,          # rollout length per environment before each update
    batch_size=64,         # minibatch size for each gradient step
    n_epochs=10,           # optimization passes over each rollout buffer
    gamma=0.99,            # discount factor
    gae_lambda=0.95,       # GAE smoothing parameter
    vf_coef=0.5,           # value-function loss weight
    ent_coef=0.0,          # no entropy bonus
    clip_range=0.2,        # PPO clipping parameter
    verbose=1,
)
```
Nguyen, what are good configuration values to try instead of the above?