dimitri-rusin / oll_onemax


Direction agreed upon in a meeting #8

Open dimitri-rusin opened 6 months ago

dimitri-rusin commented 6 months ago

Maybe switch from variance to a 95% confidence interval.
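
For concreteness, a normal-approximation 95% confidence interval over repeated runs could replace the variance like this (the per-run values here are made up for illustration):

  import numpy as np

  # Hypothetical per-run metric, e.g. evaluations-to-optimum over repeated runs.
  runs = np.array([1520, 1498, 1611, 1575, 1543], dtype=float)

  mean = runs.mean()
  # Normal-approximation 95% confidence interval: mean +/- 1.96 * standard error.
  half_width = 1.96 * runs.std(ddof=1) / np.sqrt(len(runs))
  print(f"{mean:.1f} +/- {half_width:.1f}")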

dimitri-rusin commented 6 months ago

  • algorithms: tabular Q-learning, PPO (with continuous lambda; let Nguyen know the PPO setting before running)
  • dimensions: 50, 500, 1,000
  • timesteps: 1,000,000
  • gamma: 0.9900, 0.9950, 0.9998
  • starting point: uniformly sampled solution, good solution (with Rust evaluating the runs that start from a good solution)
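
A minimal sketch of enumerating this grid; run_experiment and the setting names are hypothetical stand-ins for the real entry point in this repository:

  import itertools

  # Hypothetical stand-in for the real experiment entry point.
  def run_experiment(**setting):
      print(setting)

  # Agreed grid from the meeting notes above.
  algorithms = ["tabular_q_learning", "ppo_continuous_lambda"]
  dimensions = [50, 500, 1000]
  gammas = [0.9900, 0.9950, 0.9998]
  initializations = ["uniform_random", "good_solution"]

  for algorithm, dimension, gamma, initialization in itertools.product(
      algorithms, dimensions, gammas, initializations
  ):
      run_experiment(
          algorithm=algorithm,
          dimension=dimension,
          gamma=gamma,
          initialization=initialization,
          timesteps=1_000_000,
      )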

Hypotheses:

Step 1: Q-learning on different dimensions
Step 2: Q-learning starting from a good solution
Step 3: Q-learning with different gamma
Step 4: PPO on different dimensions
Step 5: PPO starting from a good solution
Step 6: PPO with different gamma

Add a description of the experiment to the Flask visualization.

dimitri-rusin commented 6 months ago
  1. Change 'evaluation_interval' to be measured in timesteps.
  2. PPO from stable-baselines3 [comparing gamma 0.99 vs. 0.9998]
  3. Jupyter Notebook (plot the databases) + pandas, as sketched below
    • each [see multiple plots at the same time, based on our choices]
    • across [some metric across all settings]
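
A rough sketch of the notebook idea, assuming each run writes its metrics to an SQLite database; the file names, table name, and column names below are assumptions:

  import sqlite3
  import numpy as np
  import pandas as pd
  import matplotlib.pyplot as plt

  # Hypothetical layout: one SQLite file per setting.
  settings = ["ppo_dim50_gamma0.9900", "ppo_dim500_gamma0.9998"]

  frames = []
  for name in settings:
      with sqlite3.connect(f"{name}.db") as connection:
          frame = pd.read_sql_query("SELECT * FROM evaluations", connection)
      frame["setting"] = name
      frames.append(frame)
  results = pd.concat(frames, ignore_index=True)

  # "each": one curve per setting, selected interactively in the notebook.
  for name, group in results.groupby("setting"):
      plt.plot(group["timestep"], group["mean_fitness_evaluations"], label=name)
  plt.legend()
  plt.show()

  # "across": one metric aggregated across all settings.
  print(results.groupby("setting")["mean_fitness_evaluations"].min())
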
dimitri-rusin commented 5 months ago

We have some results for PPO on an instance with dimension 80. It looks like PPO does not improve over time, but it starts off pretty close to the theory-derived policy.

(screenshot: 22 03 2024_11 36 52_REC)
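
If the theory-derived policy meant here is the usual fitness-dependent choice lambda(x) = sqrt(n / (n - f(x))) for the (1+(lambda,lambda)) GA on OneMax, it can be tabulated for n = 80 like this (the capping at the optimum is an assumption for illustration):

  import math

  n = 80  # dimensionality of the OneMax instance

  def theory_lambda(fitness, n):
      # Fitness-dependent parameter lambda = sqrt(n / (n - f(x))),
      # capped once the optimum is reached (assumed handling).
      if fitness >= n:
          return float(n)
      return math.sqrt(n / (n - fitness))

  for fitness in range(0, n + 1, 10):
      print(fitness, round(theory_lambda(fitness, n), 2))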

dimitri-rusin commented 5 months ago

All of the policies I checked from about 40,000 timesteps onwards assign lambda = 1 to every fitness.

(screenshot: 22 03 2024_11 40 04_REC)
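
A minimal sketch of how such a policy table can be read off a saved stable-baselines3 checkpoint, assuming the observation is the current fitness and the action encodes lambda; the checkpoint path and the observation/action encoding are assumptions:

  import numpy as np
  from stable_baselines3 import PPO

  n = 80
  # Hypothetical checkpoint path; the real runs are stored elsewhere in this repository.
  model = PPO.load("checkpoints/ppo_40000_timesteps.zip")

  for fitness in range(n + 1):
      observation = np.array([fitness], dtype=np.float32)  # assumed observation encoding
      action, _ = model.predict(observation, deterministic=True)
      print(fitness, action)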

dimitri-rusin commented 5 months ago

This was run with the following configuration:

  # Requires: pip install stable-baselines3 torch
  import torch
  from stable_baselines3 import PPO

  # vec_env is the vectorized OneMax environment built elsewhere in this repository.
  model = PPO(
    "MlpPolicy",
    vec_env,
    policy_kwargs={'net_arch': [256, 256], 'activation_fn': torch.nn.ReLU},
    learning_rate=0.0003,
    n_steps=2048,    # rollout length per environment before each update
    batch_size=64,
    n_epochs=10,
    gamma=0.99,      # discount factor (0.9998 is the other agreed value)
    gae_lambda=0.95,
    vf_coef=0.5,
    ent_coef=0.0,    # no entropy bonus
    clip_range=0.2,
    verbose=1,
  )

Nguyen, what are good configuration values to try instead of the above?