dimitri-rusin opened this issue 6 months ago
Maybe switch from reporting the variance to a 95% confidence interval.
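If we do, a minimal sketch of the computation (the helper name and the normal approximation are my assumptions, not existing plotting code):

```python
import numpy as np

def mean_and_ci95(values):
    """Return the mean and a 95% confidence interval (normal approximation)."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    sem = values.std(ddof=1) / np.sqrt(len(values))  # standard error of the mean
    half_width = 1.96 * sem                          # z-score for 95% coverage
    return mean, (mean - half_width, mean + half_width)
```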
Planned experiment grid:

- Algorithms: tabular Q-learning, PPO (with continuous lambda; let Nguyen know the PPO setting before running)
- Dimensions: 50, 500, 1,000
- Timesteps: 1,000,000
- Gamma: 0.9900, 0.9950, 0.9998
- Starting point: uniformly sampled solution, or a good solution (with Rust evaluation when starting from the good solution)
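For bookkeeping, the full grid can be enumerated like this (a sketch; the names are placeholders, not our actual run script):

```python
import itertools

# Placeholder names for the experiment grid described above.
algorithms = ["tabular_q_learning", "ppo_continuous_lambda"]
dimensions = [50, 500, 1000]
gammas = [0.9900, 0.9950, 0.9998]
starts = ["uniform_random", "good_solution"]
timesteps = 1_000_000

for algo, dim, gamma, start in itertools.product(algorithms, dimensions, gammas, starts):
    print(f"{algo}: dimension={dim}, gamma={gamma}, start={start}, timesteps={timesteps:,}")
```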
Hypotheses:
Step 1: Q-learning on different dimensions
Step 2: Q-learning starting from a good solution
Step 3: Q-learning on different gamma values
Step 4: PPO on different dimensions
Step 5: PPO starting from a good solution
Step 6: PPO on different gamma values
Add a description of the experiment to the Flask visualization.
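A minimal sketch of how that could look (the route, the app object, and the description text are assumptions, not the existing visualization code):

```python
from flask import Flask, render_template_string

app = Flask(__name__)

# Assumed placeholder text; the real description would summarize the grid above.
EXPERIMENT_DESCRIPTION = """\
Tabular Q-learning and PPO (continuous lambda) on dimensions 50/500/1,000,
1,000,000 timesteps, gamma in {0.9900, 0.9950, 0.9998}, starting from a
uniformly sampled solution or a good solution.
"""

@app.route("/experiment")
def experiment_description():
    # Serve the description as preformatted text next to the plots.
    return render_template_string("<pre>{{ desc }}</pre>", desc=EXPERIMENT_DESCRIPTION)
```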
We have some results for PPO on an instance of dimensionality 80. It looks like PPO does not improve over time, but it starts off pretty close to the theory-derived policy.
All of the policies I checked from roughly 40,000 timesteps onward assign lambda = 1 to every fitness value.
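For reference, this is roughly how the learned policy can be probed (a sketch; the one-dimensional observation layout and the `model`/`dimension` names are assumptions about our setup):

```python
import numpy as np

# Query the trained PPO policy deterministically at each fitness value
# to see which lambda it assigns.
dimension = 80  # instance size in the run discussed above
for fitness in range(dimension + 1):
    obs = np.array([fitness], dtype=np.float32)  # assumed observation: current fitness
    action, _state = model.predict(obs, deterministic=True)
    print(f"fitness={fitness}: lambda={action}")
```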
This run used the following configuration (`vec_env` is the vectorized environment built elsewhere):

```python
import torch
from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy",
    vec_env,  # vectorized environment, constructed elsewhere
    policy_kwargs={"net_arch": [256, 256], "activation_fn": torch.nn.ReLU},
    learning_rate=0.0003,  # Adam step size
    n_steps=2048,          # rollout length per environment before each update
    batch_size=64,         # minibatch size for each gradient step
    n_epochs=10,           # optimization passes over each rollout buffer
    gamma=0.99,            # discount factor
    gae_lambda=0.95,       # GAE smoothing parameter
    vf_coef=0.5,           # value-function loss weight
    ent_coef=0.0,          # no entropy bonus
    clip_range=0.2,        # PPO clipping parameter
    verbose=1,
)
```
Nguyen, what are good configuration values to try instead of the above?