[metalearn] neurips bbo challenge idea dump

cosmicBboy commented 4 years ago

Noting these down for the neurips bbo challenge

idea 1: generate more suggestions than n_suggestions and only send the top n_suggestions, ranked by value.
idea 2: generate n_suggestions and set the reward for the lowest m suggestions to -1 (this should happen in the observe step)
idea 3: combine ideas 1 and 2: generate a factor of x more suggestions than n_suggestions and send the top n_suggestions, ranked by value. Use all of the suggestions to update the controller, setting the rewards for the suggestions that didn't make the cut to -1 (this should happen in the suggest step)
[X] idea 4: in the meta-ml package, use an nn.ModuleDict to name micro actions by algorithm/hyperparameter name. This makes it possible to add arbitrary new hyperparameters while preserving the weights of the old hyperparameters #24.
~~idea 5: reward function engineering: keep track of the running min and max reward over the entire run, normalizing the reward for each batch to be between -1 (min) and 1 (max)~~
[X] idea 6: use a continuous action space for selecting real-valued hyperparameters (#23) within the bounds specified by api_config:
- https://medium.com/@asteinbach/actor-critic-using-deep-rl-continuous-mountain-car-in-tensorflow-4c1fb2110f7c
- use the normal distribution: https://pytorch.org/docs/stable/distributions.html#normal
idea 7: implement trust region policy optimization (TRPO):
- https://arxiv.org/pdf/1502.05477.pdf
- code: https://github.com/ikostrikov/pytorch-trpo
idea 8: implement proximal policy optimization (PPO) #25:
- https://arxiv.org/abs/1707.06347
- code: https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail
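
A rough sketch of ideas 1 to 3, assuming the controller can both propose candidates and score them. `oversample_and_rank`, `propose`, `value_fn`, and `toy_value` are made-up names for illustration, not part of the metalearn package or the bbo challenge API; the -1 penalty is the one proposed above.

```python
import random
from typing import Callable, Dict, List, Tuple

Candidate = Dict[str, float]  # hypothetical: one hyperparameter configuration


def oversample_and_rank(
    propose: Callable[[], Candidate],
    value_fn: Callable[[Candidate], float],
    n_suggestions: int,
    factor: int = 3,
    penalty: float = -1.0,
) -> Tuple[List[Candidate], List[Tuple[Candidate, float]]]:
    """Generate factor * n_suggestions candidates and keep the top n_suggestions.

    Returns the candidates to send out, plus (candidate, reward) pairs for the
    rejected ones so the controller can still be updated on them.
    """
    candidates = [propose() for _ in range(factor * n_suggestions)]
    ranked = sorted(candidates, key=value_fn, reverse=True)
    selected = ranked[:n_suggestions]
    rejected = [(c, penalty) for c in ranked[n_suggestions:]]
    return selected, rejected


def toy_value(candidate: Candidate) -> float:
    # Stand-in for the controller's value estimate: prefers lr close to 0.01.
    return -abs(candidate["lr"] - 0.01)


rng = random.Random(0)
selected, rejected = oversample_and_rank(
    propose=lambda: {"lr": rng.uniform(1e-4, 1e-1)},
    value_fn=toy_value,
    n_suggestions=4,
)
```

The real rewards for `selected` would only arrive in the observe step, while the penalized `rejected` pairs are already available in the suggest step.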

cosmicBboy commented 4 years ago

idea 9: use Random Network Distillation, applied to the value of the next time step
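
A minimal sketch of the RND novelty bonus itself, under simplifying assumptions: the target is a fixed random tanh layer and the predictor is a plain linear map trained with SGD. The class name and its parameters are hypothetical, and how the bonus would be combined with the value of the next time step is left open here.

```python
import math
import random
from typing import List, Sequence


class RandomNetworkDistillation:
    """Novelty bonus = squared error between a fixed random net and a trained predictor."""

    def __init__(self, in_dim: int, out_dim: int = 8, lr: float = 0.1, seed: int = 0):
        rng = random.Random(seed)
        # Target network: random weights that are never updated.
        self.target = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]
        # Predictor: a linear map trained to imitate the target on visited inputs.
        self.predictor = [[0.0] * in_dim for _ in range(out_dim)]
        self.lr = lr

    def _errors(self, x: Sequence[float]) -> List[float]:
        target_out = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in self.target]
        pred_out = [sum(w * xi for w, xi in zip(row, x)) for row in self.predictor]
        return [p - t for p, t in zip(pred_out, target_out)]

    def bonus(self, x: Sequence[float]) -> float:
        # Large for inputs the predictor has not been trained on.
        return sum(e * e for e in self._errors(x))

    def update(self, x: Sequence[float]) -> None:
        # One SGD step on the squared error with respect to the predictor weights.
        # Assumes inputs are scaled so that lr * ||x||^2 < 1, which keeps the step stable.
        for row, e in zip(self.predictor, self._errors(x)):
            for j, xi in enumerate(x):
                row[j] -= self.lr * 2.0 * e * xi


rnd = RandomNetworkDistillation(in_dim=3)
seen = [0.5, 0.2, 0.9]
bonus_before = rnd.bonus(seen)
for _ in range(20):
    rnd.update(seen)
bonus_after = rnd.bonus(seen)  # shrinks as the input becomes familiar
```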

cosmicBboy commented 4 years ago

idea 10: try a Q actor-critic method instead of the advantage function
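
For reference, the two policy-gradient estimators differ only in the term that weights the log-probability gradient. This is the standard textbook notation, not taken from the metalearn code; $Q_w$ denotes a learned critic.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s, a)\right]
\qquad \text{vs.} \qquad
\nabla_\theta J(\theta)
  = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, A(s, a)\right],
\quad A(s, a) = Q(s, a) - V(s)
```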

cosmicBboy commented 4 years ago

idea 11: use a simpler policy architecture, with a multivariate normal distribution that produces all hyperparameters jointly instead of sequentially with an RNN
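
A small sketch of what joint sampling could look like, assuming the policy outputs a mean vector and a lower-triangular Cholesky factor. `sample_joint`, the `bounds` dict (a simplified stand-in for the ranges in api_config), and the example hyperparameters are all hypothetical.

```python
import random
from typing import Dict, List, Tuple


def sample_joint(
    mean: List[float],
    scale_tril: List[List[float]],
    bounds: Dict[str, Tuple[float, float]],
    rng: random.Random,
) -> Dict[str, float]:
    """Draw all hyperparameters in one shot from N(mean, L @ L.T), clipped to bounds."""
    z = [rng.gauss(0.0, 1.0) for _ in mean]  # independent standard normal noise
    sample = {}
    for i, (name, (low, high)) in enumerate(bounds.items()):
        # Row i of the lower-triangular factor L mixes the first i + 1 noise terms,
        # which is what correlates the hyperparameters with each other.
        raw = mean[i] + sum(scale_tril[i][j] * z[j] for j in range(i + 1))
        sample[name] = min(max(raw, low), high)
    return sample


bounds = {"learning_rate": (1e-4, 1e-1), "momentum": (0.0, 0.99)}
sample = sample_joint(
    mean=[0.01, 0.9],
    scale_tril=[[0.005, 0.0], [0.02, 0.05]],
    bounds=bounds,
    rng=random.Random(0),
)
```

Clipping is the bluntest way to respect the bounds; squashing the raw sample through a sigmoid and rescaling would be another option.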

cosmicBboy commented 4 years ago

idea 12: use model-based RL to estimate the reward function (the function approximator can even be a Gaussian process!)
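
A bare-bones sketch of a Gaussian process reward model, assuming a zero-mean prior and an RBF kernel, written in plain Python to stay dependency-free. `GPRewardModel` and its parameters are hypothetical; it only returns the posterior mean and nothing here is tied to the metalearn controller.

```python
import math
from typing import List, Sequence


def rbf(a: Sequence[float], b: Sequence[float], length_scale: float) -> float:
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * sq_dist / length_scale ** 2)


def cholesky(A: List[List[float]]) -> List[List[float]]:
    """Lower-triangular L with L @ L.T == A, for symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_cholesky(L: List[List[float]], b: Sequence[float]) -> List[float]:
    """Solve (L @ L.T) x = b by forward then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


class GPRewardModel:
    """Predicts the reward of an unseen configuration from observed (config, reward) pairs."""

    def __init__(self, length_scale: float = 0.5, noise: float = 1e-6):
        self.length_scale = length_scale
        self.noise = noise

    def fit(self, X: List[Sequence[float]], rewards: Sequence[float]) -> None:
        K = [
            [rbf(a, b, self.length_scale) + (self.noise if i == j else 0.0)
             for j, b in enumerate(X)]
            for i, a in enumerate(X)
        ]
        self.X = X
        self.alpha = solve_cholesky(cholesky(K), rewards)  # (K + noise * I)^-1 y

    def predict(self, x: Sequence[float]) -> float:
        # Posterior mean: k(x, X) @ alpha
        return sum(rbf(x, xi, self.length_scale) * a for xi, a in zip(self.X, self.alpha))


model = GPRewardModel()
model.fit(X=[[0.0], [0.5], [1.0]], rewards=[-1.0, 0.3, 1.0])
predicted = model.predict([0.5])  # close to the observed reward at this point
```

The controller could then be trained against `model.predict` between real evaluations, which is where the sample-efficiency gain would come from.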

cosmicBboy / ml-research
