for each suggestion batch, generate n random initial points.
for each initial point, the agent generates m action adjustments as
a function of the hyperparameter values, the previous adjustments,
and the previous reward.
this procedure yields n x m candidates.
rank them by predicted value and select the top n_suggestions.
update the controller based on the rewards of the selected candidates.
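a rough numpy sketch of this loop; the controller, the value model, and
the search-space sampler here are toy stand-ins I made up, not part of
any existing API:

import numpy as np

rng = np.random.default_rng(0)

def sample_point(dim=3):
    # random initial point in the search space (toy: a unit cube)
    return rng.uniform(-1.0, 1.0, size=dim)

class RandomController:
    # stand-in for the learned controller: small random adjustments
    def act(self, point, prev_adjustment, prev_reward):
        return rng.normal(scale=0.1, size=point.shape)

def predict_value(candidate):
    # stand-in value model: closer to the origin is "better"
    return -float(np.linalg.norm(candidate))

def suggest_batch(controller, n=8, m=4, n_suggestions=5, prev_reward=0.0):
    candidates = []
    for _ in range(n):
        anchor = sample_point()
        prev_adj = np.zeros_like(anchor)
        for _ in range(m):
            # adjustment conditioned on the point, the previous
            # adjustment, and the previous reward
            adj = controller.act(anchor, prev_adj, prev_reward)
            candidates.append(anchor + adj)   # n * m candidates total
            prev_adj = adj
    candidates.sort(key=predict_value, reverse=True)  # rank by predicted value
    return candidates[:n_suggestions]

print(suggest_batch(RandomController()))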
implement swarm logic for generating suggestions:
create a local agent for each of the n randomly initialized points.
the global agent is a metalearning agent that decides which candidates
to select from the m actions of each of the n local agents.
given the suggestion, the anchor observation, and the adjustment action,
the global agent produces a value estimate and a probability between
0 and 1 that decides whether to include the candidate.
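a minimal torch sketch of this interface; concatenating the three inputs
and the layer sizes are assumptions on my part:

import torch
import torch.nn as nn

class GlobalAgent(nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(3 * dim, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)    # critic: value estimate
        self.select_head = nn.Linear(hidden, 1)   # actor: inclusion logit

    def forward(self, suggestion, anchor, adjustment):
        h = self.body(torch.cat([suggestion, anchor, adjustment], dim=-1))
        value = self.value_head(h).squeeze(-1)
        p_select = torch.sigmoid(self.select_head(h)).squeeze(-1)  # in (0, 1)
        return value, p_select

agent = GlobalAgent(dim=3)
v, p = agent(torch.randn(3), torch.randn(3), torch.randn(3))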
the vanilla version can be an FFN that produces estimates sequentially,
until n_suggestions have been selected.
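a sketch of that sequential loop, reusing the GlobalAgent stub above;
sampling the include/skip decision from p_select is an assumption:

import torch

def select_suggestions(agent, candidates, anchors, adjustments, n_suggestions):
    # walk the candidates in order, sampling the include/skip decision,
    # until n_suggestions have been selected
    chosen = []
    for cand, anchor, adj in zip(candidates, anchors, adjustments):
        value, p_select = agent(cand, anchor, adj)
        if torch.bernoulli(p_select).item() == 1.0:
            chosen.append(cand)
        if len(chosen) == n_suggestions:
            break
    return chosen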
the metalearning version would be an RNN that decodes the sequence of
rewards and actions, stopping once n_suggestions have been selected.
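a sketch of the recurrent variant with a GRU cell; the input layout
(adjustment concatenated with the last reward) and the stopping rule
are assumptions:

import torch
import torch.nn as nn

class RecurrentSelector(nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.hidden = hidden
        self.rnn = nn.GRUCell(dim + 1, hidden)  # input: adjustment + last reward
        self.select_head = nn.Linear(hidden, 1)

    def select(self, adjustments, rewards, n_suggestions):
        # adjustments: (T, dim), rewards: (T,)
        h = torch.zeros(1, self.hidden)
        chosen = []
        for t in range(adjustments.shape[0]):
            x = torch.cat([adjustments[t], rewards[t:t + 1]]).unsqueeze(0)
            h = self.rnn(x, h)
            p = torch.sigmoid(self.select_head(h)).squeeze()
            if torch.bernoulli(p).item() == 1.0:
                chosen.append(t)
            if len(chosen) == n_suggestions:  # stop once enough are selected
                break
        return chosen

selector = RecurrentSelector(dim=3)
print(selector.select(torch.randn(10, 3), torch.randn(10), n_suggestions=4))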
the critic loss would be the error between the reward and the value
estimate (e.g. its square).
the actor loss would be the log probability of the "select candidate"
action.
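one way to write these losses; weighting the log-prob by the advantage
(reward minus value) is a standard REINFORCE choice I am assuming, not
something stated above:

import torch

def actor_critic_losses(reward, value_estimate, p_select):
    advantage = reward - value_estimate
    critic_loss = advantage.pow(2).mean()  # squared value-estimate error
    # log-prob of the "select candidate" action, weighted by the advantage;
    # the critic is detached so only the actor head gets this gradient
    actor_loss = -(torch.log(p_select) * advantage.detach()).mean()
    return actor_loss, critic_loss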
crazy idea: maybe use the global agent to further train the local
agents by feeding them its value estimates as the reward for the
candidates that were not selected. this could introduce a lot of bias
into the system if the global agent's value estimates are far off.
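a sketch of that pseudo-reward assignment; "observed" is a hypothetical
mapping from selected candidate index to real reward:

def pseudo_rewards(n_candidates, observed, value_estimates):
    # selected candidates keep their observed reward; unselected ones
    # fall back to the critic's value estimate as a pseudo-reward
    return [observed.get(i, value_estimates[i]) for i in range(n_candidates)]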
idea: to "refresh" the random initial points, randomly perturb them
every t turns.
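a minimal sketch, assuming gaussian noise and an arbitrary scale and
period:

import numpy as np

rng = np.random.default_rng()

def maybe_refresh(points, turn, t=10, scale=0.05):
    # every t turns, jitter the anchor points with gaussian noise
    if turn > 0 and turn % t == 0:
        return points + rng.normal(scale=scale, size=points.shape)
    return points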