In the UCB1 and UCB1-tuned algorithm implementations, the calculation of the upper confidence bound should include the term sqrt( 2.0 * ln(bandit_trials) / arm_trials ).
The implementation here flips (bandit_trials) and (arm_trials) in the calculation of this term. The risk is that arms with no trials (or very few pulls) don't receive the large exploration bonus they should, so they aren't reevaluated as often, and the bandit has a higher risk of locking onto a non-optimal arm early on.
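For reference, here is a minimal sketch of the correct exploration term (this is not MOE's actual code; the function name and the handling of unpulled arms are purely illustrative):

```python
import math

def ucb1_exploration_term(bandit_trials, arm_trials):
    """Exploration bonus for a single arm under UCB1.

    Correct form: sqrt(2 * ln(total pulls across all arms) / pulls of this arm),
    so an arm with few pulls receives a large bonus and gets revisited.
    """
    if arm_trials == 0:
        return float('inf')  # untried arms should always be sampled first
    return math.sqrt(2.0 * math.log(bandit_trials) / arm_trials)

# The flipped version, sqrt(2 * ln(arm_trials) / bandit_trials), shrinks as the
# bandit accumulates trials regardless of how rarely a given arm has been pulled,
# so under-sampled arms stop being revisited.
```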
https://github.com/Yelp/MOE/blob/master/moe/bandit/ucb/ucb1.py#L64-L65
https://github.com/Yelp/MOE/blob/master/moe/bandit/ucb/ucb1_tuned.py#L78-L79
(sorry, not sure how to insert a gist into this issue)
Reference: http://homes.di.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf, pages 3 & 11.
The algorithm is also stated correctly in the docstrings of these methods; it's just implemented incorrectly. Is there a specific reason for this?
I have a pull request + test ready to go if this is a genuine issue.