Bias in Tie-Breaking for Equivalent Arms in EpsilonGreedy and Linear MABs

fidelity / mabwiser

[IJAIT 2021] MABWiser: Contextual Multi-Armed Bandits Library

Apache License 2.0

213 stars 42 forks source link

Hi, thanks for the excellent mabwiser package!

I’m experimenting with delayed rewards and have noticed that some MAB algorithms show bias when selecting between equivalent arms, i.e., when splitting ties.

For instance, with two new arms (no prior rewards), the Random policy selects one arm uniformly at random, as expected. However, EpsilonGreedy uses utils.argmax on the expected rewards (both initially 0), which always selects the first arm due to Python’s max behavior, instead of choosing between the two arms uniformly at random. This issue arises whenever arms have equivalent expected rewards: the first arm is always selected.

While this might not seem critical, the bias becomes pronounced with delayed rewards — for example, when rewards are only received after 1000 decisions. During this initial phase, the performance (accuracy or regret) of EpsilonGreedy differs significantly from that of the Random policy, although they should be identical. Even in later stages, always selecting the first arm among those with the highest reward leads to odd behavior.

The same issue occurs with Linear MABs, as they rely on np.argmax, which doesn’t handle ties impartially: np.argmax() consistently returns the first index among maximum values. Consequently, using LinGreedy in the scenario described above yields similar biased results.

I’m working on a pull request to address these issues (see this branch), but many tests in mabwiser currently fail with my changes. It’s possible I’ve made an error, or the tests might need adjustments. Before proceeding further, I’d like to start a conversation about this problem.

PS: I’m working on creating a small, reproducible example to illustrate the issue, which could serve as the basis for a new unit test.

Hi @emanuele, thank you for the positive remarks and pointing this out.

We do rely on standard Python max() operation which returns the first index. This might work for most cases but not all, as in the scenario you mentioned above.

In my mind, the easiest first thing would be to implement a customized policy.

In this case, for a customized mab, you inherit from the policy you want (here, epsilon greedy), and then overwrite the predict() method for the desired behavior. In this scenario, instead of the max, we return the min regret by looking at the diff between the top two max. You can implement any custom behavior as needed. Under the hood, everything is calculated as before, and we retrieve expectations via predict_expectations() to decide what to do --in your case, a random tie breaking among the max entries.

In future, a random tie breaking might be considered as a configuration, but as you noted this requires non-trivial fixing of the tests as well as how to design the API for such a config. Possibly doable but the issue would be compatibility with the behavior of systems where max-first is now baked into downstream systems. The ability to wrap a custom behavior might be a lot easier in short term.

Hope this makes sense, Serdar

fidelity / mabwiser

Bias in Tie-Breaking for Equivalent Arms in EpsilonGreedy and Linear MABs #96