fidelity / mabwiser

[IJAIT 2021] MABWiser: Contextual Multi-Armed Bandits Library
https://fidelity.github.io/mabwiser/
Apache License 2.0

There's no good way of getting the rewards of arms, period. #88

Closed: aberges-grd closed this issue 10 months ago

aberges-grd commented 1 year ago

I saw #86 and how it was marked as "closed" by a developer after suggesting we access a private field of the MAB class in order to get the arm rewards of e.g. EpsilonGreedy, because the original method predict_expectations has a random chance to return random rewards.

As I already commented, this is not only bad practice, it also (predictably) leads to broken behavior (that is no longer trivial to work around) when using it on nonparametric contextual bandits.

My suggestion is to either rewrite the predict_expectations method so that it only returns the expectations with no extra RNG, or (in case this RNG was implemented for some theoretical reason) to add a get_expectations method that returns the values.

skadio commented 1 year ago

(replied in another thread and posting it here as well as a reference for others who might come across this)

==

Thank you for sharing your thoughts -- it is great to see a passionate user base of our library that deeply cares about implementation details and best practices.

Everyone seems in agreement that the right way to access the expectations of arms is via the predict_expectations() method. This should suffice in most use cases for most users, if not always.

One subtle thing to point out is that the statement below is not always true:

The rng part should only apply when predicting arms, not rewards.

Randomness plays a part in certain learning policies. Consider, for instance, Thompson Sampling. By definition, the rewards/expectations are sampled. This is not an implementation or design choice; it is dictated by the algorithm itself. For instance, if we call predict_expectations() twice for a non-contextual TS bandit, we will receive two different outputs, which is the expected behavior: the algorithm is designed to be able to make different decisions at each decision step in the sequence.

A higher-level comment on how to benefit from predict_expectations(): the main motivation behind making this available in our public API is to enable customized decision policies for different applications while offloading the mechanics of calculating expectations to the library.

In our IJAIT'21 and ICTAI'19 papers, we show how to design custom applications. A simple example is a decision policy that makes an arm selection by considering expectations but also the cost incurred by each arm. One can easily write a custom get_decision() method that wraps around predict_expectations() and implements application-specific decisions, as in the sketch below.

Hope this clarifies things. Thanks again for your comments, and please consider a GitHub Star to spread the word.