Axelrod-Python / Axelrod

A research tool for the Iterated Prisoner's Dilemma
http://axelrod.readthedocs.org/

Greedy and EpsilonGreedy strategies, using Multi-armed Bandits algorithms #1438

Open bing-j opened 6 months ago

bing-j commented 6 months ago

Hello! I wrote some strategies that use multi-armed bandit algorithms. Originally I only wanted to implement the epsilon-greedy strategy, but I now plan to extend this effort and implement all of the algorithms covered in the multi-armed bandit chapter of Sutton and Barto's Reinforcement Learning: An Introduction (I added the reference to the bibliography). The branch name is therefore no longer very representative; this branch adds both Greedy and EpsilonGreedy.

Greedy: Always chooses the action with the highest estimated expected "reward" (average score), calculated from its own previous turns. The reward estimate is updated incrementally and can optionally be recency-weighted; the initial expected reward of each action defaults to zero unless overridden through a parameter.
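For concreteness, here's a minimal sketch of the incremental update rule (the standard bandit form from the textbook; the names are illustrative, not the exact code in the PR). A step size of 1/n gives a running average, while a constant step size in (0, 1] gives recency weighting:

```python
# Minimal sketch of the incremental reward update (illustrative names, not
# the PR's actual code). With step size 1/n this is a running average; a
# constant step size in (0, 1] weights recent rewards more heavily.

def update_estimate(q, reward, step_size):
    """Move the estimate q toward the observed reward."""
    return q + step_size * (reward - q)

# Running-average example: after rewards 3, 0, 5 the estimate is their mean.
q, n = 0.0, 0
for reward in [3, 0, 5]:
    n += 1
    q = update_estimate(q, reward, 1 / n)
print(q)  # 2.666... (= mean of 3, 0, 5)
```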

EpsilonGreedy: Behaves like Greedy with probability 1 − ε, but acts uniformly at random with probability ε.
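And a minimal sketch of the ε-greedy action selection (again illustrative, not the PR's exact code):

```python
import random

# Illustrative epsilon-greedy action selection (not the PR's actual code).
# `expected` maps each action (e.g. "C", "D") to its current reward estimate.

def epsilon_greedy_action(expected, epsilon):
    """With probability epsilon explore uniformly; otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(list(expected))
    return max(expected, key=expected.get)

print(epsilon_greedy_action({"C": 2.5, "D": 3.0}, epsilon=0.1))  # usually "D"
```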

Both strategies are described in detail in the textbook mentioned above.

As I mentioned on gitter, I was unable to find any existing strategies that implement these algorithms, although I did find some similar ones. For example, Adaptive() works similarly to Greedy() without recency weighting, but it has a hard-coded initial sequence and uses the raw sum of scores, rather than the average score, to choose the optimal play. (Side note: the comments in Adaptive().strategy() indicate that it was intended to use the highest average; this may be a bug!) If similar strategies already exist, and/or there are any modifications I need to make to the code, please let me know!

Cheers :)

marcharper commented 6 months ago

Looks like we broke the test invocation with some recent commits; I'll try to fix it. You'll also need to update one of the doctests to indicate that two new strategies have been added.

marcharper commented 5 months ago

Thanks for the updates. The test that's failing is:

======================================================================
FAIL: test_strategy (axelrod.tests.strategies.test_meta.TestNMWEDeterministic.test_strategy)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/Axelrod/Axelrod/axelrod/tests/strategies/test_meta.py", line 636, in test_strategy
    self.versus_test(
  File "/home/runner/work/Axelrod/Axelrod/axelrod/tests/strategies/test_player.py", line 580, in versus_test
    test_match.versus_test(
  File "/home/runner/work/Axelrod/Axelrod/axelrod/tests/strategies/test_player.py", line 665, in versus_test
    self.assertEqual((i, play), (i, expected_play))
AssertionError: Tuples differ: (2, D) != (2, C)

First differing element 1:
D
C

- (2, D)
?     ^ 

+ (2, C)
?     ^

This is happening because some of the ensemble (Meta) strategies build on every strategy in the library, so the behavior of one of them changed when these new strategies were added. You can run the affected tests with something like

python -m unittest axelrod.tests.strategies.test_meta

I think in this case you just need to update the expected output that has changed.
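For illustration, the fix is usually a one-entry change to the expected actions passed to versus_test. A hypothetical before/after (the real sequences in test_meta.py are longer; these are made up):

```python
from axelrod.action import Action

C, D = Action.C, Action.D

# Hypothetical illustration (made-up sequences): the failing assertion
# reports the actual play at index 2 as D where the test expected C, so
# that entry in the expected-actions list flips.
old_expectation = [(C, C), (C, D), (C, D)]
new_expectation = [(C, C), (C, D), (D, D)]
```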

marcharper commented 3 months ago

Hi @bing-j, if you rebase onto the dev branch now the failing test should pass.