carykh / PrisonersDilemmaTournament


Only running each pairing once could result in too much randomness affecting results #17

Open · valadaptive opened this issue 3 years ago

valadaptive commented 3 years ago

The current prisonersDilemma.py script only pits each algorithm against every other algorithm once, meaning that randomness (both in the algorithms themselves and in the round length) can significantly affect the results. A fix would be to run each pairing many times over.

MarcHenriot commented 3 years ago

That's right. For anyone interested in this approach, here's a modified runFullPairingTournament that repeats the whole tournament n times and totals the scores:

```python
import itertools
import os

import numpy as np
from tqdm import tqdm

# runRound, tallyRoundScores, pad, and STRATEGY_FOLDER are the ones
# already defined in prisonersDilemma.py.

def runFullPairingTournamentN(inFolder=STRATEGY_FOLDER, n=1000):
    finalScoreKeeper = {}
    STRATEGY_LIST = []

    # Collect every .py file in the strategy folder, minus the extension.
    for file in os.listdir(inFolder):
        if file.endswith(".py"):
            STRATEGY_LIST.append(file[:-3])

    for strategy in STRATEGY_LIST:
        finalScoreKeeper[strategy] = 0

    # Repeat the full round-robin tournament n times, summing the scores.
    for _ in tqdm(range(n)):
        scoreKeeper = {strategy: 0 for strategy in STRATEGY_LIST}

        for pair in itertools.combinations(STRATEGY_LIST, r=2):
            roundHistory = runRound(pair)
            scoresA, scoresB = tallyRoundScores(roundHistory)
            scoreKeeper[pair[0]] += scoresA
            scoreKeeper[pair[1]] += scoresB

        for strategy in STRATEGY_LIST:
            finalScoreKeeper[strategy] += scoreKeeper[strategy]

    # Rank strategies by total score across all n tournaments.
    scoresNumpy = np.zeros(len(STRATEGY_LIST))
    for i in range(len(STRATEGY_LIST)):
        scoresNumpy[i] = finalScoreKeeper[STRATEGY_LIST[i]]
    rankings = np.argsort(scoresNumpy)

    for rank in range(len(STRATEGY_LIST)):
        i = rankings[-1 - rank]
        score = scoresNumpy[i]
        scorePerTournament = score / n
        scorePerMatch = scorePerTournament / (len(STRATEGY_LIST) - 1)
        print(f'#{pad(str(rank + 1) + ":", 3)} {pad(STRATEGY_LIST[i] + ":", 18)} {score:f} ({scorePerTournament:f} avg per tournament) ({scorePerMatch:f} avg per match)')
```
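A full run is then just `runFullPairingTournamentN(STRATEGY_FOLDER, n=1000)` wherever the script currently calls `runFullPairingTournament`.
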
nobody5050 commented 3 years ago

I’ve been using a modified version of that function myself for testing, but this one looks much cleaner. Thanks!

Bip901 commented 3 years ago

Agreed; my submission consistently ranks 1st when I run it a few times and average the results, but it sometimes ranks as low as 3rd when running only once.

carykh commented 3 years ago

Thanks for pointing this out! Yeah, quite a few people noticed how much randomness had an effect on the leaderboard's ordering. I will probably do something like @peopleOfPalay's edit, where I run the tournament over and over and take the average of all of them. I'll try to do 1,000 iterations, but I hope the code doesn't take too long to run, since it's O(n^2) in the number of submissions!

valadaptive commented 3 years ago

You could take advantage of multiprocessing. I can create a PR for that if you want!
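
For anyone curious, a minimal sketch of that idea. It assumes runRound and tallyRoundScores are the top-level functions from prisonersDilemma.py (worker processes need to be able to import them); playPair and runTournamentParallel are hypothetical names:

```python
import itertools
from multiprocessing import Pool

def playPair(pair):
    # Worker: play one match and return the pair with both scores.
    # runRound and tallyRoundScores come from prisonersDilemma.py.
    scoresA, scoresB = tallyRoundScores(runRound(pair))
    return pair, scoresA, scoresB

def runTournamentParallel(strategyList, processes=4):
    # Every pairing is independent, so matches can run in parallel.
    scoreKeeper = {s: 0 for s in strategyList}
    pairs = list(itertools.combinations(strategyList, r=2))
    with Pool(processes) as pool:
        for (a, b), scoresA, scoresB in pool.map(playPair, pairs):
            scoreKeeper[a] += scoresA
            scoreKeeper[b] += scoresB
    return scoreKeeper
```

Since the pairings don't depend on each other, the O(n^2) work should scale roughly with the number of processes.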

TimStanglow commented 3 years ago

Well, as the number of matches each algorithm plays increases, the actual effect of randomness goes down. So the more participants there are, the fewer iterations are needed. But since we're already talking about performance, you could accept my pull request, which improves it significantly.

EFHIII commented 3 years ago

For matches between two strategies that don't use randomness, repeated matches will always have the same outcome. So you could, for example, run matches 10 times, and if the first 200 moves are the same for both players in all 10 games, take the average from just those 10; otherwise you know RNG played a part, and you can repeat 990 more times.

This would greatly reduce the running time without significantly changing the results compared to running everything 1000 times.

There are a few exceptions where strategies using RNG could falsely be flagged as deterministic by the above: for example, strategies that only use RNG after the 200th move, or strategies where RNG has only a tiny chance of changing anything; maybe 0.05% of the time they defect on the first move, or something.

Another method you could use is checking whether the source code includes the word "random", since a strategy probably isn't using RNG if it doesn't. This would give false positives instead of false negatives, which might be preferred. It's still probably a good idea to run matches not using RNG a few times, since the match length is random and that can affect the score.
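
A rough sketch of that early exit, assuming runRound returns the move history as a 2×n numpy array, tallyRoundScores returns both players' scores for one game, and every game lasts at least 200 turns (as the repo's random length formula implies); averageScores and the constants are hypothetical:

```python
import numpy as np

PROBE_GAMES = 10
TOTAL_GAMES = 1000

def averageScores(pair):
    # Play a small probe batch first.
    histories = [runRound(pair) for _ in range(PROBE_GAMES)]
    # Compare the 200-move prefix of every probe game (assumes games
    # always last at least 200 turns, so the prefix exists).
    first = histories[0][:, :200]
    deterministic = all(np.array_equal(h[:, :200], first) for h in histories[1:])
    if not deterministic:
        # RNG clearly played a part, so run the remaining 990 games too.
        histories += [runRound(pair) for _ in range(TOTAL_GAMES - PROBE_GAMES)]
    scores = [tallyRoundScores(h) for h in histories]
    avgA = sum(s[0] for s in scores) / len(scores)
    avgB = sum(s[1] for s in scores) / len(scores)
    return avgA, avgB
```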

duckboycool commented 3 years ago

> For matches between two strategies that don't use randomness, repeated matches will always have the same outcome. So you could, for example, run matches 10 times, and if the first 200 moves are the same for both players in all 10 games, take the average from just those 10; otherwise you know RNG played a part, and you can repeat 990 more times.
>
> This would greatly reduce the running time without significantly changing the results compared to running everything 1000 times.
>
> There are a few exceptions where strategies using RNG could falsely be flagged as deterministic by the above: for example, strategies that only use RNG after the 200th move, or strategies where RNG has only a tiny chance of changing anything; maybe 0.05% of the time they defect on the first move, or something.
>
> Another method you could use is checking whether the source code includes the word "random", since a strategy probably isn't using RNG if it doesn't. This would give false positives instead of false negatives, which might be preferred. It's still probably a good idea to run matches not using RNG a few times, since the match length is random and that can affect the score.

I think it would be best to check whether the random module was imported in the strategy. While this could lead to false negatives if somebody uses randomness that doesn't come from the random module (such as numpy.random), you could fix that by adding `import random` to any strategy that uses randomness without importing it, since all the files have to be checked anyway. You could also remove random imports from strategies that don't use them, to decrease the false positive rate.

After removing unnecessary imports, I found that a basic cache with a dictionary gave a 3-4x speedup for multiple iterations on my test set with a few dozen strategies.
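
Something like this is what I mean, sketched with hypothetical helper names (usesRandomness, cachedRound); runRound and STRATEGY_FOLDER are from prisonersDilemma.py:

```python
import os
import re

def usesRandomness(inFolder, name):
    # Heuristic: treat a strategy as random if its source imports the
    # random module. (numpy.random etc. would need separate handling.)
    with open(os.path.join(inFolder, name + ".py")) as f:
        return re.search(r'^\s*(import|from)\s+random\b', f.read(), re.M) is not None

matchCache = {}

def cachedRound(pair):
    # Play each deterministic pairing once, then reuse the history.
    # (Note: this ignores the random match length, raised below.)
    if pair not in matchCache:
        history = runRound(pair)
        if usesRandomness(STRATEGY_FOLDER, pair[0]) or usesRandomness(STRATEGY_FOLDER, pair[1]):
            return history  # random pairing: never cache
        matchCache[pair] = history
    return matchCache[pair]
```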

nobody5050 commented 3 years ago

> > For matches between two strategies that don't use randomness, repeated matches will always have the same outcome. So you could, for example, run matches 10 times, and if the first 200 moves are the same for both players in all 10 games, take the average from just those 10; otherwise you know RNG played a part, and you can repeat 990 more times. This would greatly reduce the running time without significantly changing the results compared to running everything 1000 times. There are a few exceptions where strategies using RNG could falsely be flagged as deterministic by the above: for example, strategies that only use RNG after the 200th move, or strategies where RNG has only a tiny chance of changing anything; maybe 0.05% of the time they defect on the first move, or something. Another method you could use is checking whether the source code includes the word "random", since a strategy probably isn't using RNG if it doesn't. This would give false positives instead of false negatives, which might be preferred. It's still probably a good idea to run matches not using RNG a few times, since the match length is random and that can affect the score.
>
> I think it would be best to check whether the random module was imported in the strategy. While this could lead to false negatives if somebody uses randomness that doesn't come from the random module (such as numpy.random), you could fix that by adding `import random` to any strategy that uses randomness without importing it, since all the files have to be checked anyway. You could also remove random imports from strategies that don't use them, to decrease the false positive rate.
>
> After removing unnecessary imports, I found that a basic cache with a dictionary gave a 3-4x speedup for multiple iterations on my test set with a few dozen strategies.

Some strategies might import random but not actually use it, though.

duckboycool commented 3 years ago

> Some strategies might import random but not actually use it, though.

From my earlier comment:

> You could also remove random imports from strategies that don't use them, to decrease the false positive rate.
>
> After removing unnecessary imports, I found that a basic cache with a dictionary gave a 3-4x speedup

Bip901 commented 3 years ago

Guys, don't forget that the game length is random. That can have an effect even on a match between two non-random strategies. You can't do that optimization.

EFHIII commented 3 years ago

> Guys, don't forget that the game length is random. That can have an effect even on a match between two non-random strategies. You can't do that optimization.

I didn't forget; I mentioned it at the end:

> It's still probably a good idea to run matches not using RNG a few times, since the match length is random and that can affect the score.

The difference that makes, compared to other forms of RNG, is probably far less noticeable. Historically, a lot of strategies cooperate until defected against, and in that situation the score is 3 for both sides no matter how long the match is. If you want to be as cautious as possible about that type of RNG, you could check whether both players' scores stay within a small range (+/- 3% or so) across a few rounds; if they do, it's pretty safe to just average from there and move on.
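
That stability probe could look something like this (scoresStable is a hypothetical helper, and it assumes tallyRoundScores returns both players' scores for one game):

```python
def scoresStable(pair, probes=5, tolerance=0.03):
    # Play a few probe games and check whether both players' scores
    # stay within +/- tolerance of their mean across those games.
    results = [tallyRoundScores(runRound(pair)) for _ in range(probes)]
    for side in (0, 1):
        vals = [r[side] for r in results]
        mean = sum(vals) / probes
        if max(abs(v - mean) for v in vals) > tolerance * abs(mean):
            return False  # too noisy: fall back to many repetitions
    return True  # stable enough to average the probes and move on
```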

duckboycool commented 3 years ago

If you want to be completely sure that the random game length is handled identically, you could cache the entire game history and both players' memories, then generate more of the round whenever the game length exceeds the stored history, and just return a prefix of it when it doesn't. This should produce exactly the same results as outright running the strategies the full number of times, and it would be quicker than rerunning even just a few times.

This could start running into memory issues on a larger set of strategies, but if that becomes a problem you could store the data so that each move takes 1 bit instead of 32.
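
A sketch of that continuation cache, assuming the repo's strategy(history, memory) -> (move, memory) interface with history as a 2×n array of 0/1 moves; runRoundCached, the module arguments, and length are hypothetical, where length is the randomly drawn game length for this run:

```python
import numpy as np

gameCache = {}  # (nameA, nameB) -> (history, memoryA, memoryB)

def runRoundCached(pair, length, moduleA, moduleB):
    # Reuse the cached prefix and only simulate turns beyond it.
    history, memA, memB = gameCache.get(
        pair, (np.zeros((2, 0), dtype=int), None, None))
    for _ in range(history.shape[1], length):
        moveA, memA = moduleA.strategy(history, memA)
        moveB, memB = moduleB.strategy(history[::-1], memB)  # B sees rows flipped
        history = np.hstack([history, [[moveA], [moveB]]])
    gameCache[pair] = (history, memA, memB)
    # Hand back only as many turns as this particular game lasts.
    return history[:, :length]
```

One caveat (picked up again below): this assumes a strategy behaves the same regardless of how many turns it will eventually play, which is exactly the consistency concern with continuing from cached history.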

EFHIII commented 3 years ago

If Cary plans to run the code multiple times, yes, building a cache is ideal, but otherwise, it's a waste of time and resources.

In the fork I'm working on, we're implementing a cache, but I see no need for Cary to use one if he only plans to run things once.

duckboycool commented 3 years ago

> If Cary plans to run the code multiple times, yes, building a cache is ideal, but otherwise, it's a waste of time and resources.
>
> In the fork I'm working on, we're implementing a cache, but I see no need for Cary to use one if he only plans to run things once.

It won't just be run once, though:

> I will probably do something like @peopleOfPalay's edit, where I run the tournament over and over and take the average of all of them.

Also, per the idea I mentioned before, continuing a game from cached history could be problematic for consistency, so it would probably be better to precompute the cache out to 500 turns or so once, and then run a game from the start again in the unlikely case that it's longer than that.