Bobingstern opened this issue 2 years ago:

In pit.py, inside the function n1p, shouldn't the return value be the argmax of the policy rather than a random choice?
If you used argmax, the two agents would play the exact same game over and over again, so it wouldn't be a good benchmark of their performance.
There are a few options that do work instead:
Extra note: a more aggressive temperature setting is used in pit than during training, so there is less exploring and more exploiting.
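For reference, here is a minimal sketch of what temperature-based selection looks like, assuming the MCTS search exposes per-action visit counts; the function and argument names are illustrative, not the repo's actual API:

```python
import numpy as np

def select_action(visit_counts, temp=1.0):
    """Turn MCTS visit counts into a move index.

    temp -> 0: effectively argmax (deterministic play).
    temp  = 1: sample proportionally to visit counts, so repeated
               games between the same two agents actually differ.
    """
    counts = np.asarray(visit_counts, dtype=np.float64)
    if temp == 0:
        return int(np.argmax(counts))                   # pure exploitation
    probs = counts ** (1.0 / temp)                      # sharpen/flatten the policy
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))   # stochastic choice
```

With `temp=0`, two identical agents replay the same game every time; with a small positive temperature each pit game diverges, so win rates over many games say something about relative strength.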
Ah, so it's used for benchmarking. I assume that if you were to deploy it against a human, you would use argmax then, correct?
Oh, actually, I haven't looked at this version of the repo in a while. pit-multi is for benchmarking; pit is for single-game tests. So I guess that to be optimal against a human you would use argmax, though you may still want a little bit of randomness at the beginning in some games. Otherwise it might keep going for the same opening, which could get dull.
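As a rough illustration of that idea (the move threshold and temperature values here are made up, not taken from the repo), you could sample only for the first few plies and switch to argmax afterwards:

```python
import numpy as np

def human_play_action(visit_counts, move_number, opening_moves=4, opening_temp=1.0):
    """Play greedily (argmax) against a human, but sample the first few
    moves so the agent does not repeat the same opening every game.
    `opening_moves` and `opening_temp` are arbitrary example values."""
    counts = np.asarray(visit_counts, dtype=np.float64)
    if move_number < opening_moves:
        probs = counts ** (1.0 / opening_temp)          # varied openings
        probs /= probs.sum()
        return int(np.random.choice(len(probs), p=probs))
    return int(np.argmax(counts))                       # strongest move afterwards
```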