Bobingstern opened this issue 2 years ago:

In pit.py, inside the function n1p, shouldn't the return value be the argmax of the policy rather than a random choice?
If you used argmax, the two agents would play the exact same game over and over again, so it wouldn't be a good benchmark of their performance.
There are a few options that do work instead:
Extra note: a more aggressive temperature setting is used in pit than during training, so there is less exploring and more exploiting.
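For reference, here is a minimal sketch of what temperature-based selection looks like, assuming the MCTS search exposes per-action visit counts; the function and argument names are illustrative, not the repo's actual API:

```python
import numpy as np

def select_action(visit_counts, temp=1.0):
    """Turn MCTS visit counts into a move index.

    temp -> 0: effectively argmax (deterministic play).
    temp  = 1: sample proportionally to visit counts, so repeated
               games between the same two agents actually differ.
    """
    counts = np.asarray(visit_counts, dtype=np.float64)
    if temp == 0:
        return int(np.argmax(counts))                   # pure exploitation
    probs = counts ** (1.0 / temp)                      # sharpen/flatten the policy
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))   # stochastic choice
```

With `temp=0`, two identical agents replay the same game every time; with a small positive temperature each pit game diverges, so win rates over many games say something about relative strength.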
Ah, so it's used for benchmarking. I assume that if you were to deploy it against a human, you would use argmax then, correct?
Oh, actually, I haven't looked at this version of the repo in a while. pit-multi is for benchmarking; pit is for single-game tests. So I guess that to be optimal against a human you would use argmax, though you may still want a little bit of randomness at the beginning in some games. Otherwise it might keep going for the same opening, which could get dull.
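As a rough illustration of that idea (the move threshold and temperature values here are made up, not taken from the repo), you could sample only for the first few plies and switch to argmax afterwards:

```python
import numpy as np

def human_play_action(visit_counts, move_number, opening_moves=4, opening_temp=1.0):
    """Play greedily (argmax) against a human, but sample the first few
    moves so the agent does not repeat the same opening every game.
    `opening_moves` and `opening_temp` are arbitrary example values."""
    counts = np.asarray(visit_counts, dtype=np.float64)
    if move_number < opening_moves:
        probs = counts ** (1.0 / opening_temp)          # varied openings
        probs /= probs.sum()
        return int(np.random.choice(len(probs), p=probs))
    return int(np.argmax(counts))                       # strongest move afterwards
```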