bhansconnect / fast-alphazero-general

A clean implementation based on Expert Iterations for any game, inspired by alpha-zero-general
MIT License

Pit.py policy #4

Open Bobingstern opened 2 years ago

Bobingstern commented 2 years ago

In pit.py inside the function n1p, shouldn't the return value be the argmax of the policy rather than a random choice?
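For illustration, a minimal sketch of the two behaviors being compared (assuming `pi` is the NumPy probability vector returned by the search; this is not the repo's exact code):

```python
# Illustrative only -- not the repo's actual n1p implementation.
# Assumes `pi` is the move-probability vector for the current board.
import numpy as np

pi = np.array([0.1, 0.6, 0.3])  # hypothetical move probabilities

greedy_move  = int(np.argmax(pi))                     # always picks move 1
sampled_move = int(np.random.choice(len(pi), p=pi))   # picks move 1 ~60% of the time
```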

bhansconnect commented 2 years ago

If you used argmax, the two agents would play the exact same game over and over again, so it wouldn't be a good benchmark of their performance.

There are a few options that do work instead:

Extra note: a more aggressive temperature setting is used in pit than in training, so there is less exploring and more exploiting.
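As a rough sketch of what temperature-scaled selection looks like (assuming `pi` is a NumPy probability vector from the search; the function name and default value are illustrative, not the repo's code):

```python
# Minimal sketch of temperature-scaled move selection.
import numpy as np

def select_move(pi, temperature=0.25):
    """Sample a move from pi raised to 1/temperature.

    temperature -> 0 approaches argmax (pure exploitation);
    temperature = 1 samples from pi unchanged.
    """
    if temperature == 0:
        return int(np.argmax(pi))
    scaled = pi ** (1.0 / temperature)
    scaled /= scaled.sum()
    return int(np.random.choice(len(scaled), p=scaled))
```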

Bobingstern commented 2 years ago

Ah, so it's used for benchmarking. I assume that if you were to deploy it against a human, you would use argmax then, correct?

bhansconnect commented 2 years ago

Oh, actually, I haven't looked at this version of the repo in a while. pit-multi is for benchmarking; pit is for single-game tests. So I guess to be optimal against a human you would use argmax. Though you may still want a little bit of randomness at the beginning in some games; otherwise, it might keep going for the same opening, which could get dull.
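One hypothetical way to get that early randomness while still playing argmax later in the game (names and the ply cutoff are illustrative assumptions, not taken from the repo):

```python
# Hypothetical deployment-time move selection: sample during the opening,
# then switch to argmax. Assumes `pi` is the search's probability vector
# and `move_number` counts plies from the start of the game.
import numpy as np

OPENING_PLIES = 4  # keep some variety for the first few moves only

def deploy_move(pi, move_number):
    if move_number < OPENING_PLIES:
        # light randomness early so the agent doesn't repeat one opening
        return int(np.random.choice(len(pi), p=pi))
    return int(np.argmax(pi))  # strongest (greedy) play afterwards
```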