Explain the context of the issue, what is being addressed in detail.
The goal
[x] Train with more epochs and more data samples.
[x] GPS model.
[x] Adjustments to the GPS block.
[ ] Investigate what causes the policy training to not go above 50% accuracy (is it a model architecture limitation?)
[x] Verify that the MCTS score is calculated correctly.
[x] Minor MCTS algorithm adjustments.
[x] Increased model base specs
[x] #97
During self-play, actions are no longer chosen by always taking the max-score action; they are sampled using a temperature parameter. A higher temperature makes the choices more random, while a lower temperature picks high-score actions more often.
[x] Analyze the impact of using a weighted scoring function instead of mean in pure MCTS
[x] Ignore training samples with only one action choice.
[x] Check the handling of empty samples too.
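A minimal sketch of the temperature-based action selection and the sample filtering listed above. The issue does not state the exact formula, so a softmax over `score / temperature` is assumed here. The names `temperature_probs`, `sample_action`, `keep_for_training` and the `action_scores` key are hypothetical and are not taken from the project code.

```python
import math
import random


def temperature_probs(scores, temperature):
    """Turn raw action scores into sampling probabilities.

    Assumption: softmax over score / temperature. The project may use a
    different rule; this only illustrates the behaviour described above.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(scores, temperature, rng=random):
    """Draw one action index from the temperature-scaled distribution."""
    probs = temperature_probs(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


def keep_for_training(sample):
    """Keep a sample only if it offers at least two action choices.

    Empty samples and samples with a single forced action are dropped.
    `sample` is assumed to be a dict with an `action_scores` list.
    """
    return len(sample["action_scores"]) > 1
```

With scores `[1.0, 2.0, 3.0]`, a temperature of `0.1` puts almost all probability on the last action, while a temperature of `100.0` makes the three actions nearly equally likely.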
Time tracking
Time Estimate: 0 hours 0 minutes
Time spent: 4 hours 20 minutes
Resources
...