Ludeme / LudiiExampleAI

Project with examples for the implementation of third-party AI algorithms / agents for the Ludii general game system.

DUCT example incorrectly does not use robust child move selection strategy #7

Closed. EklipZgit closed this issue 12 months ago.

EklipZgit commented 1 year ago

Hello! I have been using this project as an example of implementing SM-MCTS with DUCT in Python. While working through some issues I was having, I noticed that the finalMoveSelection function in https://github.com/Ludeme/LudiiExampleAI/blob/master/src/mcts/ExampleDUCT.java DECLARES that it uses the robust child strategy, but then plays the combined move with the highest average score instead. Presumably this means the algorithm is assuming the opponent makes the worst move possible rather than the best, since it treats that combined move as if it were 100% under the player's control rather than a simultaneous move.

I checked the UCT implementation https://github.com/Ludeme/LudiiExampleAI/blob/master/src/mcts/ExampleUCT.java and it DOES implement robust child (highest visit count), so I can only assume that the DUCT example was also meant to do the same and is simply incorrect.
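For reference, here's a rough sketch of the two final-selection strategies over a single player's decoupled statistics. The `ActionStats` record is hypothetical (it's not the repository's `Move` or node representation); it just shows how robust child (most visits) and max average score can disagree:

```java
import java.util.List;

public final class FinalSelectionSketch {

    /** Hypothetical per-action statistics for a single player, as accumulated by DUCT. */
    record ActionStats(int actionId, int visitCount, double scoreSum) {
        double averageScore() {
            return visitCount == 0 ? 0.0 : scoreSum / visitCount;
        }
    }

    /** Robust child: pick this player's own action with the most visits. */
    static ActionStats robustChild(final List<ActionStats> stats) {
        ActionStats best = null;
        for (final ActionStats s : stats) {
            if (best == null || s.visitCount() > best.visitCount()) {
                best = s;
            }
        }
        return best;
    }

    /** Max value: pick this player's own action with the highest average score. */
    static ActionStats maxAverageScore(final List<ActionStats> stats) {
        ActionStats best = null;
        for (final ActionStats s : stats) {
            if (best == null || s.averageScore() > best.averageScore()) {
                best = s;
            }
        }
        return best;
    }

    public static void main(final String[] args) {
        final List<ActionStats> stats = List.of(
                new ActionStats(0, 120, 60.0),   // avg 0.50
                new ActionStats(1, 300, 135.0),  // avg 0.45, most visits
                new ActionStats(2, 80, 56.0));   // avg 0.70, fewest visits
        System.out.println("Robust child picks action " + robustChild(stats).actionId());
        System.out.println("Max average picks action " + maxAverageScore(stats).actionId());
    }
}
```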

I'm already fixing this on my end in my Python adaptation, but I figured I'd let you know!

DennisSoemers commented 1 year ago

You're right that there is an inconsistency between the code and the comments, but I'm not sure yet whether we should change the code or the comments...

> but then plays the combined move with the highest average score instead. Presumably this means the algorithm is assuming the opponent makes the worst move possible rather than the best, since it treats that combined move as if it were 100% under the player's control rather than a simultaneous move.

This is not exactly what's happening. The entire point of decoupled UCT is the decoupling: we collect average scores for our own actions only, decoupled from any opponents' actions. So these scores do not assume that the opponent makes the worst move possible; rather, they reflect the average opponent behaviour across all simulations in which we happened to select this particular action.
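As a minimal sketch of what I mean by decoupling (a hypothetical node layout, not the actual code in this repository): each player keeps one statistics table indexed by its own actions only, and a simulation result updates each table under that player's own chosen action, regardless of what the opponents chose.

```java
/**
 * Sketch of decoupled statistics in a DUCT node. The field and method names
 * here are illustrative only and do not come from LudiiExampleAI.
 */
final class DuctNodeSketch {

    final int numPlayers;
    final int[][] visitCounts;   // [player][ownAction]
    final double[][] scoreSums;  // [player][ownAction]
    int totalVisits = 0;

    DuctNodeSketch(final int numPlayers, final int[] numActionsPerPlayer) {
        this.numPlayers = numPlayers;
        this.visitCounts = new int[numPlayers][];
        this.scoreSums = new double[numPlayers][];
        for (int p = 0; p < numPlayers; ++p) {
            visitCounts[p] = new int[numActionsPerPlayer[p]];
            scoreSums[p] = new double[numActionsPerPlayer[p]];
        }
    }

    /**
     * Backpropagation step: each player's table is updated only under the
     * action that player itself chose; the opponents' simultaneous choices
     * never appear as keys in this player's table.
     */
    void update(final int[] jointAction, final double[] utilities) {
        ++totalVisits;
        for (int p = 0; p < numPlayers; ++p) {
            final int ownAction = jointAction[p];
            ++visitCounts[p][ownAction];
            scoreSums[p][ownAction] += utilities[p];
        }
    }

    public static void main(final String[] args) {
        // Two players, each with three possible actions at this node.
        final DuctNodeSketch node = new DuctNodeSketch(2, new int[] { 3, 3 });
        // One simulation: player 0 played action 1, player 1 played action 2;
        // utilities are given from each player's own perspective.
        node.update(new int[] { 1, 2 }, new double[] { 1.0, -1.0 });
        System.out.println("Player 0, action 1 visits: " + node.visitCounts[0][1]);
    }
}
```

So the average score for one of our own actions already averages over whatever the opponents happened to do in those simulations.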

I have never really done much research with simultaneous-move games myself, so I'm actually not 100% sure whether DUCT works better with maximising visits or maximising scores. At the end of Section 3.1 of "Monte Carlo Tree Search in Simultaneous Move Games with Applications to Goofspiel" by Lanctot, Lisý and Winands, they actually do go for the maximum average score in the DUCT case (but mention visit counts as an alternative), whereas they go for a visits-based solution in other algorithms. So, probably there is a good reason to prefer scores over visits in DUCT. But I'm not sure yet what that reason would be...

EklipZgit commented 12 months ago

See the other issue I just filed, #8. I BELIEVE (I haven't run the Ludii implementation, and following the Move implementation was difficult) that the DUCT implementation is incorrectly giving the chooser agency over the opponents' responses by including the opponent responses in the set of legal 'moves' per player. That was CERTAINLY happening in my port, and it doesn't seem like I ported it incorrectly, based on how I read the Move class implementation and how you iterate in finalMoveSelection.
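To make concrete what I mean: roughly, each player's statistics should be keyed only by that player's own component of the joint move, never by the full combination. A small sketch with hypothetical JointMove / PlayerDecision types (these are not Ludii's Move class, just how I'm modelling it in my port):

```java
import java.util.List;

/**
 * Sketch of splitting a joint (simultaneous) move into per-player decisions,
 * so that each player's DUCT table is keyed only by its own decision.
 * JointMove and PlayerDecision are hypothetical types, not Ludii's Move API.
 */
final class MoveSplitSketch {

    record PlayerDecision(int player, int actionId) { }

    record JointMove(List<PlayerDecision> decisions) {
        /** The given player's own component; the opponents' components are ignored. */
        PlayerDecision decisionFor(final int player) {
            for (final PlayerDecision d : decisions) {
                if (d.player() == player) {
                    return d;
                }
            }
            throw new IllegalArgumentException("No decision for player " + player);
        }
    }

    public static void main(final String[] args) {
        final JointMove joint = new JointMove(List.of(
                new PlayerDecision(1, 4),    // our action
                new PlayerDecision(2, 7)));  // opponent's simultaneous action
        // Player 1's statistics should be updated under key 4 only,
        // never under the combined (4, 7) pair.
        System.out.println(joint.decisionFor(1));
    }
}
```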

Now that I've fixed that issue, I have not yet tried switching back to value-based selection, but if your implementation does indeed have the problem I detail in the other bug report, then value-based selection outputs incredibly bad results, because it further minimizes the agency of the opponent.

Now that I've fixed that issue in my implementation (which went from outputting this clearly wrong solution image to this correct solution image), I imagine I can revert the robust child change and give value-based selection a try. I suspect you're right and it will work fine now that the opponents' decisions are completely decoupled from the chooser after the other fix.

I'll give it a shot and report back here after I do some testing of robust child vs maximizing average score.

EklipZgit commented 12 months ago

Yeah, the max-average-value selection function is outputting fine results as well now that the other bugs are fixed.