JeroenOlieslagers / game_of_ur


Questions about your Implementation #1

Open Casper-Guo opened 1 month ago

Casper-Guo commented 1 month ago

Hi, I am working on a similar project and @sothatsit pointed me to your repo. I have looked it over, mostly for optimization inspiration, and have a few questions for you.

  1. How long does it take to run the full BFS?
  2. You chose to represent the board in 32 bits with some extra arithmetic (different numbers of bits for safe/unsafe squares, converting trits to binary, etc.). I am using a simpler 64-bit representation (see here) and the gain is much simpler board logic. What were your considerations in making the space/time tradeoff that you did?
  3. Related to 2, you re-compute neighbors at every iteration, which seems time-intensive. Did you consider caching the result, and if so, what drove you away from that?
  4. I see you allocated 6 hours wall time for this job. How many iterations did you run and were you able to achieve convergence?
  5. You do not discount your rewards. What is your thinking around that?
  6. It is known that even a crude heuristic for value initialization accelerates convergence. Did you try anything to that end?

Thank you for your time, and great work! Looking forward to any publication.

JeroenOlieslagers commented 1 month ago

Hey Casper, cool to hear more interest in RGU!

  1. It depends on the device, but for the full 137M states I think around 10min? I save the results in a ~1GB file which means loading it only takes seconds.

  2. I will elaborate more on this below, but basically you don't lose much speed and gain a factor of 2 in memory. Definitely worth the tradeoff in my opinion. It's quite a nice coincidence that RGU just fits within 32 bits, so why not take advantage of it? (See the sketch after this list for the trit-to-binary idea.)

  3. I tried storing the full graph, which is viable in RAM, but the lookup is actually more time-intensive than recomputing the neighbours (and because states don't always have the same number of neighbours, things get a bit ugly in terms of dynamic memory). The way I've implemented the neighbour function results in very few memory allocations, so it scales well. The neighbour function carries most of the weight in the BFS, so it's the function I've spent the most time optimizing. Even though I'm dealing with trits etc., my guess is that a 64-bit representation would not speed things up by more than 10%, while requiring 100% more memory. (These are hypotheses; I have not implemented a 64-bit version.) To be honest, I'm not really memory limited, so it might be worth exploring this! A large enough speed increase could justify the double memory requirement.

  4. I allocated 6h, but it only took 1.5h in the end. The number of iterations varied depending on the subset of the state space I was looking at. There were around 27 subdivisions and I think each required maybe 100 iterations? I should really look this up and get back to you, but if you're not dividing up your state space, it will take more than 1,000 iterations to converge to within 0.001 of the true value, and that will take much longer than 1.5h. To answer your last question: yes, we were able to get convergence with deviations smaller than 0.000001. Once you come near this threshold, deviations shrink quickly, and I don't think the policy changes anymore.

  5. RGU is episodic, so there is no need for a discount factor. In theory a game can go on forever, but the probability of this is zero. (I'm not a mathematician, but there's something here about RGU terminating "almost surely".) The other nice thing about not using a discount factor is that the values of your states become interpretable: you can convert a value directly into how likely the player is to win.

  6. Very good point! The subdivisions of the state space actually do this. In short, you do value iteration (VI) for the states where each player only has one piece, until convergence. Then you look at the states with two pieces and use the values from before as a starting point. This is similar to using a heuristic that gives a higher initial value to states where you have fewer pieces than your opponent. I haven't been able to come up with a heuristic that, for a given number of pieces on the board, can tell which player has the bigger advantage. That's why I initialize the value function at 0.
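
To make the trit-to-binary point concrete, here is a rough Python sketch of the kind of packing I mean. The field layout below is only an illustration, not my actual encoding: it assumes 8 shared squares (one trit each: empty/light/dark), 6 private squares per player (one bit each), and a 0-7 counter per player for pieces not yet on the board.

# Hypothetical 32-bit packing of an RGU state (illustration only).
# The 8 shared squares form a base-3 number: 3^8 = 6561 states fit in
# 13 bits, versus 16 bits at 2 bits per square. Together with 2 x 6
# safe-square bits and 2 x 3 counter bits, that is 31 bits total.

def pack_state(shared_trits, light_safe, dark_safe, light_home, dark_home):
    shared = 0
    for t in reversed(shared_trits):   # the "trits to binary" arithmetic
        shared = shared * 3 + t
    state = shared                     # bits 0-12
    state |= light_safe << 13          # bits 13-18
    state |= dark_safe << 19           # bits 19-24
    state |= light_home << 25          # bits 25-27
    state |= dark_home << 28           # bits 28-30
    return state

def unpack_shared_trits(state):
    shared = state & 0x1FFF            # low 13 bits hold the base-3 pack
    trits = []
    for _ in range(8):
        shared, t = divmod(shared, 3)  # peel off one trit at a time
        trits.append(t)
    return trits                       # same order pack_state consumed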

I apologize if some of my explanations are not clear or are too obvious; do ask if you want me to expand on anything! The paper that will hopefully come out soon should give a lot more detail, but I'm happy to chat more if you're interested!

Hope this is helpful, Jeroen

Casper-Guo commented 1 month ago

1. We are in the same ballpark there. Any speed improvement from my simpler logic is probably negated by the extra memory I/O, since I am also saving the full graph.

2. I did it in 64 bits partly because I could not figure out how to fit it in 32 bits haha. The missing link was the arithmetic to convert trits to binary.

3. Makes sense. I should try both solutions and benchmark them. I thought I wouldn't be memory limited either, since I have 32GB of RAM on my PC; as you might guess, that is not enough with the 2x memory usage. Hopefully I will have access to more resources once I start my masters. I was hoping to use UM's HPC cluster, but my account got deactivated.

4. Could you please explain the 27 subdivisions?

5. Agreed. This is my conclusion as well.

6. I will have to look more closely at your code to understand this part. I have been thinking of ways to run this in parallel/async too.

JeroenOlieslagers commented 1 month ago

To explain the 27 subdivisions: iterating over the full 137M state space is very expensive. In the beginning, the values of states far from the terminal states will barely change; only near the end of VI, as value information propagates, do those states get meaningful updates. By that point, the values of states near the terminal states will already have converged, so iterating over them is pointless. To get around these inefficiencies, we want to iterate over states that are near the terminal states first, until their values have converged. Then we update states that are a little further from the terminal states, until those have converged, and so on. This essentially breaks the expensive 137M loop up into multiple smaller loops, which converge more quickly.

The question is then: how do you determine which states are near the terminal states and which are far? One way is to use the inverse neighbour function to check which states are one move away from each terminal state, but this is not trivial. The other way is to simply count each player's pieces.

We first do VI for all states where each player has only one piece. This is essentially solving RGU but with one piece instead of 7. Then, once we have the value of states with one piece, we do VI for states with two pieces, but we don’t have to update the states where each player has one piece anymore, since those have converged already.

There's one more trick, but you essentially end up with 27 subdivisions. This means you do VI 27 times, but each time over a much smaller state space.
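
In pseudocode, the layering looks roughly like the Python sketch below. It is a simplification: `states_with_piece_counts` and `backup` are stand-ins for the real neighbour/Bellman machinery, and the actual 27-way split uses the extra trick mentioned above rather than plain per-player piece counts.

# Layered value iteration (sketch). Layers closer to terminal states
# converge first and are then frozen; later layers reuse their values.

def layered_vi(states_with_piece_counts, backup, max_pieces=7, tol=1e-6):
    V = {}                                  # converged values accumulate here
    for n_light in range(1, max_pieces + 1):
        for n_dark in range(1, max_pieces + 1):
            layer = states_with_piece_counts(n_light, n_dark)
            for s in layer:
                V.setdefault(s, 0.0)        # zero-init the new layer only
            delta = tol + 1.0
            while delta > tol:              # plain VI restricted to the layer
                delta = 0.0
                for s in layer:
                    v = backup(s, V)        # may read earlier, frozen layers
                    delta = max(delta, abs(v - V[s]))
                    V[s] = v
    return V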

We also do asynchronous VI on multiple threads to further speed things up.
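
For the asynchronous part, think of something in this spirit (again just a sketch; `backup` is a stand-in, and the real implementation is in compiled code, not Python): each worker sweeps its own chunk of a layer and publishes updates immediately, so other workers read whatever values are freshest instead of waiting for a synchronised sweep to finish.

from concurrent.futures import ThreadPoolExecutor

# Asynchronous VI over one layer (sketch): workers update V in place and
# may read values other workers wrote mid-sweep.

def async_vi_layer(layer, V, backup, n_workers=4, tol=1e-6):
    chunks = [layer[i::n_workers] for i in range(n_workers)]

    def sweep(chunk):
        delta = 0.0
        for s in chunk:
            v = backup(s, V)       # neighbour values may come from other workers
            delta = max(delta, abs(v - V[s]))
            V[s] = v               # published immediately, no barrier
        return delta

    with ThreadPoolExecutor(n_workers) as pool:
        while max(pool.map(sweep, chunks)) > tol:
            pass                   # keep sweeping until the layer converges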

Hope this clears some things up.

Sothatsit commented 1 month ago

It is known that even a crude heuristic for value initialization accelerates convergence. Did you try anything to that end?

Originally I was thinking of trying a heuristic based on "advancement". Advancement is the sum of how far each of a player's pieces has moved: a scored piece is worth 15, a piece yet to be played is worth 0, and pieces in play are worth somewhere in between. Summing the value of all pieces gives you that player's advancement.

My idea was that the following might be a good heuristic:

// advancement still needed to win (0 means that player has finished)
light_remaining = max_advancement - light_advancement
dark_remaining = max_advancement - dark_advancement

if (light_remaining > dark_remaining) {
    // light is behind
    heuristic = 100 * (dark_remaining / light_remaining)
} else {
    // light is ahead or level
    heuristic = 100 * (1 - light_remaining / dark_remaining)
}

This results in a percentage based on who is closer to winning. It would be interesting to hear whether it helps, should you try something like this! I never tried it, since other optimisations already made the algorithm fast enough.

JeroenOlieslagers commented 1 month ago

Ooooo I really like this idea! Should be fairly simple to add onto the existing infrastructure. On holiday right now, but I’ll implement it and report back.

Casper-Guo commented 1 month ago

The one I have in place for my own implementation right now is even simpler: just 10 x (white_score - black_score). But your heuristic wouldn't cause any runtime overhead either. Let me know how it turns out.

JeroenOlieslagers commented 1 month ago

I see. I think using the difference in score wouldn't differentiate states if the state space is already subdivided by the number of pieces each player has on the board (since this already encodes each player's score). I'm assuming by score you mean how many pieces each player has moved off the board?

It would also be interesting to look at the value of each state binned by player score to check how valid this heuristic is. We can do the same using @Sothatsit's heuristic, since we've already solved for the value function.

You can use this to basically ask: what simple, interpretable heuristic most accurately predicts the chances of winning? Maybe you could even make it a competition 👀. Define the "loss" of a heuristic to be the average deviation of the heuristic from the true value across all states; the lower the loss, the better:

$$L(h) = \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \left| V(s) - h(s) \right|$$
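
In code, scoring a candidate heuristic against the solved game is then a one-liner. A sketch, assuming V is a dict from each state to its true (converged) value and h is any callable heuristic on the same scale:

# Mean absolute deviation of a heuristic from the true value function.
def heuristic_loss(V, h):
    return sum(abs(v - h(s)) for s, v in V.items()) / len(V)
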
Sothatsit commented 1 month ago

@JeroenOlieslagers I'll do you one better than binning it by score. Here's the final win percentage binned by advancement!

[Figure: final win percentage binned by advancement]

And here's the graph of the absolute difference between the heuristic I suggested (kinda) and the true win percentages:

[Figure: absolute difference between the suggested heuristic and the true win percentages]

I did have to fix an error in the heuristic code I gave above. I needed to multiply by 0.5:

// advancement still needed to win (0 means that player has finished)
light_remaining = max_advancement - light_advancement
dark_remaining = max_advancement - dark_advancement

if (light_remaining > dark_remaining) {
    // light is behind: estimate in (0, 50)
    heuristic = 100 * (0.5 * dark_remaining / light_remaining)
} else {
    // light is ahead or level: estimate in [50, 100]
    heuristic = 100 * (1 - 0.5 * light_remaining / dark_remaining)
}

The differences between the true win percentage and the heuristic are mostly below 10%! The only exceptions are all close to the end of a match where both players have really high advancements. I'd say that's pretty good, although you could almost definitely find a better starting heuristic that accounts for some of the patterns in the errors. Pretty cool though!

JeroenOlieslagers commented 1 month ago

Very nice! It seems the heuristic is almost an admissible one, which is huge news. I really want to try initialising with this and see how much faster things run. I think it will actually give quite a lot of savings.

It seems the winning probability is some two-dimensional sigmoidal function of advancement.

JeroenOlieslagers commented 1 month ago

It seems very smooth near the middle, and then there are some patterns near the edges that might be worth exploring. Either way, this could be an extra section in the paper if you're fine with that @Sothatsit.

Sothatsit commented 1 month ago

It seems very smooth near the middle, and then there are some patterns near the edges that might be worth exploring. Either way, this could be an extra section in the paper if you're fine with that @Sothatsit.

Sure thing, if you think it would fit!