Open arr28 opened 9 years ago
Yes - this appears to be an issue. Even in meta-gaming time, we quickly hit my 4G heap limit. Presumably this is largely down to the addition of per-node RAVE stats. RAVE stats are an int + a double per max branching factor. Max branching factor seen for Skirmish zero-sum is 83. With a node pool capacity of 2,000,000, that's (4 + 8) * 83 * 2,000,000 = 2GB, which is half my total heap!
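For reference, the estimate above can be reproduced with a few lines of Java. This is only a back-of-envelope sketch: the class and method names are made up for illustration, and it ignores array headers and any other per-node state.

```java
final class RaveFootprint {
    /** Bytes used by per-node RAVE stats: one count plus one score per child slot. */
    static long raveBytes(long nodes, int maxBranching, int countBytes, int scoreBytes) {
        return (long) (countBytes + scoreBytes) * maxBranching * nodes;
    }

    public static void main(String[] args) {
        long nodes = 2_000_000L; // node pool capacity quoted in the thread
        int branching = 83;      // max branching factor seen for Skirmish zero-sum
        long bytes = raveBytes(nodes, branching, Integer.BYTES, Double.BYTES);
        System.out.printf("int + double RAVE stats: %,d bytes%n", bytes);
    }
}
```

Swapping `Double.BYTES` for `Float.BYTES` in the call shows what a narrower score type would cost instead.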
I don't understand why this isn't a problem in all games. Looking at my logs, I'm seeing issues in quite a lot of games (skirmish zero-sum, skirmish new, skirmish variant, speed chess, D&B suicide) but not universally. For example, hex seemed okay, despite having a max branching factor of 81. Is the rest of the state considerably smaller in Hex? Does some feature get enabled in skirmish (but not Hex) which causes significant additional occupancy? Or maybe Hex is much slower to simulate, so I never get anywhere near filling the node pool?
Both Hex and Skirmish do approx. 5K rollouts/s on my machine. However, after doing ~150K rollouts, the Hex node pool is ~10% full (i.e. is producing 1 node per rollout) but the Skirmish node pool is full (and I suspect has been pruning for a few seconds).
Why the difference here? How are we using (much) more than 1 node per rollout?
If there is a heuristic change we have to create all the siblings of the expanded node at the point of expansion. Skirmish uses the piece heuristic (whereas Hex doesn't).
I'll convert doubles to floats, which will save 1/3. I'll also investigate switching to a non-allocating transposition table.
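To illustrate what "non-allocating" could mean here, below is a minimal sketch of a fixed-capacity, open-addressing table backed by primitive arrays, so lookups and inserts create no garbage. It is not the actual implementation: the class name, the use of a 64-bit state hash as key, the int node index as value, and the reserved key 0 are all assumptions, and removal is deliberately left out (it would need tombstones or re-insertion).

```java
final class FixedTranspositionTable {
    private static final long EMPTY = 0L; // assumption: 0 is never a valid state hash
    private final long[] keys;            // state hashes
    private final int[] values;           // e.g. indices into a node pool
    private final int mask;

    FixedTranspositionTable(int capacityPow2) {
        if (Integer.bitCount(capacityPow2) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        keys = new long[capacityPow2];
        values = new int[capacityPow2];
        mask = capacityPow2 - 1;
    }

    private int slot(long key) {
        long h = key * 0x9E3779B97F4A7C15L; // cheap multiplicative mixing
        return (int) (h >>> 32) & mask;
    }

    /** Inserts or overwrites; returns false if the table is full. */
    boolean put(long key, int value) {
        if (key == EMPTY) throw new IllegalArgumentException("key 0 is reserved");
        int i = slot(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == EMPTY || keys[i] == key) {
                keys[i] = key;
                values[i] = value;
                return true;
            }
            i = (i + 1) & mask; // linear probing
        }
        return false;
    }

    /** Returns the stored value, or -1 if the key is absent. */
    int get(long key) {
        int i = slot(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == key) return values[i];
            if (keys[i] == EMPTY) return -1;
            i = (i + 1) & mask;
        }
        return -1;
    }
}
```

All storage is allocated once in the constructor, which is the property that matters for GC pressure.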
Okay, so the above fixes have significantly helped, but they certainly don't solve the problem completely.
So the per-node * per-child data that we currently store is...
Object mChildren - reference (to either a TreeEdge or a ForwardDeadReckonLegalMoveInfo)
short mPrimaryChoiceMapping
int mRAVECounts
float mRAVEScores
...which comes to 18 bytes (8 + 2 + 4 + 4, taking the reference as 8 bytes).
For a default node table size (2,000,000) and a typical branching factor for a large game (100), every byte here costs 100MB. So, this currently accounts for ~2GB. Alongside everything else, that's absolutely at my limit (or over it sometimes).
I experimented with modifying the 'modern' heuristic to use the approach that would allow us to eliminate the need to create nodes for all siblings of nodes with a heuristic step, but it resulted in significant degradation in play, so I abandoned it.
I think approach (1) is the way to go for now, but I also offer a (possible) extra approach. [Moved to (3) above.]
Further comment on a bit more reflection:
RAVE stats are all 0s until the first playout takes place through a node. So most leaf nodes in a game with heuristics (which add most children at expansion time) will have no useful stored values. If the RAVE arrays were pooled, we could therefore trivially hold off allocating them until that first playout happens (i.e. allocate on demand in the update).
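A rough sketch of the allocate-on-demand idea follows. Everything in it is hypothetical (the `Pool` and `Node` classes, the method names, the running-mean update); it only shows the shape of the approach: arrays stay null until the first update, reads treat null as all zeros, and freed nodes hand their arrays back for reuse.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

final class LazyRaveStats {
    /** Hypothetical pool of fixed-width stat arrays, reused across nodes. */
    static final class Pool {
        private final ArrayDeque<int[]> counts = new ArrayDeque<>();
        private final ArrayDeque<float[]> scores = new ArrayDeque<>();
        private final int width; // max branching factor

        Pool(int width) { this.width = width; }

        int[] takeCounts() { int[] a = counts.poll(); return a != null ? a : new int[width]; }
        float[] takeScores() { float[] a = scores.poll(); return a != null ? a : new float[width]; }

        void give(int[] c, float[] s) {
            Arrays.fill(c, 0);  // zero before reuse so a recycled array looks fresh
            Arrays.fill(s, 0f);
            counts.push(c);
            scores.push(s);
        }
    }

    static final class Node {
        private int[] raveCounts;   // null until the first playout through this node
        private float[] raveScores;

        void updateRave(Pool pool, int child, float score) {
            if (raveCounts == null) { // first use: allocate on demand
                raveCounts = pool.takeCounts();
                raveScores = pool.takeScores();
            }
            int n = ++raveCounts[child];
            raveScores[child] += (score - raveScores[child]) / n; // running mean
        }

        int raveCount(int child) { return raveCounts == null ? 0 : raveCounts[child]; }
        float raveScore(int child) { return raveScores == null ? 0f : raveScores[child]; }

        void release(Pool pool) { // call when the node is freed or pruned
            if (raveCounts != null) {
                pool.give(raveCounts, raveScores);
                raveCounts = null;
                raveScores = null;
            }
        }
    }
}
```

The saving comes from leaf nodes that are created at expansion time but never visited: they never touch the pool at all.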
Delaying allocation of RAVE stats to the point of first use appears to have made a reasonable difference on occupancy for Skirmish.
Under the old scheme, in the time it took to fill the node pool from 16% -> 84%, heap usage rose from 825MB to 2,861MB. That's a rate of 30MB per 20,000 nodes.
Under the new scheme, in the time it took to fill the node pool from 16% -> 85%, heap usage rose from 532MB -> 1,724MB. That's a rate of 17MB per 20,000 nodes.
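The two rates can be checked with the small calculation below. It assumes the default pool capacity of 2,000,000 nodes mentioned earlier, so that 20,000 nodes is 1% of the pool; the class and method names are illustrative only.

```java
final class HeapGrowthRate {
    /** Heap growth in MB per 20,000 nodes, given pool fill percentages and heap sizes in MB. */
    static double mbPer20kNodes(double startPct, double endPct,
                                double startMb, double endMb, long poolCapacity) {
        double nodesAdded = (endPct - startPct) / 100.0 * poolCapacity;
        return (endMb - startMb) / nodesAdded * 20_000;
    }

    public static void main(String[] args) {
        long capacity = 2_000_000L; // assumed default node pool capacity
        double oldRate = mbPer20kNodes(16, 84, 825, 2861, capacity);
        double newRate = mbPer20kNodes(16, 85, 532, 1724, capacity);
        System.out.printf("old scheme: %.1f MB per 20,000 nodes%n", oldRate);
        System.out.printf("new scheme: %.1f MB per 20,000 nodes%n", newRate);
    }
}
```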
Slightly lower iteration rate with the new code, but well within the noise level.
I was still seeing lots of (short) GC and was very near my PC RAM limit because I had various other things running. Not sure that this is really resolved yet. Will need to check out some longer runs.
My player has played another 4 matches since the new transposition table was added on the 20th June (3 matches on the 29th June and 1 on 1st July). All showed lots of small early GC activity, then dying away to nothing most of the time but with a few massive spikes (3s - 7s). Very similar to the graphs at the top of this page, but GC spikes were less frequent. One of them caused a missed turn.
I was also playing with a larger max. heap size for these games (some 5GB, some 6GB compared to 4GB in the past).
Disappointingly, the large GC spikes were managing to free a significant chunk of heap (~700MB a pop, coming very roughly 4 mins apart). That indicates that we're producing significant quantities of garbage.
Not good enough yet. Still seeing a single spike of 2-3s GC approximately every 7 mins, which did cause a (single) deadline miss for 1 of the last 5 games played.
The GC attempts are reducing the heap by just over 1GB each time.
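One way to tie spikes like these to individual turns, without relying only on external graphs, is to sample the JVM's own GC counters at the start and end of each turn. The sketch below uses the standard `java.lang.management` API; the `GcSnapshot` class itself is hypothetical and not part of the player.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcSnapshot {
    final long collections; // cumulative number of collections across all collectors
    final long millis;      // cumulative approximate time spent collecting, in ms

    private GcSnapshot(long collections, long millis) {
        this.collections = collections;
        this.millis = millis;
    }

    static GcSnapshot take() {
        long c = 0;
        long ms = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            c += Math.max(0, gc.getCollectionCount()); // -1 means "undefined" for this collector
            ms += Math.max(0, gc.getCollectionTime());
        }
        return new GcSnapshot(c, ms);
    }

    /** Difference between this snapshot and an earlier one. */
    GcSnapshot since(GcSnapshot earlier) {
        return new GcSnapshot(collections - earlier.collections, millis - earlier.millis);
    }

    public static void main(String[] args) {
        GcSnapshot before = GcSnapshot.take();
        // ... one turn's worth of search would run here ...
        GcSnapshot delta = GcSnapshot.take().since(before);
        System.out.println("GCs this turn: " + delta.collections + ", time: " + delta.millis + "ms");
    }
}
```

Logging the per-turn delta next to the move deadline would show directly whether a missed turn coincided with a long collection.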
Untagging for IGGPC15 since this doesn't really cause problems on Steve's machine. Putting up to P1 for afterwards.
I now have a machine with more RAM, so lowering priority of this.
In both my recent matches of Skirmish Zero-Sum, I've had significant GC issues, leading to missed deadlines.
Possible causes...
Every minute or so, we're collecting ~400MB. That's quite a lot and probably bears some investigation. However, right from the start, I'm bumping up against the ceiling, so I think (2) and (3) are bigger issues.