SanchoGGP / ggp-base

The General Game Playing Base Package

Massive GC causes deadline misses in skirmish zero-sum #296

Open arr28 opened 9 years ago

arr28 commented 9 years ago

In both my recent matches of Skirmish Zero-Sum, I've had significant GC issues, leading to missed deadlines.

[Graph: GC activity during the first match]

[Graph: GC activity during the second match]

Possible causes...

  1. We're producing garbage at sufficient rate that it's causing us problems.
  2. The state size is such that I'm at my memory limit.
  3. The state isn't too large, but something wasn't cleaned up properly in a previous run.

Every minute or so, we're collecting ~400MB. That's quite a lot and probably bears some investigation. However, right from the start, I'm bumping up against the ceiling, so I think (2) and (3) are bigger issues.

arr28 commented 9 years ago
  1. By far the biggest garbage generator is the transposition table, which causes allocations of Longs (for node refs) and HashMap$Entry. Whilst it does produce a fair bit of garbage, that isn't the big issue here.
  2. Yes - this appears to be an issue. Even in meta-gaming time, we quickly hit my 4G heap limit. Presumably this is largely down to the addition of per-node RAVE stats. RAVE stats are an int + a double per max branching factor. Max branching factor seen for Skirmish zero-sum is 83. With a node pool capacity of 2,000,000, that's (4 + 8) * 83 * 2,000,000 = 2GB which is half my total heap!

    I don't understand why this isn't a problem in all games. Looking at my logs, I'm seeing issues in quite a lot of games (skirmish zero-sum, skirmish new, skirmish variant, speed chess, D&B suicide) but not universally. For example, hex seemed okay, despite having a max branching factor of 81. Is the rest of the state considerably smaller in Hex? Does some feature get enabled in skirmish (but not Hex) which causes significant additional occupancy? Or maybe Hex is much slower to simulate, so I never get anywhere near filling the node pool?

  3. No evidence of this. Given that I can fill my heap immediately when starting from scratch, I'm not going to consider this further.
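The ~2GB figure in (2) is easy to check with a back-of-envelope calculation (a sketch only; the class and method names are illustrative, not from Sancho's codebase):

```java
// Back-of-envelope check of the RAVE memory estimate: one int count plus
// one double score per child slot, slots sized to the max branching
// factor, one block per node in the pool.
public class RaveMemoryEstimate {
    static long raveBytes(int bytesPerEntry, int maxBranching, long poolCapacity) {
        return (long) bytesPerEntry * maxBranching * poolCapacity;
    }

    public static void main(String[] args) {
        // Skirmish zero-sum: (int + double) = 12 bytes, branching 83, 2M nodes.
        long bytes = raveBytes(Integer.BYTES + Double.BYTES, 83, 2_000_000L);
        System.out.printf("%.2f GiB%n", bytes / (1024.0 * 1024 * 1024)); // ~1.86 GiB, i.e. the ~2GB above
    }
}
```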
arr28 commented 9 years ago

Both Hex and Skirmish do approx. 5K rollouts/s on my machine. However, after doing ~150K rollouts, the Hex node pool is ~10% full (i.e. it is producing roughly 1 node per rollout) but the Skirmish node pool is full (and I suspect has been pruning for a few seconds).

Why the difference here? How are we using (much) more than 1 node per rollout?

SteveDraper commented 9 years ago

If there is a heuristic change we have to create all the siblings of the expanded node at the point of expansion. Skirmish uses the piece heuristic (whereas Hex doesn't).

arr28 commented 9 years ago

I'll convert doubles to floats, which will save 1/3. I'll also investigate switching to a non-allocating transposition table.
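The one-third saving follows directly from the primitive sizes (the visit count stays an int; only the score shrinks):

```java
// Per-child RAVE entry: int count + double score (12 bytes) becomes
// int count + float score (8 bytes), a one-third saving.
public class FloatConversionSaving {
    public static void main(String[] args) {
        int before = Integer.BYTES + Double.BYTES; // 12 bytes per child
        int after  = Integer.BYTES + Float.BYTES;  // 8 bytes per child
        System.out.println("Saved " + (before - after) * 100 / before + "%"); // Saved 33%
    }
}
```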

arr28 commented 9 years ago

Okay, so the above fixes have significantly helped, but they certainly don't solve the problem completely.

Quantifying the problem

So the per-node * per-child data that we currently store is...

...which comes to 18 bytes.

For the default node table size (2,000,000) and a typical branching factor for a large game (100), that's 200,000,000 child slots, so every byte here costs 200MB. At 18 bytes, this currently accounts for ~3.6GB. Alongside everything else, that's absolutely at my limit (or over it sometimes).

Possible approaches

  1. I note that, in Skirmish (zero-sum), the max. branching factor ever observed is 83, but the average branching factor is 34. We allocate on the basis of 83 children, so for this game at least, a better allocation scheme could halve the storage cost. (The reason we allocate for the max. branching factor is so that we never have to reallocate, which is inefficient and causes GC problems. However, that approach may no longer be viable.) One possibility is to keep pools of Object/short/int/float arrays in a range of sizes (say, every multiple of 8) and attach one of a suitable size to each node. There's still some waste with this approach, both from rounding up to the nearest 8 and from unused arrays sitting in the pool.
  2. Steve - you mention above that skirmish is worse than Hex because, due to heuristics, we create all the children of expanded nodes, leading to lots of useless leaf nodes that have never been used. I also think you said on our last call that your recent work in heuristic application might do away with that requirement. Is that still the case?
  3. Don't create RAVE arrays for unexpanded nodes. (Not creating them would effectively require the arrays to be pooled, since we currently re-use the allocated arrays.) Instead, on first expansion, copy the RAVE values of the parent, which will be a very good approximation of the same statistics. This may slightly degrade RAVE effectiveness, so it would have to be tried and evaluated, but it should save a lot of memory. However, since it requires the arrays to be pooled first, approach (1) still seems like the place to start.
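A minimal sketch of the size-bucketed pool in approach (1) might look like this (illustrative names only, with a single float[] pool standing in for the Object/short/int/float variants):

```java
import java.util.ArrayDeque;
import java.util.Arrays;

/**
 * Minimal sketch of approach (1): pool float[] buffers in size buckets
 * (multiples of 8) so per-node child arrays can be sized to the actual
 * branching factor instead of the game-wide maximum. Not Sancho's
 * actual allocator.
 */
public class BucketedArrayPool {
    private static final int BUCKET = 8;
    private final ArrayDeque<float[]>[] buckets;

    @SuppressWarnings("unchecked")
    public BucketedArrayPool(int maxArraySize) {
        buckets = new ArrayDeque[(maxArraySize + BUCKET - 1) / BUCKET + 1];
        for (int i = 0; i < buckets.length; i++) {
            buckets[i] = new ArrayDeque<>();
        }
    }

    /** Borrow an array of at least {@code size} elements, rounded up to a multiple of 8. */
    public float[] borrow(int size) {
        int bucket = (size + BUCKET - 1) / BUCKET;
        float[] array = buckets[bucket].poll();
        return (array != null) ? array : new float[bucket * BUCKET];
    }

    /** Return an array for reuse; the caller must not touch it afterwards. */
    public void release(float[] array) {
        Arrays.fill(array, 0f); // reset stats for the next user
        buckets[array.length / BUCKET].offer(array);
    }
}
```

With an average branching factor of 34 against a max of 83, most nodes would borrow a 40-element array instead of an 88-element one, which is where the roughly-halved storage cost comes from.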
SteveDraper commented 9 years ago

I experimented with modifying the 'modern' heuristic to use the approach that would allow us to eliminate the need to create nodes for all siblings of nodes with a heuristic step, but it resulted in significant degradation in play, so I abandoned it.

I think approach (1) is the way to go for now, but I also offer a (possible) extra approach. [Moved to (3) above.]

SteveDraper commented 9 years ago

Further comment on a bit more reflection:

RAVE stats are all 0s until the first playout passes through a node, so most leaf nodes in a game with heuristics (which adds most children at expansion time) will hold no useful stored values. If the RAVE arrays were pooled, we could therefore trivially hold off allocating them until that first playout happens (i.e. allocate on demand in the update).
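That deferred allocation could be sketched as follows (illustrative names, not Sancho's actual classes; a real version would borrow the arrays from a pool rather than calling new):

```java
/**
 * Sketch of deferring RAVE allocation until the first playout passes
 * through a node: leaf nodes that are never played through pay no
 * RAVE-array cost at all.
 */
public class LazyRaveNode {
    private final int numChildren;
    private int[] raveCounts;   // stays null for never-played-through leaves
    private float[] raveScores;

    public LazyRaveNode(int numChildren) {
        this.numChildren = numChildren;
    }

    public boolean hasRaveStats() {
        return raveCounts != null;
    }

    /** Called during back-propagation; allocates the arrays on first use. */
    public void updateRave(int childIndex, float score) {
        if (raveCounts == null) {
            raveCounts = new int[numChildren];
            raveScores = new float[numChildren];
        }
        raveCounts[childIndex]++;
        raveScores[childIndex] += score;
    }
}
```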

arr28 commented 9 years ago

Delaying allocation of RAVE stats to the point of first use appears to have made a reasonable difference on occupancy for Skirmish.

Under the old scheme, in the time it took to fill the node pool from 16% -> 84%, heap usage rose from 825MB to 2,861MB. That's a rate of 30MB per 20,000 nodes.

Under the new scheme, in the time it took to fill the node pool from 16% -> 85%, heap usage rose from 532MB -> 1,724MB. That's a rate of 17MB per 20,000 nodes.
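For reference, both rates follow directly from the pool size (2,000,000 nodes) and the observed heap deltas:

```java
// Re-deriving the MB-per-20,000-nodes figures from the pool size
// (2,000,000 nodes) and the observed heap growth.
public class HeapGrowthRate {
    static long mbPer20kNodes(double startFill, double endFill, long startMB, long endMB) {
        double nodesFilled = (endFill - startFill) * 2_000_000;
        return Math.round((endMB - startMB) / nodesFilled * 20_000);
    }

    public static void main(String[] args) {
        System.out.println(mbPer20kNodes(0.16, 0.84, 825, 2_861)); // old scheme: 30
        System.out.println(mbPer20kNodes(0.16, 0.85, 532, 1_724)); // new scheme: 17
    }
}
```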

Slightly lower iteration rate with the new code, but well within the noise level.

I was still seeing lots of (short) GC and was very near my PC RAM limit because I had various other things running. Not sure that this is really resolved yet. Will need to check out some longer runs.

arr28 commented 9 years ago

My player has played another 4 matches since the new transposition table was added on 20th June (3 matches on 29th June and 1 on 1st July). All showed lots of small early GC activity, dying away to nothing most of the time, but with a few massive spikes (3s - 7s). Very similar to the graphs at the top of this page, though the GC spikes were less frequent. One of them caused a missed turn.

I was also playing with a larger max. heap size for these games (some 5GB, some 6GB compared to 4GB in the past).

Disappointingly, the large GC spikes were managing to free a significant chunk of heap (~700MB a pop, coming very roughly 4 mins apart). That indicates that we're producing significant quantities of garbage.

arr28 commented 9 years ago

Not good enough yet. Still seeing a single spike of 2-3s GC approximately every 7 mins, which did cause a (single) deadline miss for 1 of the last 5 games played.

The GC attempts are reducing the heap by just over 1GB each time.

arr28 commented 9 years ago

Untagging for IGGPC15 since this doesn't really cause problems on Steve's machine. Putting up to P1 for afterwards.

arr28 commented 5 years ago

I now have a machine with more RAM, so lowering priority of this.