SanchoGGP / ggp-base

The General Game Playing Base Package

Out of memory #84

Closed arr28 closed 10 years ago

arr28 commented 10 years ago

Some time in the last few days, I've started getting OutOfMemoryErrors. Eventually (after ~1 hour of thrashing the disk), I managed to get a heap dump loaded up in MAT. Here's what it showed...

image

Nothing else is >100MB. (The next largest item is 60MB.) These items account for ~2.7GB - which is 94% of my available heap (3GB).

Cracking out sizes of objects...

22,414,100 TreeEdge   x  40 bytes = 897M (30%)
22,414,194 FDRLMI[]   x  24 bytes = 538M (18%)
11,095,026 long[]     x  32 bytes = 355M (12%)
11,094,896 FDRIMS     x  32 bytes = 355M (12%)
11,094,975 BitSet     x  24 bytes = 266M (9%)
 1,278,188 TreeNode   x 112 bytes = 143M (5%)
 3,834,663 double[]   x  35 bytes = 133M (4%)
 1,054,375 TreeEdge[] x 104 bytes = 110M (4%)

I'm going to start chopping the check-in history to find out where this happened but my prime suspect is 0e7f032823b3eba0fb8d6906f82ba995f80c6fde (allocating nodes on demand). It added a ForwardDeadReckonInternalMachineState to a TreeEdge. That's bad because there are lots of TreeEdges (22M) of which half (11M) seem to have an associated ForwardDeadReckonInternalMachineState. TreeEdges are unlimited, whereas nodes are pegged to 2M (and I only got to 1.3M before everything fell over in a big heap (no pun intended)). I used to be able to get the full 2M nodes without the sky falling in.

Also, I haven't examined the code, but presumably we're creating TreeEdges for all moves, even before we create the child TreeNodes. So, previously, each TreeNode had 1 (or more - this is a DAG) inbound TreeEdges meaning that there were no more TreeEdges than nodes. (When we binned a TreeNode, did we bin the TreeEdge(s) leading to it?) Whereas now, there can be way more TreeEdges than TreeNodes because we pre-create TreeEdges before potentially expanding them into TreeNodes?

Whatever the cause, I'm completely dead in the water at the moment, even on my more powerful machine.

arr28 commented 10 years ago

Here's the same information in the (working) revision before my suspected one.

image

2M TreeNodes with 2.8M TreeEdges. So whilst my theory about TreeEdges <= TreeNodes isn't quite right, there is a much more sensible balance of TreeEdges to TreeNodes. Of the other relevant objects, there are 0.9M each of FDRIMS, BitSet & long[] (c.f. 11M above).

arr28 commented 10 years ago

As suspected, the problem is in 0e7f032823b3eba0fb8d6906f82ba995f80c6fde. My "analysis" above is all highly speculative. I haven't actually looked at the code yet (and won't now get to it until tomorrow at the earliest).

Steve, are you able to comment?

SteveDraper commented 10 years ago

Before, we used to create a TreeEdge AND a TreeNode, with the state being in the TreeNode. Now we create only the TreeEdge (when the node is created later, its state just references the same object). HOWEVER, as you say, TreeEdges were limited by the fixed pool size of TreeNodes before, whereas now they are not. So even though we use strictly LESS memory for a given number of expansions, we can wind up doing more expansions before we trim, which must be what is causing your problem. We're probably going to have to monitor the edge count too, and trim based on that getting too large. The only other alternative I can see is to not store the state at all and recalculate it every time on select, which would slow us down a lot.

I'll think on it, but any ideas would be welcome...

SteveDraper commented 10 years ago

I agree with your earlier analysis: edges used to be limited by the node pool, but are now not (well, to within a factor of the branching factor anyway). I think the answer is to track the number of edges in circulation, and to trim when it exceeds a certain amount. Trimming nodes will also trim edges, but rather than trimming sparsely I think we need to modify the trimming algorithm to identify entire sub-trees (rather than individual leaf nodes) to trim, and always trim so as to leave the parent of the trimmed tree as a (once again) totally unexpanded node. That approach should have several advantages:

1) it reduces edge count effectively
2) it doesn't leave partially trimmed nodes scattered around, which simplifies out an entire case

I wanted to address (2) anyway, because I suspect it is behind our poor play in some games. From the graphs, I have noticed that these correlate quite well with games in which we fill the node pool (and hence trim a lot). Partially trimmed nodes prevent UCB selection through them (selection from such a node is just a random choice between the untrimmed children).

I can see if I can get anywhere with this tomorrow if you don't get there first. I spent today reworking the fast propagator as I mentioned at the weekend, and that seems to have been successful (only 5-10% faster, but it all helps), so (subject to any issues that show up overnight on Tiltyard) that should leave me free to look at something else (such as edge trimming).

arr28 commented 10 years ago

I'm possibly missing something here, but consider this.

The old way of doing things was...

image

(Selected node in blue, expanded node that we're going to depth charge from in red, uselessly expanded nodes in white.)

The new way of doing things is...

image

So we've lost the uselessly expanded nodes, but we still have a lot of useless edges which (a) are no longer capped at 2M and (b) have been given a machine state which adds considerably to their size.

Is there any good reason why we can't move to this...?

image

The array of children in the blue node simply has null values for the unexpanded children. A further selection through the blue node would expand one of those children.

image

(Previously expanded node in yellow.)

SteveDraper commented 10 years ago

That would roughly double selection processing time. The reason is that everywhere we select to will end in a leaf, which will require us to determine the state of the newly selected leaf node during selection (previously it would have been determined when the edge was created when the state machine was already in the correct state in order to generate the legal moves). Basically it means propagating each node's state twice (once on expansion to determine legal moves, and then again on selection to determine child state). Currently the edges act as a cache to avoid this.

However, I had an idea last night that would allow us to trade off performance for memory much more smoothly (and with much less performance loss). Suppose we DO allocate the edges, but instead of keeping a state ref in the edge (which is the really big cost), we keep the allocated states in a large (64K-ish) ring buffer and store a sequence number in the edge. That way, if the edge is a recent one (within the ring buffer size in expansion history), then on select we'll be able to retrieve the state from the ring buffer and suffer no performance penalty. In the (much rarer, since MCTS will cause the majority of expansions to occur in 'hot' areas) case where the selected edge has a sequence number that is older than the ring buffer length relative to the current allocation sequence, we can reconstruct the state from the state machine. This should be pretty easy to do.

Steve
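
A minimal sketch of that ring-buffer scheme, for illustration only: the buffer class and its methods below are hypothetical (only ForwardDeadReckonInternalMachineState is a real class), and it is deliberately single-threaded since tree expansion happens on one thread.

  // Hypothetical sketch only - not the actual Sancho implementation.
  class StateRingBuffer
  {
    private final ForwardDeadReckonInternalMachineState[] buffer;
    private long nextSeq = 0;

    StateRingBuffer(int size)
    {
      buffer = new ForwardDeadReckonInternalMachineState[size];   // e.g. ~64K entries
    }

    /** Record a newly computed child state; the returned sequence number is stored in the edge. */
    long add(ForwardDeadReckonInternalMachineState state)
    {
      buffer[(int)(nextSeq % buffer.length)] = state;
      return nextSeq++;
    }

    /** Return the cached state for an edge, or null if it has aged out and must be recomputed. */
    ForwardDeadReckonInternalMachineState lookup(long seq)
    {
      return (nextSeq - seq <= buffer.length) ? buffer[(int)(seq % buffer.length)] : null;
    }
  }

On select, a lookup miss would fall back to reconstructing the child state through the state machine, as described above.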

arr28 commented 10 years ago

That would roughly double selection processing time.

Really? Presumably on node expansion currently, the operations are...

  • Get the legals in the selected node, S (which we calculated at the time that node was expanded)
  • For each legal move L
    • Load the state for S
    • Get the next state as a result of doing L and store it in the edge (+ also in the child node if this is the one node we're choosing to expand through right now)

So for each legal, we propagate twice - once to load S and once to apply L and get the next state. All I'm suggesting is that we don't do all of them up-front. Why is it more expensive to load S when we come to it again in the future than it is to re-load it as part of creating all the TreeEdges?

SteveDraper commented 10 years ago

No. It goes:

1) Propagate state S
2) Query legals (no extra propagation)
3) For each legal, query the next state (very minimal extra propagation, because it is just the move prop that changes, NOT the base props - this costs much less than imposing the base props in almost all games) and store it in the edge

There is only one propagation of base props to generate the legals and all the successor states. If we don't cache the states in the edges (or similar - see other emails) then we have to re-propagate the base props to calculate each child state. That is significantly more expensive. Note however, that if we do not know a priori (via depth from initial) that the successor state cannot be terminal, we also have to propagate the successor states at expansion time to determine terminality, and that cost applies in either scheme.
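
To make that concrete, expansion currently amounts to roughly the following sketch. setBasePropositionsFromState is a real method, but the other names and signatures here are illustrative, not the actual TreeNode.expand code.

  // Sketch with illustrative names - not the real expansion code.
  void expand(TreeNode node)
  {
    // 1) The single base-prop propagation for this expansion.
    stateMachine.setBasePropositionsFromState(node.state);

    // 2) Querying the legals needs no further propagation.
    for (ForwardDeadReckonLegalMoveInfo legal : stateMachine.getLegalMoves())
    {
      TreeEdge edge = new TreeEdge(legal);

      // 3) The successor differs only in the move prop, so it can be read off the
      //    transitions cheaply and cached in the edge; selection then never re-propagates.
      edge.childState = stateMachine.getNextState(legal);
      node.addEdge(edge);
    }
  }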

The argument can be made (at least for games where mostly we do not have a priori knowledge of non-terminality) that successor state terminality determination dominates over successor state determination itself, and in THAT case having to re-determine the successor state on selection is a small overhead, since on any selection we'll be doing an expansion as well. For that reason I don't plan to bother with the ring buffer (see earlier email) until I've seen how much impact having to do the extra propagation to determine state on selection has. My expectation is that the impact will actually be low in most games (where we have to calculate terminality of successors anyway), but high in games where we don't (Dot&Boxes, Reversi, ...). Adding the ring-buffer caching is therefore a secondary concern since it would impact only a subset of games.

Steve

arr28 commented 10 years ago

Ah, I think I understand now. In step (3), you don't propagate the new child state, you just read it off the transitions? Therefore, there's nothing to reset to get back to the original state (in order to do the next legal).

FWIW, I quite like the ring buffer idea, but agree that it's a nice extra.

SteveDraper commented 10 years ago

Ok. I'll aim to at least have the basic no-states-in-edges (without the ring buffer) done today. Suggest you take the re-use state in node on reset/re-use optimization meanwhile, since you're (presumably) held up from ongoing ELB work until this is resolved.

Steve

arr28 commented 10 years ago

Running with the latest develop branch code.

This was a game of D&B with 15s deadlines. Although I'm not missing the deadlines now (nor suffering OutOfMemoryErrors), I am still having considerable occupancy and GC problems.

image

Some observations...

  • At the end of turn 9, I have completely filled my heap.
  • Until that point, GC has been excellent (<100ms).
  • Memory usage drops off steadily from turn 9.
  • Until the end game, I never manage to use more than 33% of the node pool, and usually peak at more like 20%.
  • GC gets progressively worse throughout the game.
  • As GC gets worse, the expansion rate becomes increasingly erratic.

I have a hunch that expansion rate (before it becomes erratic) is considerably better than before. +50%??

I'm going to try two things.

  • Firstly, I've always run with a target GC pause time of 800ms. I now gather the default is 200ms, so I'm going to try with that.
  • Secondly, I clearly need to run a game with a longer clock. Tiltyard can easily give a minute by the time it's been a bit late getting back, and I need to know that we won't suffer a disaster.

SteveDraper commented 10 years ago

I'm testing a version that no longer allocates Edges at all until they are selected through. Hope to have that ready tonight (but I'm not sure - it was a rather fiddly change).

arr28 commented 10 years ago

Using the default target (i.e. not specifying a target), I get smoother performance later on (mostly peaks of 500ms with occasional peaks of 800ms, rather than frequent peaks of 800ms with occasional peaks pushing on for a second, as seen in the last graph). However, I suffered two massive collections in turns 8 and 13 which were lucky not to cause a missed deadline.

image

arr28 commented 10 years ago

Increasing the deadline to 1 minute (keeping the 200ms GC pause target) has pushed me over the edge. I missed the deadline at the end of turn 2 (and, by the looks of it, was fortunate not to miss it at the end of 7, 8 & 9 too).

image

I managed to fill my node table once but, due to GC, have typically only been getting to 75% full subsequently. Also, I'm repeatedly crashing into the ceiling in terms of memory allocation (which is what's triggering the big collections).

I'm looking forward to the improvements that you're adding now.

Once I've picked up your changes (and you've finished working in the area), I'm planning to add pooling to any frequently allocated objects that we create and discard. I'll probably create unbounded pools for them, because I think they're now limited by the number of nodes. I just want to be recycling them rather than throwing them away, buying new ones & paying the price in GC. I can see my blog post title already - "Sancho goes Green"!
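
For reference, the pooling pattern being described is small. This is a generic sketch, not the pool classes Sancho actually ends up with:

  import java.util.ArrayDeque;
  import java.util.function.Supplier;

  // Minimal unbounded recycling pool: released objects are reused instead of becoming
  // garbage, so steady-state allocation (and hence GC pressure) drops towards zero.
  class SimplePool<T>
  {
    private final ArrayDeque<T> free = new ArrayDeque<>();
    private final Supplier<T>   factory;

    SimplePool(Supplier<T> factory) { this.factory = factory; }

    T allocate()
    {
      T item = free.poll();
      return (item != null) ? item : factory.get();
    }

    void release(T item)
    {
      free.push(item);   // caller must reset the object's state before it is reused
    }
  }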

SteveDraper commented 10 years ago

Ok - I have some ideas for more efficiently pooling / laying out some of the MCTS tree structure stuff (edges etc) which it would be worth discussing tomorrow. Would be nice to get a cache-friendly layout of the frequent access patterns at the same time

arr28 commented 10 years ago

It's probably instructive to see the end of that last game (which has just finished). I saw that same steady decline in heap usage again, which meant that I didn't hit the ceiling and trigger another long GC pause. Peaks were ~1 second.

It spent considerable periods with a full node table without any apparent negative side effects.

image

SteveDraper commented 10 years ago

Sounds like we're making progress :-)

arr28 commented 10 years ago

And with your latest changes, this is looking much better on the occupancy front. Here's a 15s game of D&B (200ms GC target).

image

Now I just need to add pooling of the remaining objects to see if I can deal with the GC.

arr28 commented 10 years ago

We're now pooling TreeNodes and TreeEdges, but what else do we churn through? Here's a snapshot of everything that has been garbage collected after running a game of C4 with 60s deadlines for a few minutes.

This is all objects where we've collected at least 100,000 instances. (The next item on the list had 23K instances collected).

image

Dealing with TreeNodeRefs was already on my list (because they're relatively easy to deal with). However, it looks like by far the biggest nightmare here (by size and by instance count) is the iterator in FDRIMS.

arr28 commented 10 years ago

Here's the culprit...

  @Override
  public Iterator<ForwardDeadReckonPropositionInfo> iterator()
  {
    return new InternalMachineStateIterator(this);
  }

Steve - I have a recollection that you did something to re-use iterators in some other case. Might the same techniques apply here? I'll fix, but if you can easily give me a nudge in the right direction, please do.

SteveDraper commented 10 years ago

Yes. The key observation (for the others, which were ForwardDeadReckonLegalMoveSetIterator) was that the parent collection was only ever used by one thread, and the call pattern was always such that there were no nested enumerations of the same collection. Consequently the parent collection could allocate a single iterator and reuse it (calling some form of reset() on it each time it passes it out). However, this doesn't apply here, because machine states are passed to rollouts as starting points to roll out from, and hence are used from multiple threads. MOST usages will be within a single thread, but not all, so unless we can figure out a way to restrict the case (perhaps with some explicit method to retrieve a non-thread-safe (singleton) iterator or something) we cannot use the same technique. The problem with iterators in Java is that the usage pattern does not tell you when the caller is done with it.

Most of the enumerations that use these iterators are in one place - in ForwardDeadReckonPropNetStateMachine.setBasePropositionsFromState(). All of THOSE uses ARE guaranteed to be on the same thread for any given state machine, so if we could create an iterator 'uncoupled' from its parent container, then 'attach' it, we could create a singleton in the state machine, attach it to the InternalMachineState instance and then enumerate (the attach would set it to the first element). That way the major usage would not require allocation.
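
For illustration, the attach-and-reuse pattern over a plain java.util.BitSet looks like this. It is a self-contained sketch; the real iterator would yield ForwardDeadReckonPropositionInfo objects rather than bit indices.

  import java.util.BitSet;
  import java.util.Iterator;
  import java.util.NoSuchElementException;

  // Sketch: one instance per state machine, re-attached to whichever state is being
  // enumerated.  Deliberately NOT thread-safe - the owning state machine is single-threaded.
  class ReusableBitSetIterator implements Iterator<Integer>
  {
    private BitSet bits;
    private int    next = -1;

    /** Point this iterator at a new parent and rewind to the first set bit. */
    void attach(BitSet newBits)
    {
      bits = newBits;
      next = bits.nextSetBit(0);
    }

    @Override public boolean hasNext() { return next >= 0; }

    @Override public Integer next()
    {
      if (next < 0) throw new NoSuchElementException();
      int result = next;
      next = bits.nextSetBit(result + 1);
      return result;
    }
  }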

arr28 commented 10 years ago

Okay, so having dealt with the Iterator (using a single per-thread Iterator conveniently squirrelled away in a ThreadLocal), the big swingers are...

image

The next biggest (LinkedList$Node) has 4.7M.

Note that, due to a limitation of the tool, the above does not include array allocation.

SteveDraper commented 10 years ago

The intermediary list of moves in TreeNode.expand() which we talked about at the weekend will be a big one. That's trivial to remove now: getLegalMoves() returns a collection which we happen to be treating just as an iterable, and if you start treating it as a collection (so that you can call size()) you'll be able to use it directly, I think. The call to remove() is not needed any more, and the loop it's in can be a simple iteration over the collection instead of the explicit indexing.

I can do this easily I think if you prefer me to tackle that one...?
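
Illustratively (generic types rather than the real move classes), the change is from copying into a working list to iterating the returned collection directly:

  import java.util.Collection;
  import java.util.LinkedList;
  import java.util.List;

  // Old shape (sketch): an intermediary list plus remove(), allocating a list node per element.
  static <T> void expandWithCopy(Collection<T> legals)
  {
    List<T> working = new LinkedList<>(legals);
    while (!working.isEmpty())
    {
      T move = working.remove(0);
      createEdgeFor(move);
    }
  }

  // New shape (sketch): treat the result as a Collection - size() is available up front
  // and a plain for-each walks it with no copy and no remove().
  static <T> void expandDirect(Collection<T> legals)
  {
    int numChildren = legals.size();   // known up front, e.g. for sizing the children array
    for (T move : legals)
    {
      createEdgeFor(move);
    }
  }

  static <T> void createEdgeFor(T move) { /* allocate/initialise the edge for this move */ }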

SteveDraper commented 10 years ago

BTW - I don't think you need the overhead of the ThreadLocal. Any given instance of the state machine is guaranteed to only be used on one thread at once.

arr28 commented 10 years ago

Here we are again, this time with array allocations too.

image

So it looks like we're allocating 2 long[] per FDRIMS. I think they'll be accounted for by the BitSet in the FDRIMS (although some of the maths isn't quite right).

Also, there are approximately as many boolean[] and double[] allocations as there are FDRIMS (although I'm not expecting them to be directly related).

arr28 commented 10 years ago

Oh - I think the reason we're allocating 2 long[] per FDRIMS is because we don't tell the BitSet what size it should be. As a result, it allocates an array of size 1 and then, when we set the first proposition with index >= 64, it has to reallocate the array.

Also, the graph above shows a relatively rosy picture of FDRIMS allocation. In any game with the piece heuristic, it would be much worse (because we create temporary FDRIMS to do some calculations).

The next commit should fix both of those.
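
For reference, the difference is just whether the size is passed to the constructor (plain java.util.BitSet, illustrative numbers):

  import java.util.BitSet;

  class BitSetSizingDemo
  {
    public static void main(String[] args)
    {
      // Default constructor: starts with a single 64-bit word, so the first set() of a
      // proposition with index >= 64 forces the internal long[] to be reallocated.
      BitSet grown = new BitSet();
      grown.set(200);

      // Pre-sized to the number of base propositions (illustrative figure): one long[] of
      // the right size up front, and no reallocation when propositions are set.
      int numBaseProps = 200;
      BitSet preSized = new BitSet(numBaseProps);
      preSized.set(150);
    }
  }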

arr28 commented 10 years ago

Okay, the "standard profile" for measurement runs is C4 with 30/60 deadlines, recording for 1 minute from approximately the middle of the second turn.

Here's one from before the TNR fix.

image

arr28 commented 10 years ago

And here's one on develop. Note that, although I've aimed to get some comparable absolute figures, the profiler so massively affects performance (we get ~1/10th of the rollout rate when it's running) that the figures aren't really directly comparable. The real benefit is in the difference between numbers of objects.

image

So the TNR has been completely removed from GC (obviously - the class doesn't even exist any more) and there are now the same number of long[] collections as there are OpenBitSet (not BitSet) and FDRIMS collections. I also hope that some of the reduction in machine-state allocations is because we no longer create temporary ones but, as per the comment above, I'm not confident that the numbers are really directly comparable.

arr28 commented 10 years ago

In C4, I think the key allocation site of FDRIMS is...

...and a possible solution is for mChildStatesBuffer (in MCTSTree) to be a pre-allocated array of re-usable FDRIMS. TreeNode.expand (the only user of mChildStatesBuffer) would reset the state and then pass it to a new version of FDRPSM.getNextState taking an extra FDRIMS as the object to create the new state into. This would call through to a new version of FDRPSM.getInternalStateFromBase, which again would take an FDRIMS as a parameter and re-use it rather than reallocating. Finally, expand() would copy the resultant state into an FDRIMS that lives permanently in the new child TreeNode (rather than changing the variable to refer to a newly allocated object).
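
In outline, the shape of that change is below. The clear/copy helpers and the 3-argument getNextState are hypothetical here - they sketch the intent, not existing signatures.

  // Sketch only - hypothetical signatures.
  void expandChild(int i, TreeNode parent, ForwardDeadReckonLegalMoveInfo move)
  {
    // Reset a pre-allocated, re-usable state from MCTSTree's buffer rather than allocating.
    ForwardDeadReckonInternalMachineState buffer = mChildStatesBuffer[i];
    buffer.clear();

    // Compute the child state directly into the buffer - no new FDRIMS.
    stateMachine.getNextState(parent.state, move, buffer);

    if (shouldCreateChildNode(i))
    {
      // The permanent copy lives in the (pooled) child TreeNode, so expansion still
      // allocates nothing.
      childNode(i).state.copy(buffer);
    }
  }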

Beware that the key allocation site may vary in other games. The following factors may well be relevant.

SteveDraper commented 10 years ago

Sounds like a good plan. The only thing to potentially beware of when implementing is that the last state applied is stored (a ref to it, in lastInternalSetState) by setBasePropositionsFromState(). There is already code there (conditionally invoked dependent on the 'isolate' parameter) to make this an actual copy rather than just a re-use by reference, but it's something to watch out for.

arr28 commented 10 years ago

By doing the plan above, we now have significantly fewer allocations of FDRIMS, but there are still a lot of them. FDRIMS used to be collected at 130% of the rate at which double[]s were collected. Now it's just 60%. I haven't really managed to get a grip of JProfiler to figure out where the remaining key allocation sites are.

image

arr28 commented 10 years ago

Okay, so I think the remaining allocation sites for FDRIMS (in C4) are...

RolloutProcessor

GameSearcher

SteveDraper commented 10 years ago

Can you tell what the ratio of FDRIMS to node allocations is (the latter are pooled so obviously not from GC stats)?

The main source is likely to be the construction in getNextState(ForwardDeadReckonInternalMachineState, Factor, ForwardDeadReckonLegalMoveInfo[]) - obviously there will be the state in each node.

Apart from that what game are you using? If it uses greedy rollouts then the allocation of decisionState.state during playouts at each state will likely dominate (and can probably be engineered away - the greedy rollout process is not terribly efficient)

Steve

arr28 commented 10 years ago

In non-greedy-rollout games (e.g. 9B-TTT), FDRIMS is now way down the list of things we care about.

image

The remaining allocations are done as per the GameSearcher stack above.

Next on the hit list must be all those double[]s and TreePathElements. Mustn't forget to return to the greedy rollout scenarios though.

SteveDraper commented 10 years ago

I can take a look at that this weekend if you like, since I'm probably more familiar with that code...?

arr28 commented 10 years ago

Steve,

Not sure exactly which of the above you're offering to look at. If it's the greedy rollouts, then feel free - but do familiarise yourself with what I've done first.

I'm about to check in an enhancement that should remove FDRIMS allocations from the non-greedy-rollout case (createChildNodeForEdge). I'm definitely planning to deal with TreePathElements myself (I know how I'm going to do it) and plan to at least look at where all the double[]s are coming from.

Andrew

SteveDraper commented 10 years ago

Yes, I meant the greedy rollouts, and I'll make sure I understand what you've done already first ;-)

SteveDraper commented 10 years ago

Fixed bug in machine state reuse that was causing failure when a pre-allocated instance was used in creating new child nodes

arr28 commented 10 years ago

Excellent - in 9BTTT there are now no FDRIMS collections.

image

arr28 commented 10 years ago

We no longer collect any TreePaths or TreePathElements. Also, there are no ArrayLists and very few ArrayList iterators.

image

The real big one remaining is all those double[]s which I'll look at next.

General allocation hotspots (all objects) are TreeNode.updateStats and TreeNode.expand and to a lesser extent, CombinedHeuristic.getHeuristicValue (due to iteration).

arr28 commented 10 years ago

double[]s dealt with. Iterators are next most common.

image

arr28 commented 10 years ago

In 9BTTT, these are the only 4 objects of any consequence that still get collected.

image

arr28 commented 10 years ago

Remaining things that I plan to address under this bug if I can...

image

And here are the key allocation sites.

image

image

And here are the key allocation sites.

image

arr28 commented 10 years ago

For 9BTTT, as far as I can make out, the above tells me that 9.1% of the allocations are HashSet iterators allocated directly from GameSearcher.run. A further 16.7% are the same, allocated from GameSearcher.expandSearch.

The 23.9% generated by HashMap.put is unavoidable (without writing a HashMap replacement that doesn't do runtime allocation of resources, which isn't totally daft but isn't a target for this issue).

arr28 commented 10 years ago

Aha - culprits located.

From GameSearcher.run()...

              synchronized(getSerializationObject())
              {
                //  Must re-test for a termination request having obtained the lock
                if (!mTerminateRequested)
                {
                  for (MCTSTree tree : factorTrees)
                  {
                    tree.gameCharacteristics.setExplorationBias(currentExplorationBias);
                  }
                  complete = expandSearch(false);
                }
              }

...and from GameSearcher.expandSearch(), both...

      for (MCTSTree tree : factorTrees)
      {
        if (!tree.root.complete)
        {
          //  The trees may have very asymmetric sizes due to one being nearly
          //  complete, in which case it is possible that no candidates for trimming
          //  will be found.  This should not be possible in all trees if the node pool
          //  is nearly full, so check that at least one tree does release something
          somethingDisposed |= tree.root.disposeLeastLikelyNode();
        }
      }

...and...

    for (MCTSTree tree : factorTrees)
    {
      if (!tree.root.complete)
      {
        if (ThreadControl.ROLLOUT_THREADS > 0 && !forceSynchronous)
        {
          // If there's back-propagation work to do, do it now, in preference to more select/expand cycles because the
          // back-propagation will mean that subsequent select/expand cycles are more accurate.
          processCompletedRollouts(false);
        }

        // Perform an MCTS iteration.
        lAllTreesCompletelyExplored &= tree.growTree(forceSynchronous);
      }
    }

All of these iterate over factorTrees, which is a HashSet. I reckon there's no good reason why it can't be a plain array.
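
Sketched out, the change would look like this. Iterating a plain array with an enhanced for loop compiles to an indexed loop, so no Iterator object is allocated per pass. The member names below are taken from the snippets above; the array declaration itself is the hypothetical part.

  // factorTrees as a plain array, fixed once the factors are known.
  private final MCTSTree[] factorTrees;

  // ...

  for (MCTSTree tree : factorTrees)   // array for-each: no Iterator allocation
  {
    if (!tree.root.complete)
    {
      lAllTreesCompletelyExplored &= tree.growTree(forceSynchronous);
    }
  }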

arr28 commented 10 years ago

GC on 9BTTT is all done now. Garbage rate reduced to 1.4MB/s (which is all HashEntry).

arr28 commented 10 years ago

The remaining C4-Simultaneous allocations are in the somewhat grotty class FactorFilteredForwardDeadReckonLegalMoveSet, which implements a collection in a rather inefficient way...

I think the whole thing could be replaced by a static method which takes the current constructor args and returns an array. Then, rather than allocating an array, it could take a "static" array and write the contents into it, returning the number of elements that are valid (i.e. the effective length).
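
Something like this, hypothetically - the contains() test stands in for whatever factor-membership check the current class performs, and the signature is illustrative only:

  // Sketch only - hypothetical signature.  Writes the factor's legal moves into a
  // caller-supplied scratch array and returns the effective length, so nothing is allocated.
  static int filterMovesForFactor(Iterable<ForwardDeadReckonLegalMoveInfo> allMoves,
                                  Factor factor,
                                  ForwardDeadReckonLegalMoveInfo[] output)
  {
    int count = 0;
    for (ForwardDeadReckonLegalMoveInfo move : allMoves)
    {
      if (factor.contains(move))   // hypothetical membership test
      {
        output[count++] = move;
      }
    }
    return count;
  }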

arr28 commented 10 years ago

Sigh - there's another game that needs some attention.

Breakthrough (Small) - and possibly all games with PieceHeuristic enabled and/or greedy rollouts?

image

Here's where it's all being allocated. Note that it's effectively the same call-stack twice: once from the rollout threads and once from the game-search thread (which must be doing rollouts because the pipeline is full).

image

arr28 commented 10 years ago

So the to-do list here is...

If there's any chance that there are more categories of game that haven't been exercised yet, now's the time to spot them before the evaluation of JProfiler comes to an end. (I can grab the information as shown above and then fix at my leisure.) I've got 2 days left (and one of those, I've got something on in the evening, so won't be able to progress this).

arr28 commented 10 years ago

C4 Simultaneous dealt with. Still need to look at greedy rollout and/or piece heuristic games. Also, have a trawl through the last 24 hours of play to see if any other games are producing significant garbage.