Closed arr28 closed 10 years ago
Here's the same information in the (working) revision before my suspected one.
2M TreeNode
s with 2.8M TreeEdge
s. So whilst my theory about TreeEdges <= TreeNodes
isn't quite right, there are a more sensible balance of TreeEdge
s and TreeNode
s. Of the other relevant objects, there are 0.9M FDRIMS, BitSet & long[] (c.f. 11M above).
As suspected, the problem is in 0e7f032823b3eba0fb8d6906f82ba995f80c6fde. My "analysis" above is all highly speculative. I haven't actually looked at the code yet (and won't now get to it until tomorrow at the earliest).
Steve, are you able to comment?
Before we used to create a TreeEdge AND a TreeNode, with the state being in the TreeNode. Now we create only the TreeEdge (when the node is created later the state is just referencing the same object). HOWEVER, as you say, TreeEdge's were limited by the fixed pool size of TreeNodes before, whereas now they are not, so even though we have strictly LESS memory for a given number of expansions, we can wind up doing more expansions before we trim, which must be what is causing your problem. We're probably going to have to monitor the edge count too, and trim based on that getting too large. The only other alternative I can see is to not store the state at all and have to recalculate it every time on select, which would slow us down a lot.
I'll think on it, but any ideas would be welcome...
On Wed, Jun 4, 2014 at 4:47 PM, Andrew Rose notifications@github.com wrote:
Some time in the last few days, I've started getting OutOfMemoryErrors. Eventually (after ~1 hour of thrashing the disk), I managed to get a heap dump loaded up in MAT. Here's what it showed...
[image: image] https://cloud.githubusercontent.com/assets/3194913/3180242/9e7a1eb4-ec2e-11e3-93b2-0e3ec75c867c.png
Nothing else is >100MB. (The next largest item is 60MB.) These items account for ~2.7GB - which is 94% of my available heap (3GB).
Cracking out sizes of objects...
22,414,100 TreeEdge x 40 bytes = 897M (30%) 22,414,194 FDRLMI[] x 24 bytes = 538M (18%) 11,095,026 long[] x 32 bytes = 355M (12%) 11,094,896 FDRIMS x 32 bytes = 355M (12%) 11,094,975 BitSet x 24 bytes = 266M (9%) 1,278,188 TreeNode x 112 bytes = 143M (5%) 3,834,663 double[] x 35 bytes = 133M (4%) 1,054,375 TreeEdge[] x 104 bytes = 110M (4%)
I'm going to start chopping the check-in history to find out where this happened but my prime suspect is 0e7f032 https://github.com/SteveDraper/ggp-base/commit/0e7f032823b3eba0fb8d6906f82ba995f80c6fde (allocating nodes on demand). It added a ForwardDeadReckonInternalMachineState to a TreeEdge. That's bad because there are lots of TreeEdges (22M) of which half (11M) seem to have an associated ForwardDeadReckonInternalMachineState. TreeEdges are unlimited, whereas nodes are pegged to 2M (and I only got to 1.3M before everything fell over in a big heap (no pun intended)). I used to be able to get the full 2M nodes without the sky falling in.
Also, I haven't examined the code, but presumably we're creating TreeEdges for all moves, even before we create the child TreeNodes. So, previously, each TreeNode had 1 (or more - this is a DAG) inbound TreeEdges meaning that there were no more TreeEdges than nodes. (When we binned a TreeNode, did we bin the TreeEdge(s) leading to it?) Whereas now, there can be way more TreeEdges than TreeNodes because we pre-create TreeEdges before potentially expanding them into TreeNodes?
Whatever the cause, I'm completely dead in the water at the moment, even on my more powerful machine.
— Reply to this email directly or view it on GitHub https://github.com/SteveDraper/ggp-base/issues/84.
I agree with your earlier analysis, that edges used to be limited by the node pool, but are now not (well to within a factor of the branching factor anyway). I think the answer is to track the number of edges in circulation, and to trim when it exceeds a certain amount. Trimming nodes will also trim edges, but rather than trimming sparsely I think we need to modify the trimming algorithm to identify entire sub trees (rather than individual leaf nodes) to trim, and always trim so as to leave the parent of the trimmed tree as a (once again) totally unexpanded node. That approach should have several advantages:
1) it reduces edge count effectively 2) it doesn't leave partially trimmed nodes scattered around, which simplifies out an entire case
I wanted to address (2) anyway, because I suspect it is behind our poor play in some games. From the graphs I have noticed that these correlate quite well with games in which we fill the node pool (and hence trim a lot). They prevent ucb selection through nodes with trimmed children (selection from such a node is just a random choice between the untrimmed children).
I can see if I can get anywhere with this tomorrow if you don't get there first. I spent today reworking the fast propagator as I mentioned at the weekend, and that seems to have been successful (only 5-10% faster, but it all helps), so (subject to any issues that show up overnight on Tiltyard) that should leave me free to look at something else (such as edge trimming).
Sent from my iPad
On Jun 4, 2014, at 5:26 PM, Andrew Rose notifications@github.com wrote:
As suspected, the problem is in 0e7f032. My "analysis" above is all highly speculative. I haven't actually looked at the code yet (and won't now get to it until tomorrow at the earliest).
Steve, are you able to comment?
— Reply to this email directly or view it on GitHub.
I'm possibly missing something here, but consider this.
The old way of doing things was...
(Selected node in blue, expanded node that we're going to depth charge from in red, uselessly expanded nodes in white.)
The new way of doing things is...
So we've lost the uselessly expanded nodes, but we still have a lot of useless edges which (a) are no longer capped at 2M and (b) have been given a machine state which adds considerably to their size.
Is there any good reason why we can't move to this...?
The array of children in the blue node simply has null values for the unexpanded children. A further selection through the blue node would expand one of those children.
(Previously expanded node in yellow.)
That would roughly double selection processing time. The reason is that everywhere we select to will end in a leaf, which will require us to determine the state of the newly selected leaf node during selection (previously it would have been determined when the edge was created when the state machine was already in the correct state in order to generate the legal moves). Basically it means propagating each node's state twice (once on expansion to determine legal moves, and then again on selection to determine child state). Currently the edges act as a cache to avoid this.
However, I had an idea last night that would allow us to trade off performance for memory much more smoothly (and much less performance loss). Suppose we DO allocate the edges, but instead of keeping a state ref in the edge (which is the really big cost) we instead keep allocated states in a large (64K ish sort of size) ring buffer, and put a sequence number in the ruing buffer in the edge. That way if the edge was a recent one (within the ring buffer size in expansion history) then on select we'll be able to retrieve the state from the ring buffer and suffer no performance penalty. In the (much rarer, since MCTS will cause the majority of expansions to occur in 'hot' areas) case where the selected edge has a sequence number that is older than the ring buffer length relative to the current ring buffer allocation seq we can reconstruct it from the state machine. This should be pretty easy to do.
Steve
On Thu, Jun 5, 2014 at 3:16 AM, Andrew Rose notifications@github.com wrote:
I'm possibly missing something here, but consider this.
The old way of doing things was...
[image: image] https://cloud.githubusercontent.com/assets/3194913/3184697/adcd7690-ec88-11e3-9741-31b828ed88ed.png
(Selected node in blue, expanded node that we're going to depth charge from in red, uselessly expanded nodes in white.)
The new way of doing things is...
[image: image] https://cloud.githubusercontent.com/assets/3194913/3184717/01358656-ec89-11e3-8886-74e579d9f756.png
So we've lost the uselessly expanded nodes, but we still have a lot of useless edges which (a) are no longer capped at 2M and (b) have been given a machine state which adds considerably to their size.
Is there any good reason why we can't move to this...?
[image: image] https://cloud.githubusercontent.com/assets/3194913/3184761/b22629ac-ec89-11e3-9bb4-f82dc5fb6729.png
— Reply to this email directly or view it on GitHub https://github.com/SteveDraper/ggp-base/issues/84#issuecomment-45192652.
That would roughly double selection processing time.
Really? Presumably on node expansion currently, the operations are...
So for each legal, we propagate twice - once to load S and once to apply L and get the next state. All I'm suggesting is that we don't do all of them up-front. Why is it more expensive to load S when we come to it again in the future than it is to re-load it as part of creating all the TreeEdge
s?
No. It goes: 1) Propagate state S 2) Query legals (no extra propagation) 3) For each legal query next state (very minimal extra propagation because it is just the move prop that changes NOT the base props - this costs much less than imposing the base props in almost all games) and store it in the edge
There is only one propagation of base props to generate the legals and all the successor states. If we don't cache the states in the edges (or similar - see other emails) then we have to re-propagate the base props to calculate each child state. That is significantly more expensive. Note however, that if we do not know a priori (via depth from initial) that the successor state cannot be terminal, we also have to propagate the successor states at expansion time to determine terminality, and that cost applies in either scheme.
The argument can be made (at least for games where mostly we do not have a priori knowledge of non-terminality) that successor state terminality determination dominates over successor state determination itself, and in THAT case having to re-determine the successor state on selection is a small overhead, since on any selection we'll be doing an expansion as well. For that reason I don't plan to bother with the ring buffer (see earlier email) until I've seen how much impact having to do the extra propagation to determine state on selection has. My expectation is that the impact will actually be low in most games (where we have to calculate terminality of successors anyway), but high in games where we don't (Dot&Boxes, Reversi, ...). Adding the ring-buffer caching is therefore a secondary concern since it would impact only a subset of games.
Steve
On Thu, Jun 5, 2014 at 8:16 AM, Andrew Rose notifications@github.com wrote:
That would roughly double selection processing time.
Really? Presumably on node expansion currently, the operations are...
- Get the legals in the selected node, S (which we calculated at the time that node was expanded)
- For each legal move L
- Load the state for S
- Get the next state as a result of doing L and store it in the edge (+ also the child node if this is the 1 node we're choosing the expand through right now)
So for each legal, we propagate twice - once to load S and once to apply L and get the next state. All I'm suggesting is that we don't do all of them up-front. Why is it more expensive to load S when we come to it again in the future than it is to re-load it as part of creating all the TreeEdges?
— Reply to this email directly or view it on GitHub https://github.com/SteveDraper/ggp-base/issues/84#issuecomment-45217343.
Ah, I think I understand now. In step (3), you don't propagate the new child state, you just read it off the transitions? Therefore, there's nothing to reset to get back to the original state (in order to do the next legal).
FWIW, I quite like the ring buffer idea, but agree that it's a nice extra.
Ok. I'll aim to at least have the basic no-states-in-edges (without the ring buffer) done today. Suggest you take the re-use state in node on reset/re-use optimization meanwhile, since you're (presumably) held up from ongoing ELB work until this is resolved.
Steve
On Thu, Jun 5, 2014 at 10:12 AM, Andrew Rose notifications@github.com wrote:
Ah, I think I understand now. In step (3), you don't propagate the new child state, you just read it off the transitions? Therefore, there's nothing to reset to get back to the original state (in order to do the next legal).
FWIW, I quite like the ring buffer idea, but agree that it's a nice extra.
— Reply to this email directly or view it on GitHub https://github.com/SteveDraper/ggp-base/issues/84#issuecomment-45231864.
Running with the latest develop branch code.
This was a game of D&B with 15s deadlines. Although I'm not missing the deadlines now (nor suffering OutOfMemoryErrors), I am still having considerable occupancy and GC problems.
Some observations...
I have a hunch that expansion rate (before it becomes erratic) is considerably better than before. +50%??
I'm going to try two things.
I'm testing a version that no longer allocated Edges at all until they are selected through. Hope to have that ready tonight (but not sure - it was a rather fiddly change)
On Fri, Jun 6, 2014 at 3:31 PM, Andrew Rose notifications@github.com wrote:
Running with the latest develop branch code.
This was a game of D&B with 15s deadlines. Although I'm not missing the deadlines now (nor suffering OutOfMemoryErrors), I am still having considerable occupancy and GC problems.
[image: image] https://cloud.githubusercontent.com/assets/3194913/3205502/841d5498-edb7-11e3-807c-28be2047d433.png
Some observations...
- At the end of turn 9, I have completely filled my heap.
- Until that point, GC has been excellent (<100ms).
- Memory usage drops off steadily from turn 9.
- Until the end game, I never manage to use more than 33% of the node pool, and usually peaking at more like 20%.
- GC gets progressively worse throughout the game.
- As GC gets worse, expansion rate becomes increasingly erratic.
I have a hunch that expansion rate (before it becomes erratic) is considerably better than before. +50%??
I'm going to try two things.
- Firstly, I've always run with a target GC time of 800ms. I now gather the default is 800ms, so I'm going to try with that.
- Secondly, I clearly need to run a game with a longer clock. Tiltyard can easily give a minute by the time it's been a bit late getting back and I need to now that we won't suffer a disaster.
— Reply to this email directly or view it on GitHub https://github.com/SteveDraper/ggp-base/issues/84#issuecomment-45381058.
Using the default target (i.e. not specifying a target), I get smoother performance later on (mostly peaks of 500 with occasional peaks of 800 rather than frequent peaks of 800 with occasional peaks pushing on for a second as seen in the last graph). However, I suffered two massive collections in turns 8 and 13 which were lucky not to cause a missed deadline.
Increasing the deadline to 1 minute (keeping the 200ms GC pause target) has pushed me over the edge. I missed the deadline at the end of turn 2 (and, by the looks of it, was fortunate not to miss it at the end of 7, 8 & 9 too).
I managed to fill my node table once but, due to GC, have typically only been getting to 75% full subsequently. Also, I'm repeatedly crashing into the ceiling in terms of memory allocation (which is what's triggering the big collections).
I'm looking forward to the improvements that you're adding now.
Once I've picked up your changes (and you've finished working in the area), I'm planning to add pooling to any frequently allocated objects that we create and discard. I'll probably create unbounded pools for them, because the I think they're now limited by the number of nodes. I just want to be recycling them rather than throwing them away, buying new ones & paying the price in GC. I can see my blog post title already - "Sancho goes Green"!
Ok - I have some ideas for more efficiently pooling / laying out some of the MCTS tree structure stuff (edges etc) which it would be worth discussing tomorrow. Would be nice to get a cache-friendly layout of the frequent access patterns at the same time
On Fri, Jun 6, 2014 at 4:17 PM, Andrew Rose notifications@github.com wrote:
Increasing the deadline to 1 minute (keeping the 200ms GC pause target) has pushed me over the edge. I missed the deadline at the end of turn 2 (and, by the looks of it, was fortunate not to miss it at the end of 7, 8 & 9 too).
[image: image] https://cloud.githubusercontent.com/assets/3194913/3205982/99f38222-edbe-11e3-8225-acbf627e0a48.png
I managed to fill my node table once but, due to GC, have typically only been getting to 75% full subsequently. Also, I'm repeatedly crashing into the ceiling in terms of memory allocation (which is what's triggering the big collections).
I'm looking forward to the improvements that you're adding now.
Once I've picked up your changes (and you've finished working in the area), I'm planning to add pooling to any frequently allocated objects that we create and discard. I'll probably create unbounded pools for them, because the I think they're now limited by the number of nodes. I just want to be recycling them rather than throwing them away, buying new ones & paying the price in GC. I can see my blog post title already - "Sancho goes Green"!
— Reply to this email directly or view it on GitHub https://github.com/SteveDraper/ggp-base/issues/84#issuecomment-45385360.
It's probably instructive to see the end of that last game (which has just finished). I saw that same steady decline in heap usage again, which meant that I didn't hit the ceiling and trigger another long GC pause. Peaks were ~1 second.
It spent considerable periods with a full node table without any apparent negative side effects.
Sounds like we're making progress :-)
On Fri, Jun 6, 2014 at 4:45 PM, Andrew Rose notifications@github.com wrote:
It's probably instructive to see the end of that last game (which has just finished). I saw that same steady decline in heap usage again, which meant that I didn't hit the ceiling and trigger another long GC pause. Peaks were ~1 second.
It spent considerable periods with a full node table without any apparent negative side effects.
[image: image] https://cloud.githubusercontent.com/assets/3194913/3206267/4ac4701c-edc3-11e3-9c22-e602508e8e68.png
— Reply to this email directly or view it on GitHub https://github.com/SteveDraper/ggp-base/issues/84#issuecomment-45387596.
And with your latest changes, this is looking much better on the occupancy front. Here's a 15s game of D&B (200ms GC target).
Now I just need to add pooling of the remaining objects to see if I can deal with the GC.
We're now pooling TreeNodes and TreeEdges, but what else do we churn through? Here's a snapshot of everything that has been garbage collected after running a game of C4 with 60s deadlines for a few minutes.
This is all objects where we've collected at least 100,000 instances. (The next item on the list had 23K instances collected).
Dealing with TreeNodeRefs was already on my list (because they're relatively easy to deal with). However, it looks like by far the biggest nightmare here (by size and by instance count) is the iterator in FDRIMS.
Here's the culprit...
@Override
public Iterator<ForwardDeadReckonPropositionInfo> iterator()
{
return new InternalMachineStateIterator(this);
}
Steve - I have a recollection that you did something to re-use iterators in some other case. Might the same techniques apply here? I'll fix, but if you can easily give me a nudge in the right direction, please do.
Yes. The key observation (for the others, which were ForwardDeadReckonLegalMoveSetIterator) was that the parent collection was only ever used by one thread, and the call pattern was always such that there were no nested enumerations of the same collection. Consequently the parent collection could allocate a single iterator and reuse it (calling some form of reset() on it each time it passes it out). However, this doesn't apply here, because machine states are passed to rollouts as starting points to roll out from, and hence are used from multiple threads. MOST usages will be within a single thread, but not all, so unless we can figure out a way to restrict the case (perhaps with some explicit method to retrieve a non-thread-safe (singleton) iterator or something) we cannot use the same technique. The problem with iterators in Java is that the usage pattern does not tell you when the caller is done with it.
Most of the enumerations that use these iterators are in one place - in ForwardDeadReckonPropNetStateMachine.setBasePropositionsFromState(). All of THOSE uses ARE guaranteed to be on the same thread for any given state machine, so if we could create an iterator 'uncoupled' from its parent container, then 'attach' it, we could create a singleton in the state machine, attach it to the InternalMachineState instance and then enumerate (the attach would set it to the first element). That way the major usage would not require allocation.
On Tue, Jun 10, 2014 at 3:01 PM, Andrew Rose notifications@github.com wrote:
Here's the culprit...
@Override public Iterator
iterator() { return new InternalMachineStateIterator(this); } Steve - I have a recollection that you did something to re-use iterators in some other case. Might the same techniques apply here? I'll fix, but if you can easily give me a nudge in the right direction, please do.
— Reply to this email directly or view it on GitHub https://github.com/SteveDraper/ggp-base/issues/84#issuecomment-45663963.
Okay, so having dealt with the Iterator (using a single per-thread Iterator conveniently squirrelled away in a ThreadLocal), the big swingers are...
The next biggest (LinkedList$Node) has 4.7M.
Note that, due to a limitation of the tool, the above does not include array allocation.
The intermediary list of moves in TreeNode.expand() which we talked about at the weekend will be a big one. That's trivial to remove now (getLegalMoves() returns a collection which we happen to be treating just as an iterable, and if you start treating it as a collection (so you can call size()) you'll be able to directly use it I think. The call to remove() is not needed any more and the loop it's in can be a simple iterator over the collection now instead of the explicit indexing.
I can do this easily I think if you prefer me to tackle that one...?
On Tue, Jun 10, 2014 at 4:14 PM, Andrew Rose notifications@github.com wrote:
Okay, so having dealt with the Iterator (using a single per-thread Iterator conveniently squirrelled away in a ThreadLocal), the big swingers are...
[image: image] https://cloud.githubusercontent.com/assets/3194913/3237095/1aadb8ce-f0e4-11e3-9a97-29f67b7c3548.png
The next biggest (LinkedList$Node) has 4.7M.
Note that, due to a limitation of the tool, the above does not include array allocation.
— Reply to this email directly or view it on GitHub https://github.com/SteveDraper/ggp-base/issues/84#issuecomment-45672788.
BTW - I don't think you need the overhead of the ThreadLocal. Any given instance of the state machine is guaranteed to only be used on one thread at once.
On Tue, Jun 10, 2014 at 4:14 PM, Andrew Rose notifications@github.com wrote:
Okay, so having dealt with the Iterator (using a single per-thread Iterator conveniently squirrelled away in a ThreadLocal), the big swingers are...
[image: image] https://cloud.githubusercontent.com/assets/3194913/3237095/1aadb8ce-f0e4-11e3-9a97-29f67b7c3548.png
The next biggest (LinkedList$Node) has 4.7M.
Note that, due to a limitation of the tool, the above does not include array allocation.
— Reply to this email directly or view it on GitHub https://github.com/SteveDraper/ggp-base/issues/84#issuecomment-45672788.
Here we are again, this time with array allocations too.
So it looks like we're allocating 2 long[] per FDRIMS. I think they'll be accounted for by the BitSet in the FDRIMS (although some of the maths isn't quite right).
Also, approximately as many boolean[] and double[] as there are FDRIMS (although I'm not expecting them to be directly related).
Oh - I think the reason we're allocating 2 long[] per FDRIMS is because we don't tell the BitSet what size it should be. As a result, it allocates an array of size 1 and then, when we set the first proposition > 64, it has to reallocate the array.
Also, the graph above shows a relatively rosy picture of FDRIMS allocation. In any game with the piece heuristic, it would be much worse (because we create temporary FDRIMS to do some calculations).
The next commit should fix both of those.
Okay, the "standard profile" for measurement runs is C4 with 30/60 deadlines, recording for 1 minute from approximately the middle of the second turn.
Here's one from before the TNR fix.
And here's one on develop. Note that, although I've aimed to get some comparable absolute figures, the profiler so massively affects performance (we get ~1/10th of the rollout rate when it's running) that the figures aren't really directly comparable. The real benefit is in the difference between numbers of objects.
So the TNR has been completely removed from GC (obviously, the class doesn't even exist any more) and there are now the same number of long[] collections as there are OpenBitSet (not BitSet) and FDRIMS collections. I also hope that some of the reduction in state machine allocations is because we no longer record temporary ones, but as per the comment above, I'm not hopeful that the numbers are really directly comparable.
In C4, I think the key allocation site of FDRIMS is...
...and a possible solution is for mChildStatesBuffer
(in MCTSTree
) to be a pre-allocated array of re-usable FDRIMS. TreeNode.expand
(the only user of mChildStatesBuffer
) would reset the state and then pass it to a new version of FDRPSM.getNextState
taking an extra FDRIMS as the object to create the new state into. The would call through to a new version of FDRPSM.getInternalStateFromBase
which against would take an FDRIMS as a parameter and re-use it, rather than reallocating. Finally, expand() would copy the resultant state into an FDRIMS that lives permanently in the new child TreeNode
(rather than changing the variable to refer to a newly allocated object).
Beware that the key allocation site may vary in other games. The following factors may well be relevant.
Sounds like a good plan. Only thing to potentially beware of when implementing is that the last state applied is stored (a ref to it that is in lastInternalSetState) by setBasePropositionsFromState(). There is already code there (conditionally invoked dependent on he 'isolate' parameter) to make this an actual copy rather than just a re-use by reference, but its something to watch out for.
By doing the plan above, we now have significantly fewer allocations of FDRIMS, but there are still a lot of them. FDRIMS used to be collected at 130% of the rate at which double[]s were collected. Now it's just 60%. I haven't really managed to get a grip of JProfiler to figure out where the remaining key allocation sites are.
Okay, so I think the remaining allocation sites for FDRIMS (in C4) are...
RolloutProcessor
GameSearcher
Can you tell what the ratio of FDRIMS to node allocations is (the latter are pooled so obviously not from GC stats)?
The main source is likely to be the construction in getNextState(ForwardDeadReckonInternalMachineState, Factor, ForwardDeadReckonLegalMoveInfo[]) - obviously there will be the state in each node.
Apart from that what game are you using? If it uses greedy rollouts then the allocation of decisionState.state during playouts at each state will likely dominate (and can probably be engineered away - the greedy rollout process is not terribly efficient)
Steve
On Fri, Jun 13, 2014 at 2:44 PM, Andrew Rose notifications@github.com wrote:
By doing the plan above, we now have significantly fewer allocations of FDRIMS, but there are still a lot of them. FDRIMS used to be collected at 130% of the rate at which double[]s were collected. Now it's just 60%. I haven't really managed to get a grip of JProfiler to figure out where the remaining key allocation sites are.
[image: image] https://cloud.githubusercontent.com/assets/3194913/3274704/e09fd580-f332-11e3-8531-41a31c32884f.png
— Reply to this email directly or view it on GitHub https://github.com/SteveDraper/ggp-base/issues/84#issuecomment-46052283.
In non-greedy-rollout games (e.g. 9B-TTT), FDRIMS is now way down the list of things we care about.
The remaining allocations are done as per the GameSearcher stack above.
Next on the hit list must be all those double[]
s and TreePathElement
s. Mustn't forget to return to the greedy rollout scenarios though.
I can take a look at that this weekend if you like since I'm probably more familiar with that code.,,,?
On Fri, Jun 13, 2014 at 3:35 PM, Andrew Rose notifications@github.com wrote:
In non-greedy-rollout games (e.g. 9B-TTT), FDRIMS is now way down the list of things we care about.
[image: image] https://cloud.githubusercontent.com/assets/3194913/3275227/3162ce0e-f339-11e3-8162-c2dbc28cd39a.png
The remaining allocations are done as per the GameSearcher stack above.
Next on the hit list must be all those double[]s and TreePathElements. Mustn't forget to return to the greedy rollout scenarios though.
— Reply to this email directly or view it on GitHub https://github.com/SteveDraper/ggp-base/issues/84#issuecomment-46056945.
Steve,
Not sure exactly which of the above you're offering to look at. If it's the greedy rollouts, then feel free - but do familiarise yourself with what I've done first.
I'm about to check in an enhancement that should remove FDRIMS allocations from the non-greedy-rollout case (createChildNodeForEdge
). I'm definitely planning to deal with TreePathElement
s myself (I know how I'm going to do it) and plan to at least look at where all the double[]
s are coming from.
Andrew
Yes, I meant the greedy rollouts, and I'll make sure I understand what you've done already first ;-)
On Fri, Jun 13, 2014 at 4:15 PM, Andrew Rose notifications@github.com wrote:
Steve,
Not sure exactly which of the above you're offering to look at. If it's the greedy rollouts, then feel free - but do familiarise yourself with what I've done first.
I'm about to check in an enhancement that should remove FDRIMS allocations from the non-greedy-rollout case (createChildNodeForEdge). I'm definitely planning to deal with TreePathElements myself (I know how I'm going to do it) and plan to at least look at where all the double[]s are coming from.
Andrew
— Reply to this email directly or view it on GitHub https://github.com/SteveDraper/ggp-base/issues/84#issuecomment-46060621.
Fixed bug in machine state reuse that was causing failure when a pre-allocated instance was used in creating new child nodes
Excellent - in 9BTTT there are now no FDRIMS collections.
We no longer collect any TreePath
s or TreePathElement
s. Also, there are no ArrayList
s and very few ArrayList$Itr
ators.
The real big one remaining is all those double[]
s which I'll look at next.
General allocation hotspots (all objects) are TreeNode.updateStats
and TreeNode.expand
and to a lesser extent, CombinedHeuristic.getHeuristicValue
(due to iteration).
double[]
s dealt with. Iterators are next most common.
In 9BTTT, these are the only 4 objects of any consequence that still get collected.
HashMap$KeyIterator
s are coming from.TreeEdge
s are leaking. I knew that I hadn't found all the places that they could become garbage. I'll have another hunt. If I find some relevant places, it's just a case of freeing them back to the pool when they're no longer needed.Object[]
s are the children of TreeNode
s. I have a plan for them.TreeNode
s from the transposition table when we recycle them, there's nothing we can do about it. They could conceivably also be to do with Heuristics and there might be something we can do about that.Remaining things that I plan to address under this bug if I can...
And here are the key allocation sites.
And here are the key allocation sites.
For 9BTTT, as far as I can make out, the above tells me that 9.1% of the allocations are HashSet.iterator and are being allocated directly from GameSearcher.run. A further 16.7% are the same, being allocated from GameSearcher.expandSearch.
The 23.9% generated by HashMap.put is unavoidable (without writing a HashMap replacement that doesn't do runtime allocation of resources, which isn't totally daft but isn't a target for this issue).
Aha - culprits located.
From GameSearcher.run()
...
synchronized(getSerializationObject())
{
// Must re-test for a termination request having obtained the lock
if (!mTerminateRequested)
{
for (MCTSTree tree : factorTrees)
{
tree.gameCharacteristics.setExplorationBias(currentExplorationBias);
}
complete = expandSearch(false);
}
}
...and from GameSearcher.expandSearch()
, both...
for (MCTSTree tree : factorTrees)
{
if (!tree.root.complete)
{
// The trees may have very asymmetric sizes due to one being nearly
// complete, in which case it is possible that no candidates for trimming
// will be found. This should not be possible in all trees if the node pool
// is nearly full, so check that at least one tree does release something
somethingDisposed |= tree.root.disposeLeastLikelyNode();
}
}
...and...
for (MCTSTree tree : factorTrees)
{
if (!tree.root.complete)
{
if (ThreadControl.ROLLOUT_THREADS > 0 && !forceSynchronous)
{
// If there's back-propagation work to do, do it now, in preference to more select/expand cycles because the
// back-propagation will mean that subsequent select/expand cycles are more accurate.
processCompletedRollouts(false);
}
// Perform an MCTS iteration.
lAllTreesCompletelyExplored &= tree.growTree(forceSynchronous);
}
}
All of these iterate over factorTrees
which is a HashSet
. I reckon there's no good reason why it can't be a plain array.
GC on 9BTTT is all done now. Garbage rate reduced to 1.4MB/s (which is all HashEntry).
The remaining C4-Simultaneous allocations are in the somewhat grotty class FactorFilteredForwardDeadReckonLegalMoveSet
. This implements a collection in a somewhat inefficient way...
I think the whole thing could be replaced by a static method which takes the current constructor args and returns an array. Then, rather than allocating an array, it could take a "static" array and write the contents into it, returning the number of elements that are valid (i.e. the effective length).
Sigh - there's another game that needs some attention.
Breakthrough (Small) - and possibly all games with PieceHeuristic enabled and/or greedy rollouts?
Here's where it's all being allocated. Note that it's effectively the same call-stack twice. Once from the rollout threads and once from the game-search thread (which must be doing rollouts because the pipeline is full).
So the to-do list here is...
If there's any chance that there are more categories of game that haven't been exercised yet, now's the time to spot them before the evaluation of JProfiler comes to an end. (I can grab the information as shown above and then fix at my leisure.) I've got 2 days left (and one of those, I've got something on in the evening, so won't be able to progress this).
C4 Simultaneous dealt with. Still need to look at greedy rollout and/or piece heuristic games. Also, have a trawl through the last 24 hours of play to see if any other games are producing significant garbage.
Some time in the last few days, I've started getting
OutOfMemoryError
s. Eventually (after ~1 hour of thrashing the disk), I managed to get a heap dump loaded up in MAT. Here's what it showed...Nothing else is >100MB. (The next largest item is 60MB.) These items account for ~2.7GB - which is 94% of my available heap (3GB).
Cracking out sizes of objects...
I'm going to start chopping the check-in history to find out where this happened but my prime suspect is 0e7f032823b3eba0fb8d6906f82ba995f80c6fde (allocating nodes on demand). It added a
ForwardDeadReckonInternalMachineState
to aTreeEdge
. That's bad because there are lots ofTreeEdge
s (22M) of which half (11M) seem to have an associatedForwardDeadReckonInternalMachineState
.TreeEdge
s are unlimited, whereas nodes are pegged to 2M (and I only got to 1.3M before everything fell over in a big heap (no pun intended)). I used to be able to get the full 2M nodes without the sky falling in.Also, I haven't examined the code, but presumably we're creating
TreeEdge
s for all moves, even before we create the childTreeNode
s. So, previously, eachTreeNode
had 1 (or more - this is a DAG) inboundTreeEdge
s meaning that there were no moreTreeEdge
s than nodes. (When we binned aTreeNode
, did we bin theTreeEdge
(s) leading to it?) Whereas now, there can be way moreTreeEdge
s thanTreeNode
s because we pre-createTreeEdge
s before potentially expanding them intoTreeNode
s?Whatever the cause, I'm completely dead in the water at the moment, even on my more powerful machine.