SanchoGGP / ggp-base

The General Game Playing Base Package
8 stars 4 forks source link

Attempt to get(NULL_REF) from the node pool #271

Open arr28 opened 9 years ago

arr28 commented 9 years ago

When playing this Reversi match...

23:48:59.229 - Starting turn 42
23:48:59.408 - MCTS iterations last turn = 118650
23:48:59.408 - Free all but rooted in state: ( ( true ( cell 6 5 black ) ), ( true ( cell 5 4 black ) ), ( true ( cell 2 8 red ) ), ( true ( cell 6 7 red ) ), ( true ( cell 5 5 black ) ), ( true ( cell 4 4 black ) ), ( true ( cell 8 4 red ) ), ( true ( cell 7 4 black ) ), ( true ( cell 2 4 black ) ), ( true ( cell 8 6 black ) ), ( true ( cell 2 5 red ) ), ( true ( cell 5 7 black ) ), ( true ( cell 2 1 black ) ), ( true ( cell 3 5 black ) ), ( true ( cell 3 6 black ) ), ( true ( cell 7 5 black ) ), ( true ( cell 3 8 red ) ), ( true ( cell 5 6 black ) ), ( true ( cell 8 5 black ) ), ( true ( cell 6 3 black ) ), ( true ( cell 3 4 black ) ), ( true ( cell 3 3 black ) ), ( true ( cell 4 8 red ) ), ( true ( cell 6 8 red ) ), ( true ( cell 2 3 black ) ), ( true ( cell 3 1 black ) ), ( true ( cell 4 3 black ) ), ( true ( cell 5 3 black ) ), ( true ( cell 7 2 red ) ), ( true ( cell 5 1 black ) ), ( true ( cell 4 6 black ) ), ( true ( cell 3 2 black ) ), ( true ( cell 6 1 black ) ), ( true ( cell 1 3 black ) ), ( true ( cell 6 2 black ) ), ( true ( cell 4 5 red ) ), ( true ( cell 6 4 red ) ), ( true ( cell 2 6 red ) ), ( true ( cell 3 7 red ) ), ( true ( cell 4 2 black ) ), ( true ( cell 1 6 red ) ), ( true ( cell 6 6 red ) ), ( true ( cell 1 4 black ) ), ( true ( cell 5 2 black ) ), ( true ( cell 4 1 black ) ), ( true ( control red ) ) )
23:48:59.472 - Freed 14% of allocated nodes (29013 of 206868)
23:48:59.472 - Root has 6 children, and is not complete
Exception in thread "Search Processor" 23:49:17.755 - Exception in thread "Search Processor"
23:49:17.755 - java.lang.ArrayIndexOutOfBoundsException: -1
23:49:17.755 -  at org.ggp.base.player.gamer.statemachine.sancho.pool.CappedPool.get(CappedPool.java:133)
23:49:17.755 -  at org.ggp.base.player.gamer.statemachine.sancho.TreeNode.get(TreeNode.java:5812)
23:49:17.755 -  at org.ggp.base.player.gamer.statemachine.sancho.TreeNode.get(TreeNode.java:5829)
23:49:17.755 -  at org.ggp.base.player.gamer.statemachine.sancho.TreeNode.checkChildCompletion(TreeNode.java:1640)
23:49:17.755 -  at org.ggp.base.player.gamer.statemachine.sancho.TreeNode.expandInternal(TreeNode.java:3805)
23:49:17.755 -  at org.ggp.base.player.gamer.statemachine.sancho.TreeNode.expandInternal(TreeNode.java:3213)
23:49:17.755 -  at org.ggp.base.player.gamer.statemachine.sancho.TreeNode.expand(TreeNode.java:2586)
23:49:17.755 -  at org.ggp.base.player.gamer.statemachine.sancho.MCTSTree.selectAction(MCTSTree.java:976)
23:49:17.755 -  at org.ggp.base.player.gamer.statemachine.sancho.MCTSTree.growTree(MCTSTree.java:809)
java.lang.ArrayIndexOutOfBoundsException: -1
    at org.ggp.base.player.gamer.statemachine.sancho.pool.CappedPool.get(CappedPool.java:133)
    at org.ggp.base.player.gamer.statemachine.sancho.TreeNode.get(TreeNode.java:5812)
    at org.ggp.base.player.gamer.statemachine.sancho.TreeNode.get(TreeNode.java:5829)
    at org.ggp.base.player.gamer.statemachine.sancho.TreeNode.checkChildCompletion(TreeNode.java:1640)
    at org.ggp.base.player.gamer.statemachine.sancho.TreeNode.expandInternal(TreeNode.java:3805)
    at org.ggp.base.player.gamer.statemachine.sancho.TreeNode.expandInternal(TreeNode.java:3213)
    at org.ggp.base.player.gamer.statemachine.sancho.TreeNode.expand(TreeNode.java:2586)
    at org.ggp.base.player.gamer.statemachine.sancho.MCTSTree.selectAction(MCTSTree.java:976)
    at org.ggp.base.player.gamer.statemachine.sancho.MCTSTree.growTree(MCTSTree.java:809)
    at org.ggp.base.player.gamer.statemachine.sancho.GameSearcher.expandSearch(GameSearcher.java:758)
    at org.ggp.base.player.gamer.statemachine.sancho.GameSearcher.run(GameSearcher.java:356)
    at java.lang.Thread.run(Thread.java:744)
23:49:17.755 -  at org.ggp.base.player.gamer.statemachine.sancho.GameSearcher.expandSearch(GameSearcher.java:758)
23:49:17.755 -  at org.ggp.base.player.gamer.statemachine.sancho.GameSearcher.run(GameSearcher.java:356)
23:49:17.755 -  at java.lang.Thread.run(Thread.java:744)
23:49:20.929 - Average fringe depth: 21.095410826455797
23:49:20.929 - Fringe depth variability: 0.014464358589731318
23:49:20.929 - Average branching factor: 3.86305401619638
23:49:20.929 - Tree aspect ratio: 2.283546959663762
arr28 commented 9 years ago

We're trying to fetch a node with NULL_REF. Comments on the call stack...

  1. CappedPool.get - We're looking at array index -1 in the pool. Oops.
  2. TreeNode.get(CappedPool, long) - We would have asserted xiNodeRef != NULL_REF if assertions were enabled. (NULL_REF is -1.)
  3. TreeNode.get(long) just passes through to (2), using the same pool as the pool used for this node. I think we only have multiple pools when we have multiple factors and I don't think Reversi is considered to have multiple factors.
  4. TreeNode.checkChildCompletion. Okay, so here's where we must be going wrong. This is the code...
          //  No need to process hyper-edges to determine completion
          if ( edge.isHyperEdge() )
          {
            break;
          }

          if (edge.getChildRef() == NULL_REF)
          {
            allImmediateChildrenComplete = false;
          }
          else if (get(edge.getChildRef()) != null)
          {
            TreeNode lNode = get(edge.getChildRef());
            <...>

We're failing in the else if.

Have we hit a race condition where the edge has been freed between the check edge.getChildRef() == NULL_REF and the subsequent attempt to get the edge? Is that even something that can happen?

If so, the fix is simply to pull out the result of getChildRef into a variable and then use that. Then check for the same pattern elsewhere in the code to find similar bugs.

SteveDraper commented 9 years ago

Only one thread should be manipulating the tree so such a race condition should not be possible! If it is then we have a bug in our threading and likely need to chnage things around (or add locking, which we'd rather avoid).

In particular all completion, trimming etc. is done on the main search thread for exactly this reason. Perhaps we can add some debug code that stores the thread id of the search thread in a tree member and asserts it matches the current thread in node/edge alloc/frees as a way to debug this...

arr28 commented 9 years ago

Yeah - I realised that whilst I was out at lunch.

However, unless the code has changed and the code I'm looking at doesn't match the stack, it must be what happened. The other functions are so simple that nothing else should have happened.

arr28 commented 9 years ago

In the absence of a repro, not realistic to think that I'm going to get to the bottom of this before the competition. Also, since it has only been seen once, not especially important.