Avoid creating improper regions after loop versioner

0xdaryl commented 3 years ago

The following design was implemented in loop versioner many years ago, but never enabled. There are situations where the presence of improper regions is inhibiting some other optimizations (such as global value propagation) from finding opportunities. This design should prevent that.

The goal of this issue is to find, modernize, test, and enable this logic to ensure loop versioner does not create improper regions.

@a7ehuo @vijaysun-omr

Sometimes, loop versioner performs 'loop transfer', where control flow is changed such that if a virtual guard fails in the 'fast' version of the loop, the virtual call in the 'slow' version will be reached. This this has the benefit that the 'fast' version of the loop does not contain the virtual call anymore, meaning more aggressive optimization is possible (more expressions are loop invariant for example). However this transformation almost always leads to an improper region being created, since control is transferred to a point in the 'slow' loop that is not the entry of the region (the virtual call block is a successor of the virtual guard block, which is also in the 'fast' loop).

One solution for this problem would be to create an 'artificial' entry point in the 'slow' version of the loop, where all the failing virtual guards branch to. All the failing virtual guards store a different value into a temp on the failing path, and there is a switch in the 'artificial' entry that checks the value of the temp to decide where we came from, and then it branches to the appropriate virtual call block within

High-Level Design

Previously, loop-transfer used to redirect control (on the failing path) from the virtual guard in the loop directly to the virtual call block inside the cold version of the loop. This essentially created two entry points to the cold loop leading to the improper region. The failing paths of each virtual guard now store a unique value into a temporary and instead jump to a selector at the entry of the cold version of the loop. The selector is basically a switch statement which redirects control to the appropriate virtual call block in the cold version of the loop.

For example,

Before loop transfer:

             hot loop                                  cold loop (cloned of hot)
           -----------                                -------------------------
          .....                                           ..... 
         if (virtual-guard)                              if (virtual-guard)
(success)   /            \  (fail)                  (fail)  /             \  (success)
           /              \                                /               \
     inlined code      virtual call                  virtual call       inlined code

After loop transfer:

             hot loop                                  cold loop (cloned of hot)
           -----------                                -------------------------
          .....                                           ..... 
         if (virtual-guard)                              if (virtual-guard)
(success)   /            \                          (fail)  /             \  (success)
           /              \                                /               \
     inlined code          ------------------->       virtual call       inlined code

This has now created two entry points to the cold loop leading to an improper region.

New loop-transfer:

                                                     L1: (new entry)
                                                         switch(t)
                                                        default --> goto start 
                                                              1 --> goto L2

             hot loop                                  cold loop (cloned of hot)
           -----------                                -------------------------

          .....                                           ..... 
         if (virtual-guard)                              if (virtual-guard)
(success)   /            \  (fail)                  (fail)  /             \  (success)
           /              \                                /               \
     inlined code      t = 1                    L2:  virtual call       inlined code
                       goto L1

Multiply nested loops pose a certain difficulty for this implementation. Since loop versioner processes loops outer-first, this requires some extra book-keeping to get the flow of control right:

when an inner loop contains a vguard that is going to undergo loop transfer, the target of the hot vguard is changed to branch to the switch created at the head of the outer loop. but the switch now jumps directly into the middle of the inner loop. to prevent this, the switch has to jump to the entry of the inner loop. The entry has another selector to transfer control appropriately. some book-keeping is first required to determine which inner loop the control has to be transferred to. once that is done, the right vguard target to jump to in the inner loop has to be determined. as a result, two temporaries are required - one to hold the value of the loop to jump to & the other to figure out the vguard target within the loop.

                                                 L1: (new cold outer entry)
                                                     switch(t1)
                                                    default --> goto start 
                                                          1 --> goto L2

    hot outer loop                                  cold outer loop
   ---------------                                 ----------------  
                                                 L2: (new cold inner entry)
                                                     switch(t2)
                                                    default --> goto start 
                                                          1 --> goto L3

     hot-inner                                        cold-inner
     ---------                                        ----------

   if (virtual-guard)                              if (virtual-guard)
(success) /             \ (fail)                   (fail) /             \  (success)
     /               \                               /               \
  inlined code       t1 = 1                  L3:  virtual call        inlined code
                     t2 = 1 
                     goto L1

Internals Description

Here is a pseudocode of the algorithm.

VirtualGuardInfo is used to record info about a loop; VirtualGuardPair records the info about virtual guards in the loop.

struct VirtualGuardInfo : public TR_Link<VirtualGuardInfo>
   {
   VirtualGuardInfo(TR_Compilation *c)
      :_virtualGuardPairs(c->trMemory())
      {

      _virtualGuardPairs.deleteAll();
      _coldLoopEntryBlock = NULL;
      _coldLoopInvariantBlock = NULL;
      _loopEntry = NULL;
      }
   List<VirtualGuardPair> _virtualGuardPairs;
   TR_Block *_coldLoopEntryBlock;
   TR_Block *_coldLoopInvariantBlock;
   TR_Block *_loopEntry;
   };

versionNaturalLoop() performs the cloning of the loop. The implementation uses this routine to perform the book-keeping of the virtual guards so that fixup can be done at the end of the opt.

if (lastTree->getOpCode().isIf() &&
    blocksInWhileLoop.find(lastTree->getBranchDestination()->getNode()->getBlock()) &&
          lastTree->isTheVirtualGuardForAGuardedInlinedCall())
         {
         // create the virtual guard info for this loop
         //
         if (!vgInfo)
            {
            vgInfo = new (trStackMemory()) VirtualGuardInfo(comp());
            vgInfo->_loopEntry = blockAtHeadOfLoop;                     // record entry of the hot loop
            _virtualGuardInfo.add(vgInfo);
            }

         VirtualGuardPair *virtualGuardPair = (VirtualGuardPair *) trMemory()->allocateStackMemory(sizeof(VirtualGuardPair));
         virtualGuardPair->_hotGuardBlock = nextBlock;
         virtualGuardPair->_coldGuardBlock = nextClonedBlock;
         virtualGuardPair->_isGuarded = false;

         // check if the virtual guard is in an inner loop
         //
         TR_RegionStructure *parent = nextBlock->getStructureOf()->getParent()->asRegion();
         if ((parent != whileLoop) &&
               (parent && parent->isNaturalLoop()))
            {
            TR_Block *entry = parent->getEntryBlock();
            virtualGuardPair->_coldGuardLoopEntryBlock = entry;         //record entry of the inner loop
            virtualGuardPair->_isInsideInnerLoop = true;
            }
         else
            virtualGuardPair->_isInsideInnerLoop = false;

         vgInfo->_virtualGuardPairs.add(virtualGuardPair);
         ///_virtualGuardPair.add(virtualGuardPair);
         }

The routine performLoopTransfer() performs the actual loop-transfer; in turn calls fixupVirtualGuardTargets() to perform the fixup (creation of switch blocks etc.)

fixupVirtualGuardTargets(): 

for every vgInfo 
   {
   for (each vgPair) 
      { 
      if (vpPair->_isGuarded) 
         {
         if (vgPair->_isInnerLoop)
            record distinctTargets[_coldGuardLoopEntry->getNumber]++;  // used to build the inner switch block
         else
            caseKonst++;
         } 
      }

   create loopTemp symref (for the outer loop)

   numDistinct = 0;
   for (i = 0 to numNodesInCFG)
      if distinctTargets[i] != 0
         invariantBlock = locate the invariant block for _coldGuardLoopEntry
         create vgNumTemp = inner temp for the inner switch block
         store  vgNumTemp = distinctTargets[i] - vgNumTemp  // default value of the inner loop 
         invariantBlock->prepend(vgNumTemp)
         create the inner switchblock at _coldGuardLoopEntry
         numDistinct++

   create the outer switchblock at _coldLoopEntry with caseKonst+numDistinct children
   store the default value of loopTemp in the outer cold loop invariant block

   int32_t childNum = 0;
   for (each vgPair)
      {
      if (vgPair->_isGuarded)
         {
         gotoBlock = create the gotoblock to store values into the temps

         // fixup the targets of the inner switch block         
         if (vgPair->_isInnerLoop)
            {
            loopid = vg->_coldGuardLoopEntry->getNumber
            store loopid in loopTemp
            vgnum = distinctTargets[loopid]--
            fixup case target in the inner switch block child: (switch->getNumChildren-vgnum)
            setcaseconstant->distinctTargets[loopid] 
            setbranchdest->vgPair->_coldGuardBlock->getDest
            store vgnum in vgNumTemp
            gotoBlock->prepend(vgNumTemp)
            // reset the flag in the fail path of the virtual
            // guard in the inner cold loop
            }
         else
            {
            store childNum++ into loopTemp
            }
         gotoBlock->prepend(loopTemp)
         fixup case target in _coldLoopEntry switchblock childNum+1
         }
      }
   }

vijaysun-omr commented 3 years ago

Thanks @0xdaryl.

FYI @gita-omr I think I mentioned this work to you since it came up in the context of vector JEP JIT logs that we were looking at.

a7ehuo commented 3 years ago

I took a look at our current code at a high level and it looks like we have some of the code in place in OMR: TR_LoopVersioner keeps a linked list of VirtualGuardInfo: TR_LoopVersioner::_virtualGuardInfo. TR_LoopVersioner::versionNaturalLoop()adds loop transfer candidates to the linked list. TR_LoopVersioner::performLoopTransfer()performs the transformation, however, there is no implementation of fixupVirtualGuardTargets() to set up the switch table. The current code only has the guard in the hot loop jump to the guard in the cold loop which would still create an improper region.

 hotGuardBlock->changeBranchDestination(coldGuard->getBranchDestination(), cfg);

The current implementation of loop transfer can be disabled by TR_DisableLoopTransfer.

IBMJimmyk commented 1 year ago

This implementation looks like it assumes that there can only be one level of loop nesting. So this can be a problem:

Hot outer loop
Hot middle loop
Hot inner loop
-If virtual guard (A) fails, set flag to A and jump to L1

Cold outer loop switch block (L1)
-flag is set to A which means jump to cold inner loop switch block to be forwarded (L2)
Cold middle loop
Cold inner loop switch block (L2)
-jump to virtual call (cold side version of a failed guard for A)

This sequence will have the Cold outer loop switch block reach inside the Cold middle loop without going through the entry block of the Cold middle loop.

If my understanding is correct, this will still create an improper region and the design needs to be adjusted to support an arbitrary amount of nesting instead of just one layer. This should still be possible, but adds a fair bit of complexity.

IBMJimmyk commented 1 year ago

I finish a first draft version of my code changes. I also wrote up a small test to confirm that the basic case of a single guard inside in a single small loop works. I got the correct result and the generated trees seem to be correct.

I am currently expanding my testing to ensure all expected cases work including having multiple nested loops.

eclipse-openj9 / openj9

Avoid creating improper regions after loop versioner #13243

High-Level Design

Internals Description