Open 0xdaryl opened 3 years ago
Thanks @0xdaryl.
FYI @gita-omr I think I mentioned this work to you since it came up in the context of vector JEP JIT logs that we were looking at.
I took a look at our current code at a high level and it looks like we have some of the code in place in OMR: TR_LoopVersioner
keeps a linked list of VirtualGuardInfo: TR_LoopVersioner::_virtualGuardInfo
. TR_LoopVersioner::versionNaturalLoop()
adds loop transfer candidates to the linked list. TR_LoopVersioner::performLoopTransfer()
performs the transformation, however, there is no implementation of fixupVirtualGuardTargets()
to set up the switch table. The current code only has the guard in the hot loop jump to the guard in the cold loop which would still create an improper region.
hotGuardBlock->changeBranchDestination(coldGuard->getBranchDestination(), cfg);
The current implementation of loop transfer can be disabled by TR_DisableLoopTransfer
.
This implementation looks like it assumes that there can only be one level of loop nesting. So this can be a problem:
Hot outer loop
Hot middle loop
Hot inner loop
-If virtual guard (A) fails, set flag to A and jump to L1
Cold outer loop switch block (L1)
-flag is set to A which means jump to cold inner loop switch block to be forwarded (L2)
Cold middle loop
Cold inner loop switch block (L2)
-jump to virtual call (cold side version of a failed guard for A)
This sequence will have the Cold outer loop switch block reach inside the Cold middle loop without going through the entry block of the Cold middle loop.
If my understanding is correct, this will still create an improper region and the design needs to be adjusted to support an arbitrary amount of nesting instead of just one layer. This should still be possible, but adds a fair bit of complexity.
I finish a first draft version of my code changes. I also wrote up a small test to confirm that the basic case of a single guard inside in a single small loop works. I got the correct result and the generated trees seem to be correct.
I am currently expanding my testing to ensure all expected cases work including having multiple nested loops.
The following design was implemented in loop versioner many years ago, but never enabled. There are situations where the presence of improper regions is inhibiting some other optimizations (such as global value propagation) from finding opportunities. This design should prevent that.
The goal of this issue is to find, modernize, test, and enable this logic to ensure loop versioner does not create improper regions.
@a7ehuo @vijaysun-omr
Sometimes, loop versioner performs 'loop transfer', where control flow is changed such that if a virtual guard fails in the 'fast' version of the loop, the virtual call in the 'slow' version will be reached. This this has the benefit that the 'fast' version of the loop does not contain the virtual call anymore, meaning more aggressive optimization is possible (more expressions are loop invariant for example). However this transformation almost always leads to an improper region being created, since control is transferred to a point in the 'slow' loop that is not the entry of the region (the virtual call block is a successor of the virtual guard block, which is also in the 'fast' loop).
One solution for this problem would be to create an 'artificial' entry point in the 'slow' version of the loop, where all the failing virtual guards branch to. All the failing virtual guards store a different value into a temp on the failing path, and there is a switch in the 'artificial' entry that checks the value of the temp to decide where we came from, and then it branches to the appropriate virtual call block within
High-Level Design
Previously, loop-transfer used to redirect control (on the failing path) from the virtual guard in the loop directly to the virtual call block inside the cold version of the loop. This essentially created two entry points to the cold loop leading to the improper region. The failing paths of each virtual guard now store a unique value into a temporary and instead jump to a selector at the entry of the cold version of the loop. The selector is basically a switch statement which redirects control to the appropriate virtual call block in the cold version of the loop.
For example,
Before loop transfer:
After loop transfer:
This has now created two entry points to the cold loop leading to an improper region.
New loop-transfer:
Multiply nested loops pose a certain difficulty for this implementation. Since loop versioner processes loops outer-first, this requires some extra book-keeping to get the flow of control right:
when an inner loop contains a vguard that is going to undergo loop transfer, the target of the hot vguard is changed to branch to the switch created at the head of the outer loop. but the switch now jumps directly into the middle of the inner loop. to prevent this, the switch has to jump to the entry of the inner loop. The entry has another selector to transfer control appropriately. some book-keeping is first required to determine which inner loop the control has to be transferred to. once that is done, the right vguard target to jump to in the inner loop has to be determined. as a result, two temporaries are required - one to hold the value of the loop to jump to & the other to figure out the vguard target within the loop.
Internals Description
Here is a pseudocode of the algorithm.
VirtualGuardInfo
is used to record info about a loop;VirtualGuardPair
records the info about virtual guards in the loop.versionNaturalLoop()
performs the cloning of the loop. The implementation uses this routine to perform the book-keeping of the virtual guards so that fixup can be done at the end of the opt.The routine
performLoopTransfer()
performs the actual loop-transfer; in turn callsfixupVirtualGuardTargets()
to perform the fixup (creation of switch blocks etc.)