eclipse-omr / omr

Eclipse OMR™ Cross platform components for building reliable, high performance language runtimes
http://www.eclipse.org/omr

Clean up generation of load and store sequences for Power #5630

Open aviansie-ben opened 4 years ago

aviansie-ben commented 4 years ago

Currently, the generation of load and store sequences is deeply entangled with TR::MemoryReference and has code copied in numerous different locations. Specifically, we are currently able to generate a memory reference for an arbitrary load/store node from any evaluator and then emit instructions manually for performing the load/store. For instance, loads on Power are implemented like the following (simplified for convenience):

TR::MemoryReference *memRef = TR::MemoryReference::createWithRootLoadOrStore(cg, node, 4);

generateTrg1MemInstruction(cg, TR::InstOpCode::lwz, node, trgReg, memRef);
if (needsSync)
  generateInstruction(cg, TR::InstOpCode::lwsync, node);

memRef->decReferenceCounts(cg);

However, code for emitting loads in particular is copied all over the codegen, since it is common, when a child is a single-reference load, to optimize by changing the opcode used to perform the load (e.g. an ibyteswap of an iload can be done in a single lwbrx instruction). This has resulted in a number of bugs and design issues that have proliferated unchecked throughout the Power codegen.
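For illustration, the kind of copied pattern in question looks roughly like this (a simplified sketch, not the exact OMR code; the exact opcode-folding checks vary between copies):

```cpp
// Hypothetical sketch of the copied pattern: an ibyteswap evaluator peeks at
// a single-reference iload child and builds the memory reference itself,
// re-implementing the load sequence inline.
TR::Node *child = node->getFirstChild();
if (child->getOpCodeValue() == TR::iload && child->getReferenceCount() == 1 && !child->getRegister())
   {
   TR::MemoryReference *memRef = TR::MemoryReference::createWithRootLoadOrStore(cg, child, 4);

   generateTrg1MemInstruction(cg, TR::InstOpCode::lwbrx, node, trgReg, memRef);
   // Bug-prone: each copy must remember to check whether the load is volatile
   // and emit the required lwsync/isync barriers itself -- and some copies don't.
   memRef->decReferenceCounts(cg);
   }
```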

To address these issues, I'd like to propose adding a new extensible class called LoadStoreHandler to the Power codegen that will handle these operations in a safe and controlled manner (this is a slightly simplified version to eliminate some unnecessary implementation details regarding 64-bit loads/stores on 32-bit systems):

class OMR_EXTENSIBLE LoadStoreHandler
   {
   public:
   static void generateLoadNodeSequence(TR::CodeGenerator *cg, TR::Register *trgReg, TR::Node *node, TR::InstOpCode::Mnemonic loadOp, uint32_t length, bool requireIndexForm=false, int64_t extraOffset=0);
   static void generateStoreNodeSequence(TR::CodeGenerator *cg, TR::Register *srcReg, TR::Node *node, TR::InstOpCode::Mnemonic storeOp, uint32_t length, bool requireIndexForm=false, int64_t extraOffset=0);

   static void generateComputeAddressSequence(TR::CodeGenerator *cg, TR::Register *addrReg, TR::Node *node, int64_t extraOffset=0);
   static void generateLoadAddressSequence(TR::CodeGenerator *cg, TR::Register *trgReg, TR::Node *node, TR::Register *addrReg, TR::InstOpCode::Mnemonic loadOp, uint32_t length, bool requireIndexForm=false);
   static void generateStoreAddressSequence(TR::CodeGenerator *cg, TR::Register *srcReg, TR::Node *node, TR::Register *addrReg, TR::InstOpCode::Mnemonic storeOp, uint32_t length, bool requireIndexForm=false);
   };

The idea here is to provide a simple interface for evaluators that is much harder to use incorrectly and hides unnecessary details from the user. For most use cases, a single call to generateLoadNodeSequence or generateStoreNodeSequence encompasses the entire operation of loading/storing with a particular opcode. This makes it impossible to forget to emit memory barriers when required, since barrier emission is hidden from the user entirely.
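Assuming the signatures proposed above, the lwz example from the start of this issue would collapse to a single call, with volatility checks and barrier emission handled inside the handler:

```cpp
// Sketch only: the handler builds the memory reference, emits the lwz, emits
// any required lwsync barrier, and decrements node refcounts internally.
TR::LoadStoreHandler::generateLoadNodeSequence(cg, trgReg, node, TR::InstOpCode::lwz, 4);
```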

For more advanced use cases, generateComputeAddressSequence can be used to compute the effective address of a load/store into a register. This register can later be passed to generateLoadAddressSequence or generateStoreAddressSequence to actually produce the load/store and corresponding barriers. While potentially more error-prone, this is required for cases where special code (e.g. write barrier handling) must run between when the child nodes are evaluated and when the memory operation is actually performed.
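A sketch of this advanced flow, using the proposed names (generateWriteBarrierSequence is a hypothetical placeholder for whatever special code must run between evaluation and the memory operation):

```cpp
// Compute the effective address up front, run the special code, then perform
// the store against the precomputed address; barriers are still emitted by
// the handler, not the evaluator.
TR::Register *addrReg = cg->allocateRegister();
TR::LoadStoreHandler::generateComputeAddressSequence(cg, addrReg, node);

generateWriteBarrierSequence(cg, node, addrReg); // hypothetical placeholder

TR::LoadStoreHandler::generateStoreAddressSequence(cg, srcReg, node, addrReg, TR::InstOpCode::std, 8);
cg->stopUsingRegister(addrReg);
```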

To make it easier for downstream projects to extend these methods without copying the OMR definitions on LoadStoreHandler, the OMR implementations would delegate to methods on another extensible class that encapsulates internal implementation details and isn't meant to be used directly in tree evaluators:

class NodeMemoryReference
    {
    ...
    public:
    TR::MemoryReference *getMemoryReference();
    void decReferenceCounts(TR::CodeGenerator *cg);
    };

class OMR_EXTENSIBLE LoadStoreHandlerImpl
   {
   public:
   static NodeMemoryReference generateMemoryReference(TR::CodeGenerator *cg, TR::Node *node, uint32_t length, bool requireIndexForm=false, int64_t extraOffset=0);

   static void generateLoadSequence(TR::CodeGenerator *cg, TR::Register *trgReg, TR::Node *node, TR::MemoryReference *memRef, TR::InstOpCode::Mnemonic loadOp);
   static void generateStoreSequence(TR::CodeGenerator *cg, TR::Register *srcReg, TR::Node *node, TR::MemoryReference *memRef, TR::InstOpCode::Mnemonic storeOp);
   };

Doing this has a couple of notable benefits. For starters, generation of the sequences for loads and stores is centralized into one function that operates on an arbitrary TR::MemoryReference, without exposing that detail to the evaluators. This allows the address-based and node-based methods to be overridden by a downstream project simultaneously without duplicating code, and allows downstream projects' implementations of these methods to delegate to the OMR versions for simple cases.

Furthermore, introducing NodeMemoryReference as an abstract pairing of a memory reference with the nodes whose registers it uses means that TR::MemoryReference no longer has to keep track of nodes itself.

For now, these functions would all be implemented by delegating to the existing methods on TR::MemoryReference to avoid breaking changes for the time being. However, once an interface like this is in place, significant amounts of code could be moved out of TR::MemoryReference without breaking evaluators.
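Under these assumptions, the node-based entry point might delegate roughly as follows (a sketch; the OMR::Power namespace and exact wiring are illustrative):

```cpp
// Build the memory reference once, emit the load through the shared sequence
// generator, then release the reference counts of the nodes the memory
// reference was holding alive.
void OMR::Power::LoadStoreHandler::generateLoadNodeSequence(
      TR::CodeGenerator *cg, TR::Register *trgReg, TR::Node *node,
      TR::InstOpCode::Mnemonic loadOp, uint32_t length, bool requireIndexForm, int64_t extraOffset)
   {
   NodeMemoryReference nodeMemRef = TR::LoadStoreHandlerImpl::generateMemoryReference(cg, node, length, requireIndexForm, extraOffset);
   TR::LoadStoreHandlerImpl::generateLoadSequence(cg, trgReg, node, nodeMemRef.getMemoryReference(), loadOp);
   nodeMemRef.decReferenceCounts(cg);
   }
```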

fjeremic commented 4 years ago

Could you further explain the need for NodeMemoryReference to exist? It seems like an odd concept to me that doesn't exist elsewhere. What sort of counting or tracking of nodes and memory references needs to be done?

aviansie-ben commented 4 years ago

@fjeremic The reason that nodes need to be tracked here is that some of the nodes' registers may end up being used in the TR::MemoryReference itself, while others may have been evaluated and used prior to generation of the memory reference. The nodes whose registers are used in the memory reference must not have their reference counts decremented until after the use of the memory reference. Take the following tree for instance:

[1] aloadi
[2]  aladd
[3]   aload <temp 1>
[4]   lload <temp 2>

Assuming that node [2] is only used here, if node [3] ends up in rX and node [4] ends up in rY with node [1] evaluating into rZ, we want to generate a single ldx rZ, rX, rY instruction to perform the load. In this case, the returned memory reference will end up referring to rX and rY, so the refcounts for nodes [3] and [4] must not be decremented until after the memory reference is used, while the refcount for node [2] can be decremented immediately. Because of this, we need a way to indicate to the caller what nodes need to have their refcounts decremented after the caller is done using the memory reference.

Previously, we handled this by keeping track of a "base node" and an "index node" in TR::MemoryReference itself, but this makes no sense IMO since the vast majority of memory references do not refer to nodes and so do not need to keep track of this information. By splitting this into a separate data structure which is used only where this information needs to be kept track of, details about how a memory reference was constructed from trees can avoid leaking into TR::MemoryReference at all.
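The lifetime rule being described can be modeled with a small self-contained toy (illustrative names only, not the real OMR types): a NodeMemoryReference pairs the memory reference with the nodes whose registers it uses, so their reference counts are decremented only after the memory reference's last use.

```cpp
#include <vector>

struct Node
   {
   int refcount;
   };

struct MemoryReference
   {
   // In the real codegen this would hold base/index registers, offsets, etc.
   };

class NodeMemoryReference
   {
   MemoryReference _memRef;
   std::vector<Node*> _nodes; // nodes whose registers the memRef uses

   public:
   NodeMemoryReference(std::vector<Node*> nodes) : _nodes(nodes) {}

   MemoryReference *getMemoryReference() { return &_memRef; }

   // Called once the load/store using the memory reference has been emitted.
   void decReferenceCounts()
      {
      for (Node *n : _nodes)
         n->refcount--;
      }
   };

// Models the aloadi/aladd example above: nodes [3] and [4] feed registers
// into the memory reference, so they are tracked; node [2] is decremented
// immediately after the memory reference is built.
bool scenarioDecrementsAfterUse()
   {
   Node aladd{1}, aload{1}, lload{1};

   NodeMemoryReference nmr({&aload, &lload});
   aladd.refcount--; // node [2]: its register is not used by the memRef

   // ... emit ldx using nmr.getMemoryReference() ...
   bool aliveDuringUse = (aload.refcount == 1 && lload.refcount == 1);

   nmr.decReferenceCounts(); // only now do [3] and [4] die
   return aliveDuringUse && aload.refcount == 0 && lload.refcount == 0;
   }
```

This is only a model of the ownership rule; the real class would also carry the TR::CodeGenerator through decReferenceCounts as in the proposed signature.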

fjeremic commented 4 years ago

@aviansie-ben check ArtificiallyInflateReferenceCountWhenNecessary on Z. This is handled transparently [1] by artificially incrementing the reference count of nodes which would die otherwise if decremented by evaluators before the use of the memory reference. The data is kept on a stack during evaluation. It avoids the user having to do anything and they don't have to worry about node reference counts when constructing memory references.

[1] https://github.com/eclipse/omr/blob/9ff4de0d50b9b13c3c4d6c69781249b9bcc8b392/compiler/z/codegen/OMRMemoryReference.cpp#L1203-L1287

aviansie-ben commented 4 years ago

@fjeremic I see how that can resolve the problem, but I'm not sure I like that as a solution. Artificially increasing the reference count and then scheduling it to be decremented later seems like a bit of a hack to me and requires quite a bit of additional bookkeeping, although I have to admit it may be better from the evaluator's point of view to not have to worry about stuff like this.

That being said, I don't think the benefits of a solution like that really apply once an interface like LoadStoreHandler is put into place. The whole point of this new interface is to make it unnecessary for evaluators to construct a TR::MemoryReference referring to the address of a load/store node themselves in the first place, as doing so is inherently dangerous due to the aforementioned issues with volatile variable accesses. The NodeMemoryReference class that's being added here is completely internal to LoadStoreHandler/LoadStoreHandlerImpl, so the evaluators already don't have to worry about the node reference counts themselves.

fjeremic commented 4 years ago

> make it unnecessary for evaluators to construct a TR::MemoryReference referring to the address of a load/store node themselves in the first place

Will we be enforcing this by removing such constructors/factory methods on TR::MemoryReference to make sure the user isn't even able to call the APIs which construct memory references by passing in a root load or store?

aviansie-ben commented 4 years ago

> Will we be enforcing this by removing such constructors/factory methods on TR::MemoryReference to make sure the user isn't even able to call the APIs which construct memory references by passing in a root load or store?

Sorry, I should have been more clear about this in my original proposal. The eventual goal is to remove these APIs on MemoryReference and in doing so both prevent misuse in evaluators and completely decouple MemoryReference from tree evaluation. The initial PR will not do this and will instead just have the generateMemoryReference function call the old APIs, but all of this code would be moved to be internal to LoadStoreHandlerImpl once OpenJ9 is moved to the new API.

fjeremic commented 4 years ago

Awesome. I support your efforts then! I'll take a look once the dust settles on your eventual PR and if all looks well we can migrate the Z codegen to a similar solution as I do agree the artificial mucking with reference counts is not the greatest solution.

aviansie-ben commented 4 years ago

@gita-omr @zl-wang Just a quick FYI on this, since it's aiming to fix and prevent a family of volatility-related bugs I've found in the last week or so, where the lwsync/isync barrier after a volatile load is missing when certain codegen optimizations are triggered. Specifically, these are locations like [1] where a load underneath another operation is optimized without checking whether the symref of the load is volatile. I can confirm that it's possible to trigger this bug under certain conditions, though it can be difficult to get the trees to look exactly right for the bad optimizations to fire (the volatile load must swing down and be referenced only in the one problematic tree), so this could be causing some difficult-to-reproduce race condition bugs.

[1] https://github.com/eclipse/omr/blob/4d32dfda3c999f4e1c27b7dcca73ab27c1989a0f/compiler/p/codegen/UnaryEvaluator.cpp#L520-L530

gita-omr commented 4 years ago

@aviansie-ben could you please add @zl-wang and myself as reviewers? We would certainly like to take a look and discuss.

aviansie-ben commented 4 years ago

@gita-omr There is no PR associated with this at the moment. This is simply an issue for discussing high-level design of the new API for performing these sorts of optimizations in a safe manner. I hope to have a WIP PR open by the end of the day on Monday, and I'll certainly add you as reviewers on that PR once it's open.

aviansie-ben commented 4 years ago

I've now opened #5652 as a WIP PR for the OMR side of this work. These changes still need more testing and there are still many areas in OpenJ9 that need to be updated to use the new API.

gita-omr commented 4 years ago

@aviansie-ben would it be possible to add examples of how the new interface would be used in (1) the basic case (2) the case in https://github.com/eclipse/omr/issues/5630#issuecomment-719061478

gita-omr commented 4 years ago

Nvm, found examples in the PR.