bitwalker opened this issue 4 years ago

hansihe commented:

> I'm in favor of adding some additional operations, and possibly defining some BIF-like functionality as intrinsics. Hans disagrees.
I said that I disagreed to a certain degree, and I think the discussion would have been a bit more productive if you had waited for at least the base of my reasoning.
I am not against having intrinsics in general. True intrinsics are already used for a couple of things in the IR, namely representing the semantics needed for receive blocks.
I mainly took issue with the way you wanted to handle guard functions, and more generally how you wanted to handle some BIFs as intrinsics.
I think it's useful to define a couple of different cases:
- `erlang:byte_size/1` - Can quite conceivably be handled as an intrinsic by an optimization pass, but doesn't need to be special-cased in any way. Simply having it be a normal function call to a `CaptureFunction` PrimOp will enable other optimization passes to transparently propagate constants to its MFA, making it potentially eventually end up as a constant `CaptureFunction`. It can then be trivially handled as an intrinsic in relevant optimization passes, while still maintaining the exact same format in the cases where it is called dynamically (see the sketch after this list).
- `erlang:</2` and the other compare BIFs - These have PrimOps that represent their semantics, and using those would give us free deduplication and potentially make them get scheduled in a more optimal spot.
- `erlang:apply` - Whenever we resolve a static call to this, it should be promoted into either a direct function `Call`, or into a function `Call` to a `CaptureFunction`.
- `receive_*` - used for representing the semantics of `receive` blocks. These exist and are used today.
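To make the first case concrete, here is a rough sketch of what constant propagation could do to a `byte_size` capture. The syntax is invented purely for illustration and is not actual EIR:

```
; once constants have been propagated to the MFA, the capture is constant
%m   = const 'erlang'
%f   = const 'byte_size'
%fun = capture_function(%m, %f, 1)  ; CaptureFunction PrimOp, now fully constant
%sz  = call %fun(%bin)              ; a pass that recognizes the constant
                                    ; erlang:byte_size/1 capture can treat this
                                    ; call as an intrinsic; a dynamic capture
                                    ; keeps the exact same form
```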
> - `%val = extractelement(%term, %index_or_key)`, a general aggregate access operation
>   - on tuples, it unpacks a specific element at the provided zero-based index
>   - on cons cells, it unpacks the head via index 0, and the tail via index 1
>   - on maps, it unpacks the value with the given key
>   - alternatively, these could each be made into different operations to allow for better analysis:
>     - `%val = OP(%cons) # where OP = car|cdr|head|tail`
>     - `%val = extractvalue(%map, %key)`
> - `updateelement(%term, %index_or_key, %val)`, an in-place aggregate update operation
>   - again, could also break this up into multiple typed operations
>   - this is explicitly to enable optimizations which identify places where in-place mutation of a data structure is safe
Both of these generally sound like a good idea to me, and we will need to have some variant of these eventually.
> - `is_type<T>(%term, ^yes, ^no)` and `is_type<T>(%term1, .., %termN, ^yes, ^no)`
>   - branches to `^yes` if `%term` (or all terms) are of type `T`, `^no` otherwise
>   - can be lowered to a runtime call, bit masking/shifting, or elided in favor of a direct branch if analysis can prove that `%term` (or any term) is or isn't `T` statically
This is quite trivially represented by a `match` with one branch matching on the type, and a fallback wildcard branch. I don't really see the value in having a very specific operation like this when the exact same semantics can be trivially expressed with a strictly more powerful operation.

It could be argued that it is desirable to have a variant of `match` that only checks a single condition instead of a list of them, but I don't see any reason to restrict it to type dispatch only.
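For reference, the equivalence I have in mind looks roughly like this (invented syntax, not actual EIR):

```
; the proposed special-purpose operation...
is_type<fixnum>(%a, ^yes, ^no)

; ...is already expressible as a two-branch match:
match %a {
  fixnum => ^yes()
  _      => ^no()
}
```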
> - `br ^block(%args..)`
>   - we currently unify branches with function calls, but I think it would make sense to define a branch operation and lower control-flow calls to explicit branches once we've resolved what is and isn't a "real" call. It will make for easier writing of analysis and optimization passes to boot
This is not really true. The `call` operation has two variants:

- `ControlFlow` - This is required to be a call to a local block. It cannot create a stack frame. This is what would be the branch operation you describe.
- `Function` - This is required to create a new stack frame. If the call is to a block, the only way control flow can escape from the scope of that block is through a call to one of the function's escape continuations. This makes it illegal for any of the outer function's escape continuations to be accessed from within the inner function.

These semantics are preserved in all stages of the IR.
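A rough illustration of the two variants (invented syntax):

```
; ControlFlow: a call to a local block; no stack frame is created,
; so this already behaves like the proposed `br`
call ^local_block(%x)

; Function: a call that creates a new stack frame; control can only
; leave through the function's escape continuations (here, hypothetical
; return and throw continuations)
call @some_fun(%x, ^on_return, ^on_throw)
```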
I agree that having some way to represent this would be really beneficial, and I have a couple of ideas I would like to try out.
---

bitwalker's original issue:

I decided to open up our discussion here @hansihe, since Slack is a bit awkward for long-form writing.
Background
Hans and I had started talking about what kinds of operations (aka instructions) and/or intrinsics EIR should have, and whether new ones are necessary/desirable. In particular, I'd like EIR to be expressive enough that backends can generate efficient code from it; and more generally, to ensure EIR is granular enough that passes have plenty to work with for optimizations.
I'm in favor of adding some additional operations, and possibly defining some BIF-like functionality as intrinsics. Hans disagrees.
Operations vs Intrinsics
One thing I want to clarify is how, and why, I'm distinguishing between intrinsics and operations (i.e. instructions) in the IR.
Intrinsics
Intrinsics are, in essence, opaque function calls: the IR assigns them no semantics of their own, and they only acquire meaning once lowered.

The opaque nature of intrinsics is also their weakness, as they cannot take advantage of any optimizations until they have been lowered, which may occur quite late in the pipeline. The result in such cases is either subpar code generation, or requiring duplicate runs of early passes.
Generally speaking, intrinsics are best used as hints, or to invoke target-specific behavior. I think in this regard, the approach to intrinsics in EIR is on point, as they are entirely opaque.
Examples

A few examples from LLVM of things which are implemented as intrinsics rather than operations:

- `llvm.prefetch`, hints to the code generator that some data should be prefetched, if supported by the target. This is purely a hint, and does not change the semantics of the program.
- `llvm.memcpy`, informs the code generator that a memcpy should be performed. Rather than inserting a call to `memcpy` in `libc`, LLVM instead allows the backend to choose how to lower the intrinsic (which in theory could be a runtime call, but in practice is done by generating more precise code directly). Semantically, this is just a function call, but not necessarily lowered as one.
- `llvm.sadd.with.overflow.*`, informs the code generator that a signed addition with overflow should be performed. The precise lowering of this intrinsic is dependent on the target, using hardware support where present, or emulation as a fallback. Like `llvm.memcpy`, this is semantically a function call.

Operations
Conversely, operations carry semantics that are defined by, and visible to, the IR itself.
All tiers of the compiler operating on the IR must respect the semantics of each operation. In many cases, this happens explicitly, by handling each operation individually. In others, it happens implicitly. For example, writing a compiler pass for an optimization may only directly operate on a subset of operations, but must still preserve the semantics of all operations. This is the weakness of operations - they incur additional maintenance burden and cognitive load.
However, because of their nature, operations are what give an IR its expressive power. The granularity of IR operations directly impacts the kind of optimizations that can be performed - after all, how do you optimize a more general operation to one of higher specificity, if you have nothing but general operations?
So generally speaking, operations are best used as building blocks of varying generality, for things which have a direct impact on the program semantics/behavior.
Examples

Again from LLVM, some examples of operations (instructions) which illustrate the difference with intrinsics:

- `call`, a function calling primitive
- `br`, a basic if/else control flow primitive
- `switch`, a generalization of `br` which allows branching to multiple different locations
- `unreachable`, acts as a hint for optimization, but its semantics are "undefined behavior". What is notable here is that this operation is fully general, and combined with its prominent role in optimization, it makes sense to have it as an operation rather than as an intrinsic
- `add`/`sub`, some fundamental arithmetic operations. It would be hard to imagine these as intrinsics, since that would preclude constant propagation in many obvious cases, as well as other optimizations that take advantage of the semantics of these operations to combine them, rearrange them, etc.
- `extractelement`, an example of aggregate access which permits precise access to an element without copying/moving the value
- `alloca`, indicates that some memory should be dynamically allocated on the stack, rather than the heap
- `load`/`store`, primitive memory read/write operations

These operations range from the more general to the more primitive. LLVM IR overall is quite low-level, so many of these operations wouldn't make sense at the level of EIR, but that's beside the point; the reference to LLVM IR is just to illustrate how the distinction between operations and intrinsics is handled in a very broadly used IR.
Problem
So I've laid out the difference between operations and intrinsics, but I haven't really explained why I'm bringing it up to begin with. To illustrate, I think it would help to look in detail at an issue I raised about one area of the IR:
When lowering `Match`, and dealing with a branch with a `MatchKind` of `Tuple(arity)`, `ListCell`, or `MapItem`, the semantics indicate that one of the following occurs on a successful match:

- `Tuple(arity)` requires that all `arity` elements in the tuple be "unpacked" or loaded (or at the very least, that pointers to the elements be calculated)
- `ListCell` requires that the head and tail of the cell be unpacked
- `MapItem` requires that a given key and value are unpacked

The unpacked values are then given as block arguments to the successor block of the branch. However, some or all of the values may not be live in the successor block at all, so we'd like to avoid generating those accesses at all.
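Sketched out (invented syntax), the problem looks like this:

```
; matching a 3-tuple unpacks all three elements and passes them
; to the successor block, even when only one of them is live there
match %t {
  Tuple(3)(%e0, %e1, %e2) => ^ok(%e0, %e1, %e2)
  _                       => ^fail()
}
; ^ok only uses %e0, but the accesses for %e1 and %e2 were still generated
```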
Unfortunately, this is where we hit a limitation of the current IR. We can't eliminate the unnecessary accesses in EIR at all, because there are no fine-grained operations for accessing just one element of a tuple, or just the head/tail of a cons cell, etc. Instead, we have to lower EIR to another IR that does have those operations, then write a pass which eliminates the dead operations, and updates the argument list of the successor (as well as any other branches to that block). That is really not desirable - if EIR is to be a foundation on which all kinds of analyses and optimizations are based, it needs to be able to handle this kind of optimization.
To be clear, I'm not saying EIR must support these operations; a backend can handle these things on its own. But without more precision, EIR becomes significantly less useful for analysis and optimization, and it ends up making more sense to lower out of EIR as soon as possible and do all the optimization work in a more extensive IR, to avoid unnecessary duplication of effort. We're all better off if we can do more in EIR, and only lower out of it once most optimization has already been performed.
Proposal
I'd like to suggest the following additions to the IR. Note that I'm using a more SSA-like syntax here, but that is just to keep things readable.
Key:

- `%name`, a virtual register/block argument/SSA-like value
- `T`, a statically known type
- `^block`, a block name

Operations:
- `%val = extractelement(%term, %index_or_key)`, a general aggregate access operation
  - on tuples, it unpacks a specific element at the provided zero-based index
  - on cons cells, it unpacks the head via index 0, and the tail via index 1
  - on maps, it unpacks the value with the given key
  - alternatively, these could each be made into different operations to allow for better analysis:
    - `%val = OP(%cons) # where OP = car|cdr|head|tail`
    - `%val = extractvalue(%map, %key)`
- `updateelement(%term, %index_or_key, %val)`, an in-place aggregate update operation
  - again, could also break this up into multiple typed operations
  - this is explicitly to enable optimizations which identify places where in-place mutation of a data structure is safe
- `is_type<T>(%term, ^yes, ^no)` and `is_type<T>(%term1, .., %termN, ^yes, ^no)`
  - branches to `^yes` if `%term` (or all terms) are of type `T`, `^no` otherwise
  - can be lowered to a runtime call, bit masking/shifting, or elided in favor of a direct branch if analysis can prove that `%term` (or any term) is or isn't `T` statically
- `br ^block(%args..)`
  - we currently unify branches with function calls, but I think it would make sense to define a branch operation and lower control-flow calls to explicit branches once we've resolved what is and isn't a "real" call. It will make for easier writing of analysis and optimization passes to boot
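To tie this back to the `Match` example from the Problem section: with these operations, the unpacking can be expressed, and dead parts of it eliminated, directly in EIR. A sketch with invented syntax:

```
; after pattern compilation, with only %e0 live in ^ok
%e0 = extractelement(%t, 0)
%e1 = extractelement(%t, 1)   ; no use in ^ok: ordinary DCE can delete this
%e2 = extractelement(%t, 2)   ; likewise
br ^ok(%e0)
```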
From my perspective, the above are pretty essential, and likely don't cover the full range of things that would be helpful to have in the IR, but certainly hit the major pain points. Feel free to bikeshed the naming/etc., but the semantics are really the important part.
Things this enables:
- `is_type<fixnum>(%a, %b, ^fast_math, ^slow_math)`
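For instance, the multi-term form above gives a cheap guard for type-specialized arithmetic. The block names are hypothetical, but the idea is:

```
; both operands are fixnums: take the fast path
is_type<fixnum>(%a, %b, ^fast_math, ^slow_math)
; ^fast_math could lower to inline tagged-integer arithmetic, while
; ^slow_math falls back to a runtime call that also handles bignums
```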
Problems this introduces:

- Some operations are only present during certain stages of the pipeline (e.g. `Case` is not present after pattern compilation). This would potentially introduce additional cases of this, as some operations would only be present after optimizations (`br`, `updateelement`).

A Note On Dialects
One thing that I think may alleviate the above is leaning into the concept of "dialects", which kinda/sorta exists in EIR, but not really. MLIR is a fantastic example of how dialects can work in practice, and I would suggest taking a look if you haven't already.
In short though, you start with a high-level dialect, and then lower through progressively more primitive dialects as operations in the high-level dialect get converted to their constituent operations in lower-level dialects. However, multiple dialects can be in play at the same time, so it isn't your typical compiler pipeline that lowers from A -> B -> C, it is more like A -> AB -> ABC -> BC -> C.
To that point, each dialect can define partial or full conversions to another dialect, and in the case of MLIR, it will handle transitive conversions as long as there is a path from dialect A to dialect Z. For example, if I define an EIR dialect and a conversion to the MLIR Standard dialect, I can convert directly from EIR to the LLVM IR dialect automatically, since the Standard dialect defines a conversion to the LLVM IR dialect. Passes can perform partial lowerings of operations from one dialect to another as part of performing an optimization, which matches quite well with some of the patterns I mentioned earlier.

The only catch is that I'm not sure how best to design a dialect system in Rust. MLIR has the flexibility of C++ to work with, while Rust is much more strict (thankfully). That said, I think the most important pieces would fit naturally in the trait system, so it may be worth modeling the skeleton of MLIR's hierarchy in Rust to see where it falls apart, rather than discounting it completely.
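To make that concrete, here is a minimal sketch of how the core of an MLIR-style dialect/conversion system might map onto Rust traits. All of these names are hypothetical; nothing like this exists in EIR today, and it glosses over the hard parts (operation typing, in-place rewriting, legality checking):

```rust
/// Hypothetical sketch; no such types exist in EIR today.
/// An operation tagged with the dialect that defines it.
struct Op {
    dialect: &'static str,
    name: &'static str,
    // operands, results, and attributes elided for brevity
}

/// A partial lowering from one dialect to another. Operations the
/// conversion doesn't understand are returned untouched (Err), so
/// several dialects can coexist in a function mid-pipeline, which is
/// the A -> AB -> ABC -> BC -> C shape described above.
trait Conversion {
    fn source(&self) -> &'static str;
    fn target(&self) -> &'static str;
    fn convert(&self, op: Op) -> Result<Vec<Op>, Op>;
}

/// Apply one conversion over a block of operations; chaining calls to
/// this for each edge in a dialect path gives transitive lowering.
fn lower(ops: Vec<Op>, conv: &dyn Conversion) -> Vec<Op> {
    ops.into_iter()
        .flat_map(|op| match conv.convert(op) {
            Ok(lowered) => lowered,          // rewritten into the target dialect
            Err(original) => vec![original], // left for a later conversion to handle
        })
        .collect()
}
```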