Designing the IR - Githubissues

Since the symbols module is soon going to reach a state where it's relatively usable, I believe it's time to think about the next major step: compiling/executing. Since most compilers introduce some intermediate representation to help with multiple things (ease of compilation, easier optimization, ...), I believe this is the way to go.

We should come up with our own intermediate representation that could either be executed in a VM relatively efficiently, or be compiled to backends we want (like x86, WASM, MSIL, maybe even C, ...). This representation should be:

general enough to fit usual cases
low-level enough so that translation doesn't become borderline impossible
well-defined, so we don't build on UB from the start
extensible, so users can adapt it, or build their own abstractions for certain domains

Level of abstraction I believe that we should stick close to the CPUs of today, but abstract away just enough not to make common operations a chore.

For example, on x86 there are different instructions for floating-point and integer arithmetic. Also, adding integers larger than a register means keeping track of the carry with the adc instruction. I believe that if we have a very simple type-system that can uniquely identify what we want, we could have way less instructions.

I believe that we should have few instructions that do elementary operations. For anything fancier, we could add intrinsics or extensions later.

An initial, small set of instructions should be sufficient:

Basic arithmetic: addition, subtraction, multiplication, division, modulo
Bit manipulation: shift-left, shift-right, bitwise-and/or/xor
Comparison
Conditional and unconditional jumping
Procedure calling and returning
Memory indirection: read from and write to a memory address

We will definitely need a few more when planning out the language more, like typecasting (reinterpret/transumte) or getting the pointer for a member in a structure (something like getelementptr in LLVM).

We need to be able to represent everything that can be architecture-dependent as non-hardcoded constants. For example, the size of pointers are platform-dependent, so the IR shouldn't contain hardcoded 4 or 8 values, but something like sizeof(ptr) instead.

Code structure I believe that a flat hierarchy should be enforced: no nested procedures or types. We don't need many levels either:

Assembly: the compilation unit. Contains procedures, external declarations, type definitions, constant data
Procedure: Defines its parameters, return-type and stores its basic blocks
Basic block: A sequence of instructions that don't contain any branching, except for the last instruction. These can be thought of as atomic units of execution.
Instruction

The instructions should be in SSA form, every temporary gets a unique name and only gets assigned once. Mutables could be achieved through memory indirection (with read and write instructions), similarly to LLVM.

Types I don't think we need a very sophisticated type-system. We need to define the built-in primitive types, and allow structs and untagged unions to be defined. All operations/instructions should be typed.

Behavior To avoid UB, we have to be very explicit about defining each operation to something sensible. For example, addition could overflow, but we need define exactly, what should happen in that case. We could either let it wrap around, panic, or do whatever the platform does.

Ideally, we would be strict and simply panic, unless we somehow specify that we want a specific behavior. We could introduce attributes for example, to essentially parameterize instructions: local_3 = add local_2, local_1 [overflow=wrap]

Attributes would also help specifying things, that are required for certain platforms. For example, a common one would be the link name and calling convention for procedures:

extern procedure foo() [link_name = "foo@4", callconv=stdcall]

Extensibility for abstractions I believe it's very important to let the user extend the language with abstractions they need. For example, it would be very painful to implement a dynamically typed language with just strongly typed elements. While I believe that this is a detail that should be specified by the user, we should still help them to be able to abstract it away.

I have thought of 2 ways of extending the IR so far:

Higher level instructions, essentially "macros"
Attributes that act like decorators

For the former, a good example would be virtual calls. Let's say, that a virtual call would consist of these elementary instructions:

local_1 = elementptr my_obj, vtable ; Get the address of the vtable in the object
local_2 = add local_1, proc_index ; Calculate the address where the procedure address is stored
local_3 = read local_2 ; Read out the procedure address
local_4 = transmute local_3, type_of_proc ; Cast it to the proper procedure type
local_5 = call local_4 my_obj, arg1, arg2, ...

When looking at the IR, this is noise. Ideally, the user would just see virtcall my_obj.foo(arg1, arg2, ...). Some kind of macros or instruction templates could solve this.

For the latter, an example would be generators. To convert a procedure into a generator procedure, a lot of analysis and transformation has to be done, so it'd be better, if the user would just receive the entire thing for transformation. Usage example would be:

[generator]
procedure get_123():
    yield 1
    yield 2
    yield 3

Here, the user would receive the model for get_123, so they can transform it however they want.

I believe these two kinds of abstractions would fit many cases and are not too hard to implement.

Extensibility for code passes I think that many domains will come with their own code-generation patterns. While having optimization passes for every single existing situation would be neat, it would also be very much infeasible. Allowing to define custom code-passes would be great.

Static Analysis tools Since we plan on the users writing their custom extensions, manipulating the IR should be as simple as possible. We should provide all the tools we can, to be able to analyze and correctly transform the code.

Here I think about things like providing a Control-Flow graph and related algorithms, custom register allocation and so on. Essentially, we should open up as much of the analysis, as possible. Otherwise, the user would have to reimplement a lot of what optimizations and other transformations do internally.

Code building Building IR code should be supported by some builder utility, that would keep track of the current basic block, and such. For instructions that result in some temporary value, it could also return an object that can be used to reference that value. For example, building a recursive Fibonacci could look something like:

var builder = new IrBuilder();

// Define the fibonacci procedure
// Defining a new procedure implicitly adds a new basic-block too
var fib = builder.DefineProcedure("fib");
// Define a single integer argument
var arg = builder.DefineParameter(i32);
// First, we compare the argument to the constant 2, we care about if it's less
var cmpResult = builder.Cmp(Comparison.Less, arg, i32.NewValue(2));
// Save the block we are at, now we write the 2 branches
var lastBb = builder.CurrentBasicBlock;
// The then branch, when arg < 2
var thenBb = builder.DefineBasicBlock("then");
// Just return constant 1
builder.Ret(i32.NewValue(1));
// Else branch
var elseBb = builder.DefineBasicBlock("else");
// We subtract 1 from the arg and call fib with that
var arg_1 = builder.Sub(arg, i32.NewValue(1));
var fib_1 = builder.Call(fib, new[] { arg_1 });
// We subtract 2 from the arg and call fib with that
var arg_2 = builder.Sub(arg, i32.NewValue(2));
var fib_2 = builder.Call(fib, new[] { arg_2 });
// Add the two call results together and return it
var result = builder.Add(fib_1, fib_2);
builder.Ret(result);
// We go back to finish the branch
builder.CurrentBasicBlock = lastBb;
// Write the branch instruction
builder.JmpIf(cmpResult, thenBb, elseBb);

// Retrieve the built assembly
var assembly = builder.Assembly;

This might not be an ideal API, but looks decent enough in my opinion.

Very much open questions

Do we want a textual representation? It would make debugging easier, but then we might also want to be able to parse in the IR from text.
Do we want a bytecode representation?

Idea: extending the assembly syntax An odd place to start, but if we want a clean, readable syntax, we need to allow a bit more than just a name + arguments in parenthesis syntax. For example, in the initial post I've laid out an example syntax for a virtual call instruction extension: virtcall my_obj.foo(arg1, arg2, ...). This looks way better (and is way more readable) than a function-like macro, for example virtcall(foo, my_obj, arg1, arg2, ...).

Extending a parser can be problematic for efficiency, but I believe that a small compromise wouldn't hurt anyone to get rid of the efficiency problem: we could allow the user to register parsers for custom instruction names. From there on they are allowed to parse from the stream.

An example API usage for the above example:

public class VirtcallInstructionParser : IInstructionParser
{
    // When the user registers this extension, this is the property that tells the parser, on which instruction name to hand over parsing
    public string InstructionName => "virtcall";

    public IInstruction Parse(InstructionTokenStream tokens)
    {
        // When this is called, the instruction name is already consumed, the user only has to parse the things after that
        var obj = tokens.ParseValue(); // Parses a primitive value, like my_obj
        tokens.Expect('.'); // Expect a dot separator and consume
        var funcName = tokens.ExpectIdentifier(); // Expect the name of the called function
        tokens.Expect('('); // Open parens for call
        var args = new List<Value>();
        // Comma separated args
        if (!tokens.Peek(')'))
        {
            args.Add(tokens.ParseValue());
            while (tokens.Matches(',')) args.Add(tokens.ParseValue());
        }
        tokens.Expect(')'); // Close parens for call
        return new VirtcallInstruction(obj, funcName, args);
    }
}

Overloading existing instructions It might be a good idea to allow for overloading instructions with the same verb. For example, the add should work both for integer and floating-point numbers, the backend should deal with the proper handling or translation. This is mentioned in the original issue already.

Another interesting use would be to allow overloading for custom types. A dynamic language would suffer, if it had to generate type-dispatch code each time an operation is performed. Allowing to define custom types and overloading instructions for these would solve this issue.

Note, that I'm not advising that this should be a mechanism for operator-overloading. I believe classical operator overloading requires a lot more sophisticated things than instruction overloading and should be handled by the compiler frontend/middleware. This would simply serve the purpose of making the IR generation for dynamic languages or languages with custom primitives simpler.

Another interesting use would be to define SIMD types for the IR. I believe that IR extensions should be strong enough that implementing SIMD operations on custom-defined SIMD types should be possible.

Okay, here goes my idea (part 1 of many).

Philosophy and design

The following ideas take inspiration in LLVM IR, Cranelift IR and CIL opcodes. The proposed IR language tries to achieve the following objetives, for which some design decisions are proposed:

Simplicity: the language needs to get to the point and stay readable and terse, being easy to understand and easy to work with.
- No alignment. Decisions such as alignment, call conventions, etc. should be left for attributes, at the instruction, block, procedure... scope where it is deemed relevant. The core instructions per se should consider this a target implementation detail, and hence not care about this as part of their main syntax. Hence, in LLVM IR this is a valid instruction:
```
%0 = load i32* %a.addr, align 4
```
  but in Yoakke IR we should strive to keep the load operation simple and loyal to the core concept, like this:
```
%0 = load i32*, addr(a)
```
- Small instruction set. Contrary to the CIL opcode paradigm and following the conceptual model of a CPU, the IR should strive to offer a few powerful and ortogonal instructions that can be combined and extended to implement most (to not say all) operations that can be described in a modern programming language.
- Focus on readability. In situations that are confusing, the syntax should aid the reader. For example, the storage instruction could be defined like this: store i32 -> *i32, where the arrow symbol is explicitly typed as part of the syntax. This makes it clear what value is being stored where.
- No indexation modes. Similar to a RISC CPU, all operations should be defined as instructions to preserve clarity and reduce cognitive load, including loads and stores.
Safety: Yoakke IR should explain the semantics of the code thoroughly, with little tolerance for UB and benefitting from as many static analysis techniques as possible.
- Static and strong typing for every instruction. Just like in LLVM IR, and for the same reasons. No type inference should be provided; types should be always indicated.
- Static instruction resolution based on types. So, there is one and only one instruction exposed for addition (add), and the specific small step semantics are defined based on the types of the arguments and the result. The following instruction variants are all different, but they are exposed as one instruction to the IR generator:
```
i64 = add i32, i32
i128 = add i64, i64
u64 = add u32, u32
```
- Dynamic sanity checks. In a similar way to how the JVM itselfs emits exceptions not present in the original code in "world is broken" situations (e.g. ClassCastException), Yoakke IR's design and semantics should be friendly to techniques that abrupltly abort ("panic") or unwind the current call stack ("exceptions") when the information contained in the bytecode is sufficient to identify error situations.
Portability: the code should be easy to use and valid no matter what is the target machine or the context of the usage (to a reasonable extent).
- We should offer several ways of storing and processing Yoakke IR. It should be able to be represented, as a bare minimum, both as a serializable in-memory data structure (performance-efficient) AND a deserializable bytecode format for persistent storage file systems (storage-efficient).
- Instructions and concepts should not make assumptions about the underlying platform and infrastructure, to a reasonable extent.

Preamble: defining a syntax

I think the first step is defining a syntax that can be used to quickly and effectively share and talk about Yoakke IR. This syntax will be useful for discussions between collaborators and users of Yoakke, for bug reports, for community questions... This answers one of the original questions raised in the original comment: yes, we should have a canonical textual representation for Yoakke IR, and this textual representation should extend to the extensions made by users. This means, for example, that creating a macro would require the user to give it a name and then it would be printed following the same syntax rules that built-in basic instructions use. And, in order to enforce this canonical representation and be able to benefit from it, we should implement an utility capable of emitting this canonical textual representation automatically from at least one (but ideally all) of the execution-compatible forms of Yoakke IR (e.g. in-memory data structure, bytecode...).

This is an example function in C:

int sum(int a, int b) {
  return a + b;
}

and this is a summary of the same function as unoptimized LLVM IR:

define i32 @sum(i32 %a, i32 %b) #0 {
entry:
  %a.addr = alloca i32, align 4
  %b.addr = alloca i32, align 4
  store i32 %a, i32* %a.addr, align 4
  store i32 %b, i32* %b.addr, align 4
  %0 = load i32* %a.addr, align 4
  %1 = load i32* %b.addr, align 4
  %add = add nsw i32 %0, %1
  ret i32 %add
}

attributes #0 = { nounwind ssp uwtable ... }

This is the same function in the proposed Yoakke IR language (also doing the same "useless" operations to be fair and to show syntax):

attributes #0 = [ foo, bar, baz ... ]

procedure @sum(a i32, b i32) -> i32 #0 {
entry(%0 i32, %1 i32):
  store a i32 -> %0 i32*
  store b i32 -> %1 i32*
  va i32 = load %0 i32*
  vb i32 = load %1 i32*
  result i32 = add va i32, vb i32 [ overflow = saturate ]
  ret result i32
}

Symbols (bindings) use no symbols, like for example result. In case a name cannot be found or is not relevant, a symbol can be created with % followed by a number (e.g. %0). The symbol @sum starts with @ because it is not local.

As defined in the original comment of this issue, attributes are added using brackets like this:

procedure @sum(i32 a, i32 b) -> i32 [ foo, bar, baz ... ] {

but in case the user wishes to reuse the same attributes in more than one place, a feature called attribute sets should be implemented that works like in LLVM:

attributes #0 = [ foo, bar, baz ... ]

procedure @sum(i32 a, i32 b) -> i32 #0 {      ; equivalent, #0 expands to a valid list of attributes

A meta-representation of instruction variant signatures

The signature of the different instruction variants used above look like this:

i32 = add i32, i32                 ; example: result i32 = add va i32, vb i32
store i32 -> i32*                  ; example: store i32 a -> i32* %0
i32 = load i32*                    ; example: va = load i32* %0
*i32 = slot type#i32               ; example: %0 = slot i32

Following the convention of CIL opcodes, we could use a dot and a suffix on mnemonics to indicate related instructions. For example, addition with an immediate could reuse the mnemonic add or use add.i:

i32 = add i32, imm#i32                 ; example: result i32 = add va i32, 7 i32
  ; OR
i32 = add.i i32, imm#i32               ; example: result i32 = add.i va i32, 7 i32

IR basics

Type ideas

bN represents a boolean of N bits. For example, b1 and b8 represent a boolean value stored in 1 and 8 bits, respectively. The literals for compatible values are true and false. The equivalence 0 = false and 1 = true should not be assumed.
iN represents signed integer values of N bits. Examples: i8, i16, i32, i64...
uN represents unsigned integer values.
f32 and f64 represent floating point values.
TxM represents an array of M elements of type T. Length probably not stored. Examples: i64x341, u32x4...
simd{TxM} a SIMD vector type. Example: simd{i64x4}
T* represents a pointer to a value of type T. Examples: i32*, f64*...
imm#T represent an immediate of type T. Example: 5 is a imm#i32.
offsetN represents an address offset of N bits. Example: offset32.
type represents the textual representation of one of types (instead of a value) shown above. Example: i32 is a type.

Instruction ideas

nop No operation.
T = load T* Load from memory into register.
store T -> T* Store from register into memory.
iY = add iX, iX Addition
bY = and bX, bX Logical AND
ret, ret T Return from a method
jump ~> offsetN Jump (jump = inconditional branch) to target.
br.false bX ~> offsetN Branch (branch = conditional jump) if value is false.
br.true bX ~> offsetN Branch if value is true.
br.zero iX ~> offsetN, j.zero uX offsetN Branch if value is zero.
br iX == iX ~> offsetN, br uX == uX ~> offsetN Branch if values are equal.
br iX < iX ~> offsetN, br uX < uX ~> offsetN Branch if...
br iX > iX ~> offsetN, br uX > uX ~> offsetN
br iX <= iX ~> offsetN, br uX <= uX ~> offsetN
...
iY = add iX, iX, uY = add uX, uX Addition.
iY = sub iX, iX, uY = sub uX, uX Subtraction.
iY = mul iX, iX, uY = mul uX, uX Multiplication.
iY = div iX, iX, uY = div uX, uX Division.
iY = rem iX, iX, uY = rem uX, uX Remainder.
bY = and bX, bX Logical AND.
bY = or bX, bX Logical OR.
bY = xor bX, bX Logical XOR.
bY = not bX Logical NOT.
T* = slot type#T Reserve a local slot (i.e. stack local variable in x64)
T* = alloc type#T Allocate dynamic memory (i.e. heap)
free T* Free dynamic memory (i.e. heap)
U = conv T, U Convert types by running necessary algorithms on runtime.
U = transmute T, U Reinterpret types statically.
V* = getfieldptr T, U Get field of a compound type as pointer.
V = getfield T, U Get field of a compound type as value (I don't advise it: less granular than pointer version).
T = ldelement TxM uX Get element of array by index.
stelement T -> TxM uX Set element of array by index.
call procedure(T, U...), R = call procedure(T, U...) Call another procedure.
?? Call a native function.
dbg.break Inform a debugger that a breakpoint has been reached.

C# builder API experience

I am not entirely convinced by the API shown in the example, because it requires accessing the fields here:

var cmpResult = builder.Cmp(Comparison.Less, arg, i32.NewValue(2));
var lastBb = builder.CurrentBasicBlock;  // <----

I think we could take inspiration from Microsoft's System.Reflection.Emit API and define a diferent builder for each logic level using the Builder pattern, so the Fibonacci example could be like this instead:

var assembly = new AssemblyBuilder();

var fib = new ProcedureBuilder("fib").WithParameter(i32, out var arg);
var then = new BasicBlockBuilder("then").Ret(i32.NewValue(1)).Build();
var @else = new BasicBlockBuilder("else")
  .Sub(arg, i32.NewValue(1), out var arg1)
  .Call(fib, new[] { arg1 })
  .Sub(arg, i32.NewValue(2), out var arg2)
  .Call(fib, new[] { arg2 })
  .Add(fib_1, fib_2, out var result)
  .Ret(result)
  .Build();
var start = new BasicBlockBuilder("entry").Br(Comparison.Less, arg, i32.NewValue(2), then, @else).Build();  // not compatible with branching instructions defined above, but ignore this detail for now
fib = fib.WithEntryAt(start).Build();
var assembly = assembly.WithProcedure(fib).Build();

Upcoming thoughts

Memory model and slots
Extensibility and "add-ons"
Extern and native functions; FFI; instrinsic functions
Macro attributes

My initial thoughts on the proposal. Overall: very nice, mentions a lot of important aspects. Well done!

Philosophy and design

Simplicity: Completely agree on all points.
Safety: Agree on static and strong typing, but not sure about being explicit about the types on usage site. I think that it's noise, even in LLVM IR.
Portability: Completely valid and valuable points, agreed.

Preamble: defining a syntax

The canonical, automatic representation could be generated from an attribute, I believe that doesn't introduce much noise and it could also drive the text parser. Something like so:
```
[InstructionSyntax("{Target} <- {Source}")]
public record Load(Value Target, Value Source);
```
I partially like the proposed, changed syntax presented in the @sum snippet, but I have a couple of nitpicks. Why would we need to annotate non-locals with @ (or even, why annotate them at all)? For the linker, we could just push that into the attribute list. Why differentiate "bindings" from other locals? They look like they serve the same purpose. Attribute sets are a good idea, nice convenience feature.

A meta-representation of instruction variant signatures

This is great, it gives a nice overview of instructions and what arguments they accept. It's a nice way to skim through the instructions and see what they support. Lovely!
I'm not sure I 100% understand what you mean by "related instructions". Do you mean instructions that have slight variations, like what you did with br.true and br.zero? If so, this mostly feels like a syntactic thing, allowing dots in instruction verbs.

IR basics

Type ideas

Separating integer and boolean types is definitely an interesting idea.
Are we sure we want to model the immediate/literal role in the type itself? I don't see the point there to be honest.

Instruction ideas

I'm not sold on the br variants. I think this is not a nice, orthogonal way. A separate comparison instruction would be more general: it doesn't have to be used for a jump argument, it can also be just a computation result. To me this feels like it was straight ripped out of the x86 jump variants section, a bit too CISC-like.
Are we sure we want to bring in heap allocation? Don't we want to keep that out of the IR and have it as system calls?
Are we sure we even want to bring in stack allocation as an instruction, and not just being part of the procedure definitions? Procedures could just declare what locals they want to allocate.
I dislike the value-variant getfield, feels redundant (and looks like you do too).
ldelement and stelement also feel redundant, I feel like all should be solved with getfieldptr or some common instruction for them.
I don't think I'd want to differentiate native calls. Let them be call.
dbg.break reminds me that we need to think about debug symbols too.
- Some short names are weird and kind of inconsistent for my taste, but that's just a very small nitpick, easy to change.

C# builder API experience

All I have to say is that I like the API design, I'm sold on that. We should make sure that the names we provide are only suggestions for things like labels, I don't think I'd want the user dealing with generating unique label names each time.

Just to get my hands dirty (and because I couldn't really just not work on it), I did a micro-experiment of how we would model types. For a challenge, I tried to model a type, that's a signed integer and can exactly store a pointer type on the given architecture. The construction code - without any builders - looks like so:

var voidPtr = new Types.Ptr(Model.Types.Void.Instance);
var voidPtrSize = new Values.SizeOf(voidPtr);
var intPtr = new Types.Int(voidPtrSize);
Console.WriteLine(intPtr);

This prints i{size_of(*void)}, which means "a signed integer with as many bits as size_of(*void) returns". This makes the model a bit more complex: we can't just have an integer argument as the bits width for the integer type, we need some IConstant instead that represents compile-time constant values.

I'm not yet sure that this complexity is worth it, but might be fine, if we can erase it at a later stage, when we receive platform-specific information. This would mean that we can do platform-independent optimizations while not resolving things like this, but as soon as we desugar things like size_of, more extreme optimizations can take place.

Maybe this also begs for aliases, reading i{N} or i{size_of(...)} can become pretty noisy, so type-aliases should be considered, so we could say type i32 = i{32} for example.

Note, that this makes type-checking a bit too strict, but that could be considered a good thing. i{size_of(*void)} cannot be known exactly until choosing a back-end, so it is only really compatible with the exact same signature, i{size_of(*void)}. Again, we should consider, if we want/need this complexity.

The power of attributes

I believe that a strong attribute system could help us in avoiding a lot of noise in the core IR itself. Attributes could contain debug information, as well as things, like the result of register allocation. It could be taken to the extreme, where the user could literally do manual register-allocation, if they wish so:

v2 = add v0, v1 [register = r0]

Many optimizations could be enforced, like [tail_call], [force_inline], ...

I'm sure there would be many-many uses for the attribute system, so we should make it rich, extensible and easy to read/write in the API.

The one thing we should avoid is having the attribute system modifying the semantics of the IR. Specifying platform-specific behavior should be fine, but attributes should be useful metadata, nothing more. The IR should execute the same on its theoretical machine without the attributes.

LanguageDev / Yoakke

Designing the IR #93

Philosophy and design

Preamble: defining a syntax

A meta-representation of instruction variant signatures

IR basics

Type ideas

Instruction ideas

C# builder API experience

Upcoming thoughts

Philosophy and design

Preamble: defining a syntax

A meta-representation of instruction variant signatures

IR basics

Type ideas

Instruction ideas

C# builder API experience

The power of attributes