LanguageDev / Yoakke

A collection of libraries for implementing compilers in .NET.
Apache License 2.0
141 stars 8 forks source link

Designing the IR #93

Open LPeter1997 opened 2 years ago

LPeter1997 commented 2 years ago

Since the symbols module is soon going to reach a state where it's relatively usable, I believe it's time to think about the next major step: compiling/executing. Since most compilers introduce some intermediate representation to help with multiple things (ease of compilation, easier optimization, ...), I believe this is the way to go.

We should come up with our own intermediate representation that could either be executed in a VM relatively efficiently, or be compiled to backends we want (like x86, WASM, MSIL, maybe even C, ...). This representation should be:

Level of abstraction I believe that we should stick close to the CPUs of today, but abstract away just enough not to make common operations a chore.

For example, on x86 there are different instructions for floating-point and integer arithmetic. Also, adding integers larger than a register means keeping track of the carry with the adc instruction. I believe that if we have a very simple type-system that can uniquely identify what we want, we could have way less instructions.

I believe that we should have few instructions that do elementary operations. For anything fancier, we could add intrinsics or extensions later.

An initial, small set of instructions should be sufficient:

We will definitely need a few more when planning out the language more, like typecasting (reinterpret/transumte) or getting the pointer for a member in a structure (something like getelementptr in LLVM).

We need to be able to represent everything that can be architecture-dependent as non-hardcoded constants. For example, the size of pointers are platform-dependent, so the IR shouldn't contain hardcoded 4 or 8 values, but something like sizeof(ptr) instead.

Code structure I believe that a flat hierarchy should be enforced: no nested procedures or types. We don't need many levels either:

The instructions should be in SSA form, every temporary gets a unique name and only gets assigned once. Mutables could be achieved through memory indirection (with read and write instructions), similarly to LLVM.

Types I don't think we need a very sophisticated type-system. We need to define the built-in primitive types, and allow structs and untagged unions to be defined. All operations/instructions should be typed.

Behavior To avoid UB, we have to be very explicit about defining each operation to something sensible. For example, addition could overflow, but we need define exactly, what should happen in that case. We could either let it wrap around, panic, or do whatever the platform does.

Ideally, we would be strict and simply panic, unless we somehow specify that we want a specific behavior. We could introduce attributes for example, to essentially parameterize instructions: local_3 = add local_2, local_1 [overflow=wrap]

Attributes would also help specifying things, that are required for certain platforms. For example, a common one would be the link name and calling convention for procedures:

extern procedure foo() [link_name = "foo@4", callconv=stdcall]

Extensibility for abstractions I believe it's very important to let the user extend the language with abstractions they need. For example, it would be very painful to implement a dynamically typed language with just strongly typed elements. While I believe that this is a detail that should be specified by the user, we should still help them to be able to abstract it away.

I have thought of 2 ways of extending the IR so far:

For the former, a good example would be virtual calls. Let's say, that a virtual call would consist of these elementary instructions:

local_1 = elementptr my_obj, vtable ; Get the address of the vtable in the object
local_2 = add local_1, proc_index ; Calculate the address where the procedure address is stored
local_3 = read local_2 ; Read out the procedure address
local_4 = transmute local_3, type_of_proc ; Cast it to the proper procedure type
local_5 = call local_4 my_obj, arg1, arg2, ...

When looking at the IR, this is noise. Ideally, the user would just see virtcall my_obj.foo(arg1, arg2, ...). Some kind of macros or instruction templates could solve this.

For the latter, an example would be generators. To convert a procedure into a generator procedure, a lot of analysis and transformation has to be done, so it'd be better, if the user would just receive the entire thing for transformation. Usage example would be:

[generator]
procedure get_123():
    yield 1
    yield 2
    yield 3

Here, the user would receive the model for get_123, so they can transform it however they want.

I believe these two kinds of abstractions would fit many cases and are not too hard to implement.

Extensibility for code passes I think that many domains will come with their own code-generation patterns. While having optimization passes for every single existing situation would be neat, it would also be very much infeasible. Allowing to define custom code-passes would be great.

Static Analysis tools Since we plan on the users writing their custom extensions, manipulating the IR should be as simple as possible. We should provide all the tools we can, to be able to analyze and correctly transform the code.

Here I think about things like providing a Control-Flow graph and related algorithms, custom register allocation and so on. Essentially, we should open up as much of the analysis, as possible. Otherwise, the user would have to reimplement a lot of what optimizations and other transformations do internally.

Code building Building IR code should be supported by some builder utility, that would keep track of the current basic block, and such. For instructions that result in some temporary value, it could also return an object that can be used to reference that value. For example, building a recursive Fibonacci could look something like:

var builder = new IrBuilder();

// Define the fibonacci procedure
// Defining a new procedure implicitly adds a new basic-block too
var fib = builder.DefineProcedure("fib");
// Define a single integer argument
var arg = builder.DefineParameter(i32);
// First, we compare the argument to the constant 2, we care about if it's less
var cmpResult = builder.Cmp(Comparison.Less, arg, i32.NewValue(2));
// Save the block we are at, now we write the 2 branches
var lastBb = builder.CurrentBasicBlock;
// The then branch, when arg < 2
var thenBb = builder.DefineBasicBlock("then");
// Just return constant 1
builder.Ret(i32.NewValue(1));
// Else branch
var elseBb = builder.DefineBasicBlock("else");
// We subtract 1 from the arg and call fib with that
var arg_1 = builder.Sub(arg, i32.NewValue(1));
var fib_1 = builder.Call(fib, new[] { arg_1 });
// We subtract 2 from the arg and call fib with that
var arg_2 = builder.Sub(arg, i32.NewValue(2));
var fib_2 = builder.Call(fib, new[] { arg_2 });
// Add the two call results together and return it
var result = builder.Add(fib_1, fib_2);
builder.Ret(result);
// We go back to finish the branch
builder.CurrentBasicBlock = lastBb;
// Write the branch instruction
builder.JmpIf(cmpResult, thenBb, elseBb);

// Retrieve the built assembly
var assembly = builder.Assembly;

This might not be an ideal API, but looks decent enough in my opinion.

Very much open questions

LPeter1997 commented 2 years ago

Idea: extending the assembly syntax An odd place to start, but if we want a clean, readable syntax, we need to allow a bit more than just a name + arguments in parenthesis syntax. For example, in the initial post I've laid out an example syntax for a virtual call instruction extension: virtcall my_obj.foo(arg1, arg2, ...). This looks way better (and is way more readable) than a function-like macro, for example virtcall(foo, my_obj, arg1, arg2, ...).

Extending a parser can be problematic for efficiency, but I believe that a small compromise wouldn't hurt anyone to get rid of the efficiency problem: we could allow the user to register parsers for custom instruction names. From there on they are allowed to parse from the stream.

An example API usage for the above example:

public class VirtcallInstructionParser : IInstructionParser
{
    // When the user registers this extension, this is the property that tells the parser, on which instruction name to hand over parsing
    public string InstructionName => "virtcall";

    public IInstruction Parse(InstructionTokenStream tokens)
    {
        // When this is called, the instruction name is already consumed, the user only has to parse the things after that
        var obj = tokens.ParseValue(); // Parses a primitive value, like my_obj
        tokens.Expect('.'); // Expect a dot separator and consume
        var funcName = tokens.ExpectIdentifier(); // Expect the name of the called function
        tokens.Expect('('); // Open parens for call
        var args = new List<Value>();
        // Comma separated args
        if (!tokens.Peek(')'))
        {
            args.Add(tokens.ParseValue());
            while (tokens.Matches(',')) args.Add(tokens.ParseValue());
        }
        tokens.Expect(')'); // Close parens for call
        return new VirtcallInstruction(obj, funcName, args);
    }
}
LPeter1997 commented 2 years ago

Overloading existing instructions It might be a good idea to allow for overloading instructions with the same verb. For example, the add should work both for integer and floating-point numbers, the backend should deal with the proper handling or translation. This is mentioned in the original issue already.

Another interesting use would be to allow overloading for custom types. A dynamic language would suffer, if it had to generate type-dispatch code each time an operation is performed. Allowing to define custom types and overloading instructions for these would solve this issue.

Note, that I'm not advising that this should be a mechanism for operator-overloading. I believe classical operator overloading requires a lot more sophisticated things than instruction overloading and should be handled by the compiler frontend/middleware. This would simply serve the purpose of making the IR generation for dynamic languages or languages with custom primitives simpler.

Another interesting use would be to define SIMD types for the IR. I believe that IR extensions should be strong enough that implementing SIMD operations on custom-defined SIMD types should be possible.

skneko commented 2 years ago

Okay, here goes my idea (part 1 of many).

Philosophy and design

The following ideas take inspiration in LLVM IR, Cranelift IR and CIL opcodes. The proposed IR language tries to achieve the following objetives, for which some design decisions are proposed:

Preamble: defining a syntax

I think the first step is defining a syntax that can be used to quickly and effectively share and talk about Yoakke IR. This syntax will be useful for discussions between collaborators and users of Yoakke, for bug reports, for community questions... This answers one of the original questions raised in the original comment: yes, we should have a canonical textual representation for Yoakke IR, and this textual representation should extend to the extensions made by users. This means, for example, that creating a macro would require the user to give it a name and then it would be printed following the same syntax rules that built-in basic instructions use. And, in order to enforce this canonical representation and be able to benefit from it, we should implement an utility capable of emitting this canonical textual representation automatically from at least one (but ideally all) of the execution-compatible forms of Yoakke IR (e.g. in-memory data structure, bytecode...).

This is an example function in C:

int sum(int a, int b) {
  return a + b;
}

and this is a summary of the same function as unoptimized LLVM IR:

define i32 @sum(i32 %a, i32 %b) #0 {
entry:
  %a.addr = alloca i32, align 4
  %b.addr = alloca i32, align 4
  store i32 %a, i32* %a.addr, align 4
  store i32 %b, i32* %b.addr, align 4
  %0 = load i32* %a.addr, align 4
  %1 = load i32* %b.addr, align 4
  %add = add nsw i32 %0, %1
  ret i32 %add
}

attributes #0 = { nounwind ssp uwtable ... }

This is the same function in the proposed Yoakke IR language (also doing the same "useless" operations to be fair and to show syntax):

attributes #0 = [ foo, bar, baz ... ]

procedure @sum(a i32, b i32) -> i32 #0 {
entry(%0 i32, %1 i32):
  store a i32 -> %0 i32*
  store b i32 -> %1 i32*
  va i32 = load %0 i32*
  vb i32 = load %1 i32*
  result i32 = add va i32, vb i32 [ overflow = saturate ]
  ret result i32
}

Symbols (bindings) use no symbols, like for example result. In case a name cannot be found or is not relevant, a symbol can be created with % followed by a number (e.g. %0). The symbol @sum starts with @ because it is not local.

As defined in the original comment of this issue, attributes are added using brackets like this:

procedure @sum(i32 a, i32 b) -> i32 [ foo, bar, baz ... ] {

but in case the user wishes to reuse the same attributes in more than one place, a feature called attribute sets should be implemented that works like in LLVM:

attributes #0 = [ foo, bar, baz ... ]

procedure @sum(i32 a, i32 b) -> i32 #0 {      ; equivalent, #0 expands to a valid list of attributes

A meta-representation of instruction variant signatures

The signature of the different instruction variants used above look like this:

i32 = add i32, i32                 ; example: result i32 = add va i32, vb i32
store i32 -> i32*                  ; example: store i32 a -> i32* %0
i32 = load i32*                    ; example: va = load i32* %0
*i32 = slot type#i32               ; example: %0 = slot i32

Following the convention of CIL opcodes, we could use a dot and a suffix on mnemonics to indicate related instructions. For example, addition with an immediate could reuse the mnemonic add or use add.i:

i32 = add i32, imm#i32                 ; example: result i32 = add va i32, 7 i32
  ; OR
i32 = add.i i32, imm#i32               ; example: result i32 = add.i va i32, 7 i32

IR basics

Type ideas

Instruction ideas

C# builder API experience

I am not entirely convinced by the API shown in the example, because it requires accessing the fields here:

var cmpResult = builder.Cmp(Comparison.Less, arg, i32.NewValue(2));
var lastBb = builder.CurrentBasicBlock;  // <----

I think we could take inspiration from Microsoft's System.Reflection.Emit API and define a diferent builder for each logic level using the Builder pattern, so the Fibonacci example could be like this instead:

var assembly = new AssemblyBuilder();

var fib = new ProcedureBuilder("fib").WithParameter(i32, out var arg);
var then = new BasicBlockBuilder("then").Ret(i32.NewValue(1)).Build();
var @else = new BasicBlockBuilder("else")
  .Sub(arg, i32.NewValue(1), out var arg1)
  .Call(fib, new[] { arg1 })
  .Sub(arg, i32.NewValue(2), out var arg2)
  .Call(fib, new[] { arg2 })
  .Add(fib_1, fib_2, out var result)
  .Ret(result)
  .Build();
var start = new BasicBlockBuilder("entry").Br(Comparison.Less, arg, i32.NewValue(2), then, @else).Build();  // not compatible with branching instructions defined above, but ignore this detail for now
fib = fib.WithEntryAt(start).Build();
var assembly = assembly.WithProcedure(fib).Build();

Upcoming thoughts

LPeter1997 commented 2 years ago

My initial thoughts on the proposal. Overall: very nice, mentions a lot of important aspects. Well done!

Philosophy and design

Preamble: defining a syntax

A meta-representation of instruction variant signatures

IR basics

Type ideas

Instruction ideas

C# builder API experience

LPeter1997 commented 2 years ago

Just to get my hands dirty (and because I couldn't really just not work on it), I did a micro-experiment of how we would model types. For a challenge, I tried to model a type, that's a signed integer and can exactly store a pointer type on the given architecture. The construction code - without any builders - looks like so:

var voidPtr = new Types.Ptr(Model.Types.Void.Instance);
var voidPtrSize = new Values.SizeOf(voidPtr);
var intPtr = new Types.Int(voidPtrSize);
Console.WriteLine(intPtr);

This prints i{size_of(*void)}, which means "a signed integer with as many bits as size_of(*void) returns". This makes the model a bit more complex: we can't just have an integer argument as the bits width for the integer type, we need some IConstant instead that represents compile-time constant values.

I'm not yet sure that this complexity is worth it, but might be fine, if we can erase it at a later stage, when we receive platform-specific information. This would mean that we can do platform-independent optimizations while not resolving things like this, but as soon as we desugar things like size_of, more extreme optimizations can take place.

Maybe this also begs for aliases, reading i{N} or i{size_of(...)} can become pretty noisy, so type-aliases should be considered, so we could say type i32 = i{32} for example.

Note, that this makes type-checking a bit too strict, but that could be considered a good thing. i{size_of(*void)} cannot be known exactly until choosing a back-end, so it is only really compatible with the exact same signature, i{size_of(*void)}. Again, we should consider, if we want/need this complexity.

LPeter1997 commented 2 years ago

The power of attributes

I believe that a strong attribute system could help us in avoiding a lot of noise in the core IR itself. Attributes could contain debug information, as well as things, like the result of register allocation. It could be taken to the extreme, where the user could literally do manual register-allocation, if they wish so:

v2 = add v0, v1 [register = r0]

Many optimizations could be enforced, like [tail_call], [force_inline], ...

I'm sure there would be many-many uses for the attribute system, so we should make it rich, extensible and easy to read/write in the API.

The one thing we should avoid is having the attribute system modifying the semantics of the IR. Specifying platform-specific behavior should be fine, but attributes should be useful metadata, nothing more. The IR should execute the same on its theoretical machine without the attributes.