exo-lang / exo

Exocompilation for productive programming of hardware accelerators
https://exo-lang.dev
MIT License

Target-specific, User-defined Libraries #474

Open rachitnigam opened 1 year ago

rachitnigam commented 1 year ago

I was thinking through a specific use case for Exo: what would it take to convince people building small but custom compilers for their internal accelerator X to instead use Exo? We've recently been thinking about building higher-level libraries that mimic the capabilities of other user-schedulable languages like Halide, but it is not clear to me that this use case is satisfied by that approach.

Bear with me with this setup for a moment. Here is a table of targets ($T_i$) and kernels ($K_i$):

|      | K1 | K2 | K3 |
|------|----|----|----|
| T1   |    |    |    |
| T2   |    |    |    |
| T3   |    |    |    |

An Exo approach requires $9$ different schedules in this case: one for each of the $3 \times 3$ (target, kernel) pairs. When writing these schedules, there are three different ways to abstract the scheduling code to enable reuse:

1. Domain-specific operators that exploit the structure of the application domain the kernels come from.
2. Target-specific operators that exploit knowledge of a particular backend $T_i$.
3. Generic operators that apply regardless of domain or target.
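To make the contrast concrete, here is a minimal, self-contained Python sketch (all names are hypothetical and the loop IR is a toy; this is not Exo's actual API) of a generic, a target-specific, and a domain-specific scheduling helper:

```python
# Toy loop IR (hypothetical, not Exo's): a loop has an iteration variable,
# a trip count, and a body of ops (strings or nested Loops).
from dataclasses import dataclass

@dataclass
class Loop:
    var: str
    trip: int
    body: list

def split(loop, factor):
    """Generic: valid for any target and any domain -- split a loop by `factor`."""
    assert loop.trip % factor == 0, "split requires an exact factor"
    inner = Loop(loop.var + "i", factor, loop.body)
    return Loop(loop.var + "o", loop.trip // factor, [inner])

def vectorize_avx2(loop):
    """Target-specific (made up): AVX2 holds 8 f32 lanes, so split by 8 and
    tag the inner loop's ops as vector ops."""
    outer = split(loop, 8)
    outer.body[0].body = [("vec_op", op) for op in outer.body[0].body]
    return outer

def tile_stencil(loop, tile):
    """Domain-specific (made up): stencils reuse neighbouring points, so tile
    the iteration space before any target mapping."""
    return split(loop, tile)

# Usage mirrors the mix described below: reshape with the domain operator
# first, apply the target operator right before instruction mapping.
l = Loop("i", 64, ["load", "add", "store"])
l = tile_stencil(l, 16)                 # domain-level reshaping
l.body[0] = vectorize_avx2(l.body[0])   # target-level mapping
```

The point of the sketch is only the division of labour: `split` knows nothing about targets, `vectorize_avx2` bakes in one backend's lane width, and `tile_stencil` encodes a domain heuristic.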

In reality, you probably want a mix of all three: some scheduling operations that take advantage of the application domain, some that take advantage of the target, and some that are generic. The domain operators are used initially to aggressively reshape the program, while the target-specific operators are used right before instruction mapping. However, it seems to me that in order to replace the existing "small DSL + compiler for my accelerator" use case, the second style of abstraction is the important one.

Operationally, I think this doesn't change the short-term goal of having more user-defined scheduling operations without abstraction. However, we could pitch these target abstractions as part of the story too.

rachitnigam commented 1 year ago

A couple of specific examples of these from conversations with @jrk and @gilbo:

Register Allocation

Register allocation is one of those things you don't think about much unless you are doing low-level perf engineering, because it is completely automated in most compilers and, AFAIK, not influenceable from the source-level program.

This kind of automation, from an Exo perspective, is a target-specific abstraction: for each backend, you can imagine writing a register allocation pass that automatically attempts to assign the right registers to all the buffers in your program. Next, you can imagine extending the memory API so that your allocator tells you which registers had to be spilled. For example, this original program:

```
A: f32[10]
B: f32[16]
...
```

gets transformed into:

```
A: f32[10] @ AVX2 & Spilled
B: f32[16] @ AVX2
...
```

However, A is used in a perf-critical section while B is used to move data in and out. We can use .set_memory to change the allocation in our program.

Of course, with this automation and low-level rearrangement, we might even want an analysis that ensures that our given allocation is valid for the target!
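As a hedged sketch of the shape this could take (toy data structures and made-up names, not Exo's memory API): a greedy allocator over named buffers that reports spills, plus a `.set_memory`-style override that pins the perf-critical buffer:

```python
# Toy sketch (hypothetical, not Exo's API): a greedy register allocator over
# named buffers that reports which allocations had to be spilled.

def allocate(buffers, num_regs, pinned=()):
    """buffers: {name: size}. Default heuristic: largest buffers first
    (so B: f32[16] beats A: f32[10]), unless the user pins a buffer via a
    .set_memory-style override. Returns {name: memory annotation}."""
    order = list(pinned) + sorted(
        (b for b in buffers if b not in pinned),
        key=lambda b: -buffers[b])
    alloc, free = {}, num_regs
    for name in order:
        if free > 0:
            alloc[name] = "AVX2"
            free -= 1
        else:
            alloc[name] = "AVX2 & Spilled"  # the allocator reports the spill
    return alloc

bufs = {"A": 10, "B": 16}                   # f32[10], f32[16] from above
auto = allocate(bufs, num_regs=1)           # automatic pass: A gets spilled
fixed = allocate(bufs, num_regs=1, pinned=("A",))  # override: now B spills
```

The automatic pass reproduces the annotated listing above (A spilled, B in registers); pinning A models the `.set_memory` fix, and the spill simply moves to B.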

Register Allocation Analysis

Analyses and abstractions go hand in hand: if we want users to benefit from high-level scheduling operations while still having low-level control over things, we should provide a way to define new analyses. For example, register (or memory) allocation is such a common task that you can imagine providing a way to build a new memory-allocation analysis:

```python
mem_alloc = exo.analysis_builder.MemoryAllocation(
  target = "intel-amx",
  registers = {
    "single-precision": 32,        # Completely made up
    "double-precision": 16,
    "exclusive": False,
  },
)
```

These analyses could then be used in conjunction with high-level scheduling operators to allow fine-grained control with guarantees.
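A self-contained sketch of what such an analysis might check (toy data structures and hypothetical names, not Exo's analysis API): that the buffers a schedule placed in registers actually fit the target's register budget.

```python
# Toy sketch (hypothetical): a memory-allocation analysis that validates an
# allocation against a target's register budget.

def check_allocation(allocation, budget):
    """allocation: {buffer_name: register_class}; budget: {class: count}.
    Returns the register classes that are over budget (empty list = valid)."""
    used = {}
    for reg_class in allocation.values():
        used[reg_class] = used.get(reg_class, 0) + 1
    return [c for c, n in used.items() if n > budget.get(c, 0)]

budget = {"single-precision": 2, "double-precision": 1}  # made-up limits
ok = check_allocation({"A": "single-precision", "B": "single-precision"}, budget)
bad = check_allocation({"A": "double-precision", "B": "double-precision"}, budget)
```

A scheduling operator could run a check like this after every rewrite, which is the "fine-grained control with guarantees" combination described above.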

Vectorization

This is a very general instance of a domain abstraction. However, vectorizers often have to make crucial, target-specific decisions like whether or not to use predication or explicit masks (or operate over an abstract IR).

The Exo approach here could be building a dead-simple, predictable vectorizer that gives up when it sees complex branching code and requires the programmer to pick the right strategy for handling it on the particular backend they want to target.

Of course, on top of this dead-simple vectorizer, you can imagine implementing a more sophisticated vectorizer that takes a list of scheduling operators to try when it gets stuck. This is the power of composition: a tower of abstractions that are all rooted in simple, predictable operators.
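A self-contained sketch of that composition (toy IR where statements are strings; all names are hypothetical, not a real Exo vectorizer): a simple pass that refuses branching code, wrapped by one that tries a list of programmer-supplied rewrites and retries.

```python
# Toy sketch (hypothetical): statements are strings; "if ..." marks branching.

class GiveUp(Exception):
    """Raised when the simple vectorizer hits code it refuses to handle."""

def simple_vectorize(stmts):
    # Dead simple and predictable: vectorize straight-line code, give up on
    # branches rather than silently choosing predication vs. masks.
    if any(s.startswith("if") for s in stmts):
        raise GiveUp("branching code: pick a strategy (predication, masks, ...)")
    return [f"vec({s})" for s in stmts]

def if_to_select(stmts):
    # One programmer-chosen rewrite: turn branches into select ops.
    return [s.replace("if", "select") if s.startswith("if") else s for s in stmts]

def vectorize_with_fallbacks(stmts, rewrites):
    # Sophisticated vectorizer built on the simple one: when stuck, apply each
    # scheduling rewrite in order and retry -- a tower of predictable operators.
    try:
        return simple_vectorize(stmts)
    except GiveUp:
        for rewrite in rewrites:
            try:
                return simple_vectorize(rewrite(stmts))
            except GiveUp:
                continue
        raise

out = vectorize_with_fallbacks(["if x: y = 0", "z = y + 1"], [if_to_select])
```

The outer vectorizer never invents behaviour of its own: everything it does is a composition of the simple pass and rewrites the programmer explicitly supplied.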