Empirical compiles to VVM bytecode, which is then interpreted by dispatching to a runtime. VVM has a linear IR with statically typed, vector-aware opcodes. This design yields excellent performance for many simple cases, but falls apart for compound vector expressions and for scalar operations in a long loop.
Motivation
Consider adding three vectors together:
result = xs + ys + zs
The first addition creates a temporary; the expression is equivalent to:
temp = xs + ys
result = temp + zs
Vector temporaries are not ideal from a performance standpoint because of the need to (1) manage such short-lived memory and (2) repeatedly scan previously seen data. Instead, we want to fuse the operators:
for i in range(xs):
    result[i] = xs[i] + ys[i] + zs[i]
Such fusion is only possible within the scope of a compiler.
Additionally, Empirical's VM has overhead for each instruction dispatch. This overhead is effectively hidden during large vector operations, but appears during looped scalar operations. A scalar add, for example, should really be a single x86 instruction over a pair of registers.
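To make that overhead concrete, here is a minimal C++ sketch of a switch-dispatched interpreter loop; the Instr layout and opcode names are invented for illustration and are not VVM's actual encoding. The useful payload of AddI64 is one machine add, yet every iteration also pays for the fetch, the switch dispatch, and the loop branch.

// Hypothetical sketch of a switch-dispatched interpreter loop.
// The Instr layout and opcodes are illustrative, not VVM's encoding.
#include <cstdint>
#include <vector>

enum class Op : uint8_t { AddI64, Halt };

struct Instr { Op op; int dst, lhs, rhs; };

int64_t run(const std::vector<Instr>& code, std::vector<int64_t>& regs) {
  size_t pc = 0;
  while (true) {
    const Instr& in = code[pc++];                    // fetch
    switch (in.op) {                                 // dispatch
      case Op::AddI64:
        regs[in.dst] = regs[in.lhs] + regs[in.rhs];  // the real work: one add
        break;
      case Op::Halt:
        return regs[0];
    }
  }
}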
Converting scalar operations to native instructions is likewise only possible with a compiler.
Compiler
A long-term goal for Empirical is to be able to compile VVM to LLVM. The IR was designed to be fairly compatible with LLVM while adding vector-aware opcodes. (VVM was created before MLIR was released; while MLIR might be useful here, VVM must have a linear IR to keep interpretation fast.)
Compilation should simply fuse the vector operators into an explicit loop while keeping most everything else the same. (Type parameters to certain VVM instructions will require their own work, but that is a future discussion.) LLVM should be able to handle register assignment on its own.
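As a rough sketch of what such a lowering might look like, the C++ below uses LLVM's IRBuilder to emit the fused loop from the motivation section. The function name, signature, and hard-coded double element type are assumptions for the example; a real pass would derive them from VVM's type information. Since the emitted IR is in SSA form, register assignment is indeed left to LLVM.

// Sketch: emit  result[i] = xs[i] + ys[i] + zs[i]  as one fused LLVM loop.
// Names and signature are assumptions, not VVM's actual lowering.
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Verifier.h"

using namespace llvm;

Function* emitFusedAdd3(Module& m) {
  LLVMContext& ctx = m.getContext();
  Type* dbl = Type::getDoubleTy(ctx);
  Type* dblPtr = PointerType::getUnqual(dbl);
  Type* i64 = Type::getInt64Ty(ctx);

  FunctionType* fty = FunctionType::get(
      Type::getVoidTy(ctx), {dblPtr, dblPtr, dblPtr, dblPtr, i64}, false);
  Function* f =
      Function::Create(fty, Function::ExternalLinkage, "fused_add3", m);

  auto args = f->arg_begin();
  Value* xs = args++;
  Value* ys = args++;
  Value* zs = args++;
  Value* out = args++;
  Value* n = args++;

  BasicBlock* entry = BasicBlock::Create(ctx, "entry", f);
  BasicBlock* loop = BasicBlock::Create(ctx, "loop", f);
  BasicBlock* done = BasicBlock::Create(ctx, "done", f);

  IRBuilder<> b(entry);
  b.CreateCondBr(b.CreateICmpEQ(n, b.getInt64(0)), done, loop);

  b.SetInsertPoint(loop);
  PHINode* i = b.CreatePHI(i64, 2, "i");
  i->addIncoming(b.getInt64(0), entry);

  Value* xi = b.CreateLoad(dbl, b.CreateGEP(dbl, xs, i));
  Value* yi = b.CreateLoad(dbl, b.CreateGEP(dbl, ys, i));
  Value* zi = b.CreateLoad(dbl, b.CreateGEP(dbl, zs, i));
  // The fusion itself: both adds happen before anything is stored back,
  // so no temporary vector is ever materialized.
  Value* sum = b.CreateFAdd(b.CreateFAdd(xi, yi), zi);
  b.CreateStore(sum, b.CreateGEP(dbl, out, i));

  Value* next = b.CreateAdd(i, b.getInt64(1));
  i->addIncoming(next, loop);
  b.CreateCondBr(b.CreateICmpULT(next, n), loop, done);

  b.SetInsertPoint(done);
  b.CreateRetVoid();

  verifyFunction(*f);
  return f;
}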
Fortunately, there is a lot of prior art on fusion, including numerous production-quality libraries from the machine-learning field. There should not be much reinventing the wheel here.
Tiered JIT
If VVM can compile to LLVM, that opens the possibility for both AOT and JIT applications. Regarding the latter, we want to continue interpreting in the foreground while compiling in the background; this is known as a tiered JIT and maintains fast response times for simple code snippets. In other words, the user doesn't have to wait for a full compilation if the interpreter finishes first.
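A minimal C++ sketch of that race, with stand-in types throughout (Program, compile_to_native, interpret_chunk, and native_run are all invented here; the real VVM structures will differ):

// Interpret in the foreground while compiling in the background.
#include <chrono>
#include <functional>
#include <future>
#include <thread>

struct Program { int pc = 0; int end = 1'000'000; };  // stand-in

void native_run(Program& p) { p.pc = p.end; }  // pretend compiled code

using NativeFn = void (*)(Program&);

// Pretend compiler: slow, so it runs on a background thread.
NativeFn compile_to_native(const Program&) {
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  return native_run;
}

// Pretend interpreter: makes a little progress; false when finished.
bool interpret_chunk(Program& p) { p.pc += 100; return p.pc < p.end; }

void run_tiered(Program& prog) {
  std::future<NativeFn> compiled =
      std::async(std::launch::async, compile_to_native, std::cref(prog));

  while (interpret_chunk(prog)) {
    if (compiled.wait_for(std::chrono::seconds(0)) ==
        std::future_status::ready) {
      compiled.get()(prog);  // switch tiers; see the OSR discussion below
      return;
    }
  }
  // Interpreter won the race: the user never waited on the compiler.
  // (Caveat: this future's destructor still blocks on the background
  // task; a real implementation would want a cancellable compile job.)
}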
Because VVM is statically typed, there is no need for a trace mechanism; we already know the types and can begin compiling the user's code right away. The caveat that arises is that we want the compiled version to pick up from the interpreter's current state; i.e., the compiled version should not duplicate the work that the foreground interpreter already did. Presumably, this will be handled via on-stack replacement (OSR), though hopefully it will be easier in VVM because types are known ahead of time.
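As a toy illustration of the hand-off, suppose the interpreter's frame can be captured as a plain struct, which is plausible precisely because VVM's types are static. Everything below (the struct, the names, the resume signature) is hypothetical:

// Hypothetical hand-off state; VVM's static typing means the frame
// layout is known at compile time, so a plain struct suffices here.
#include <cstddef>

struct InterpState {
  size_t loop_index;   // where the foreground interpreter stopped
  double* locals;      // the frame's register file (unused in this toy)
};

// A compiled function emitted with an extra "resume" entry point that
// starts the induction variable from the interpreter's state, so no
// element is computed twice.
void fused_add3_resume(const double* xs, const double* ys, const double* zs,
                       double* out, size_t n, const InterpState& s) {
  for (size_t i = s.loop_index; i < n; ++i)
    out[i] = xs[i] + ys[i] + zs[i];
}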
Plan of Action
Adding a tiered JIT will be a significant undertaking and will require tons of research:
- MLIR (likely won't use since VVM should be able to compile to LLVM on its own, but the ideas here are useful)
As part of the development process, much of VVM's runtime will likely be spun out as a standalone library (albeit still under the Empirical repo). The reason is that compiled code, particularly AOT, will need to link against the various runtime functions without going through VVM's dispatch.
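For instance, the spun-out runtime might expose plain C-callable entry points that AOT-compiled objects link against directly; the header name and every signature below are assumptions, not VVM's actual interface:

// vvm_runtime.h (hypothetical): the header name and all signatures are
// assumptions, not VVM's actual interface.
#include <cstddef>

extern "C" {
// Vector kernels that compiled code calls directly, bypassing dispatch.
void vvm_add_f64(const double* xs, const double* ys, double* out, size_t n);
// Allocation for the (rare) temporaries that fusion cannot eliminate.
double* vvm_alloc_f64(size_t n);
void vvm_free_f64(double* p);
}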