Background

Mit and Welly each include a specializing interpreter, intended as a prototype of this project. Mijit's goal is to generate code on-the-fly, in contrast to the prototype, which separates profiling from compiling.

We think we have already worked out a lot of the necessary concepts in the context of the prototype. Let me summarize them here.

Histories

The code labels generated by the JIT correspond to "histories". A history is roughly a contiguous run of executed instructions, but it can also include control flow decisions such as "if" conditions and the results of dynamic type checks.

The history corresponding to a code label may be understood as the shortest sequence of VM instructions (etc.) that would result in the label being visited, starting from the main entry point of the mijit_run function. The history may dually be understood as the VM instructions that need to be retired in order to exit the mijit_run function by the shortest path.

Metadata

In addition to the code compiled for each history, Mijit maintains additional data about each history, including:

the VM instructions it represents
prefix and suffix relationships with similar histories
profiling information (how frequently the history has been visited)
any other information that was calculated while constructing the history that might be useful when constructing longer histories of which it is a prefix or suffix, including:
- register allocation decisions
- optimization opportunities found

This metadata is not touched during normal execution (apart from incrementing the profiling counters). It is only needed for compiling new code.

Language

The finite set of histories (perhaps up to a million) for which code exists is the "language". The language grows over time as the JIT compiler discovers and compiles hot code.

Left and right trees

The language must satisfy some invariants at all times:

For any two histories in the language, their common prefix must also be in the language.
For any two histories in the language, their common suffix must also be in the language.

Thus, the language has an intricate double tree structure. The trees share a root (the empty history). Long histories are often leaves of both trees, but it is possible for a history to be a leaf of only one of the trees.

Fetch, retire

Control-flow proceeds along the branches of these two trees, as follows:

When we fetch VM instructions, the history grows on the right. The history before the fetch is a prefix of the history after the fetch. We are moving towards the leaves of the "right tree".
When we retire VM instructions, the history shrinks on the left. The history after the retirement is a suffix of the history before the retirement. We are moving towards the root of the "left tree".

Execute

VM instructions are fetched and retired in order. Each VM instruction is executed between being fetched and being retired.

The VM instructions may be executed out of order. In other words, for each history, the JIT has freedom to decide which VM instructions in the history are "executed before" entering the code corresponding to the history, with the rest being "executed after".

For now, the simplest idea is to execute VM instructions immediately on fetching them, i.e. in order. In other words, we decide that all the instructions in the history are "executed before". This is what we did in the prototype. I expect this design will perform reasonably well on out-of-order CPUs. Scheduling instructions for in-order CPUs is an optimization for the future.

Profiling

The best place to put the profiling counters is on the retire transitions. There are generally fewer retire than fetch transitions (i.e. we retire more instructions per transition), and there is opportunity to increment a counter in parallel with other things going on at the time.

We know that for every history the number of times control flow enters it is equal to the number of times control flow exits it. Therefore, we can compute by induction on the right tree:

how often each fetch transition is executed
how often each history is visited
how often each history occurred, including when bypassed via a longer history

These forms of the profiling information are much more useful; they can be used to decide when to extend the language, and which new histories to construct. It will be necessary to convert profiling information back and forth between the form that is efficient to collect and the forms that are useful.

Probability model

There are various circumstances in which we have to guess statistics about histories that don't exist. For example:

Compiling new code has a fairly large O(1) cost, e.g. calling mprotect and flushing and reloading caches. To amortize this cost, we must compile a reasonably large amount of code at a time. We must guess what histories are going to be useful.
After compiling new code, we must set the profiling counters to sensible values. We hope that the new code paths will be used in place of older, less specialized code paths. It would be wasteful to initialize to zero all the counters on the new code paths, because we have collected useful information about the old code paths. We should guess what values the counters would have reached if the histories had been constructed earlier.

The following approximations are plausible (f(h) is the number of times history h has occurred, including the times it was bypassed):

f(abc) / f(ab) is approximately f(bc) / f(b). They are both roughly the chance that b is followed by c; the additional information that b was preceded by a will usually make only a small difference.
f(abc) / f(bc) is approximately f(ab) / f(b), similarly.

These two rules of thumb are equivalent, and perhaps a more practical expression is that f(abc) is approximately f(ab) * f(bc) / f(b).

Cost model

As compiling new code is expensive, we want to model the cost.

We resist compiling new code until the counter on a retire transition exceeds some threshold, e.g. 1000. Probably the best implementation is a counter that starts at 1000 and counts down to zero, but let's pretend that the counter counts upwards. The purpose of the counter is to ensure that we spend as much effort executing code as compiling it. Effectively, it prevents us compiling the cold code. The counter is the cheapest mechanism I can imagine that will do this job.

The total of all the counters is a measure of the total execution time. If we arrange that all counters are usually somewhere fairly random between 0 and 1000, their total will grow in proportion to the amount of code we generate. I don't think it's necessary to reduce the counter values when we compile new code; instead we should reassign some of the counts to the new code, preserving the total.

Behaviour

There are four main modes of execution within mijit_run:

Fetching instructions, as part of normal execution.
Retiring instructions, as part of normal execution.
Retiring instructions, aiming for a trap or error. Mijit does not aim to implement the entire VM instruction set, and will leave rarely executed instructions to a TrapHandler that only works when the history is empty.
Retiring instructions, aiming to compile new code, which again can only be done when the history is empty.

Fetch and trap labels

For normal execution, we need one "fetch" label at which we can enter the code corresponding to each history. The code at that label first tries various fetch transitions. If successful, it (executes some instructions and) jumps to the fetch label of a longer history. If unsuccessful, it (executes some instructions and) follows the unique retire transition to the fetch label of a shorter history. These are hot paths.

For other cases, we need a "trap" label at which we can enter the code that will retire all the instructions in the history and return to the root history. The code at this label is identical with the code for retiring instructions during normal execution, except that it must jump to the trap label of the shorter history, not the fetch label. This is a cold path.

Sharing retire code

I think it is possible to share the code to retire instructions, without slowing down the hot path. If so, it is probably desirable, as it is quite a large fraction of the code.

As described above, the fetch label attempts various fetch transitions. If it is unsuccessful, it jumps to a shared "retire" label. The "retire" label retires some instructions and jumps to the fetch label of the shorter history. The same code has been executed as in the description above.

To effect a trap, we must set things up in such a way that all attempts to follow a "fetch" transition fail. For example, we may set the register than normally holds IR to a special value, while saving the real IR value somewhere else. There may be several such "special values", each indicating a different kind of trap.

The root

The retire label of the root history obviously must do something different (in the prototype this label was called A_FALLBACK). It must restore the correct value of IR, and dispatch to whatever code is appropriate for handling the trap. Examples include:

Raising an error, by returning from mijit_run with an error code.
Executing a TRAP instruction, by calling the TrapHandler callback.
Executing a rare instruction, e.g. DUP with a large index, also by calling the TrapHandler callback.
Compiling new code.

In all but the first case, execution continues at the fetch label of the root history.

Evolution

Let us imagine how the code for a history changes over time. At first, the history is not in the language and no code exists. Eventually, the history is added to to the language, and we construct its retire label. At this point its fetch label coincides with its retire label, i.e. there are no fetch transitions from this history. Then, we add fetch transitions, one at a time.

Each time we add a fetch transition, we compile "fetch code" for it. This code checks (e.g.) that the next few VM instructions match the transition, and if so executes appropriate code for the transition. This code must be inserted into the existing control flow path, after the existing fetch code for the history (if any) and before the retire label.

Patching

Inserting a chunk of code into an existing control-flow graph requires some care.

First, we must ensure that no thread is executing the code while we're modifying it. This is trivial in the single-threaded case.

Then, we must allocate memory for the new code, and fill it. The new code will generally not be contiguous with the old code. The new code ends with a jump to the old code (e.g. a retire label).

Then, we must find all instructions that jump to the old code, and rewrite them so that they jump to the new code. Jump instructions have a simple encoding, so this is quite feasible. However, it requires that we use the most general form of the jump instruction, with the largest available offset, even if it is not initially necessary.

Many jump instructions can be patched at once by making them indirect through a shared code pointer. However, this comes at the cost of performance and additional maintenance.

Summary

For each history, we need:

read/writable metadata, including execution counters
a retire label, with a fixed retire code block.
a fetch label, with a growing chain of fetch code blocks.

Some further thoughts follow, about:

the representation of histories
the algorithm for generating optimized code, and its complexity
the strategy for deciding what histories to compile

Histories, really

I said above "A history is roughly a contiguous run of executed instructions, but it can also include control flow decisions such as "if" conditions and the results of dynamic type checks". Let's firm that up: a history is a path through the control-flow graph of Mijit's bootstrap interpreter.

Finite state machine

In a traditional interpreter loop, every control-flow arc starts and ends in the same place, being the start of the loop. The history is then a string of those arcs, and any such string is a possible history. Then, the empty string is a prefix and suffix of all other histories, and the fetch and retire transitions form trees, with the empty history as the root, as I described above.

However, Mijit will need to support interpreters with a more complex control-flow graph. Mijit allows the control-flow of the bootstrap interpreter to be a finite state machine. Here are some examples of why that might be useful:

There could be multiple "modes" of interpretation, and instructions to switch between them.
For a CISC architecture, decoding an instruction might involve nested control flow constructs.
The VM code could be compressed, by changing the encoding of an instruction depending on the previous instruction.
It might just be easier to write the interpreter that way.

Therefore, in general, histories do not always start and end in the same state of the finite state machine. This means that not all strings of control flow arcs are valid histories; two histories can only be concatenated if one ends where the other starts.

Multiple roots

Having multiple control-flow states makes things slightly more complicated, though not really any more difficult:

There is an empty history for each state:
- It is a prefix of every history which starts in that state, and no others.
- It is a suffix of every history than ends in that state, and no others.
There is a right tree and a left tree for each state.
- The empty history for the state is the root node of both trees.
- Each history is a node in the right tree corresponding to its start state.
- Each history is a node in the left tree corresponding to its end state.
- Each fetch transition is an edge in exactly one of the right trees.
- Each retire transition is an edge in exactly one of the left trees.

Strings of booleans

The VM specification data structure defines the semantics of the VM by expressing the bootstrap interpreter in a domain-specific language in which the only control-flow construct is "if". When the Mijit boots up, the bootstrap interpreter will be compiled directly from that specification in a fairly naive way.

The states of the finite state machine are the "if" statements, and there are two control-flow arcs exiting each state. The number of control-flow arcs entering a state can vary.

This suggests that a history could be represented as its start state and a string of booleans. Its end state is implied. That idea works, and is nicely general, and is a useful concept. However, a string of booleans is probably not the best data structure to use to represent histories, because:

Histories can get very long, expressed in terms of unoptimized VM instructions, or anything that has a simple mapping to them, including booleans. It would be better to express histories in terms of the optimized code.
A string of booleans is quite abstract, and can only be understood by constantly consulting the VM specification data structure.

Nested structure

The way the JIT constructs histories naturally leads to a nested structure.

A history is a string of transitions

There is a unique path to each history through its right tree from the whichever empty history shares its start state. This expresses the history as a concatenation of fetch transitions. Dually, its left tree expresses it as a concatenation of retire transitions. This reduces the problem of representing histories to that of representing (say) fetch transitions.

Following a sequence of fetch transitions leads from the least specialized state (an empty history) through a sequence of gradually more specialized states. We hope that fetch transitions from specialized states do more work, measured e.g. as a number of booleans. If so, the length in transitions of a history should grow sub-linearly with its length in booleans.

A transition is a string of transitions

For each history, exactly one fetch transition enters it, and exactly one retire transition exits it. When a history is first constructed, these are the only transitions to or from the history. The hope is that the newly compiled code for these transitions is used instead of less specialized code that was compiled earlier. In more detail:

The new code consists of the new fetch transition followed by the new retire transition.
It replaces a sequence starting with a retire transition and ending with a fetch transition, possibly with additional fetch and retire transitions in between.

Thus, we can represent a fetch transition as the sequence of fetch transitions it replaces (the retire transitions are implied).

The base case

Thus, we can represent histories and transitions using an inductive structure. The inductive step was compiling new code. The base case is a transition that is part of the original bootstrap interpreter. The initial fetch transitions are in 2:1 correspondence with the "if" statements of the VM specification (since each "if" has two branches).

Optimizer

The optimizer is called once for each history, when the history is first constructed. It must generate code for the unique fetch transition that enters it and the unique retire transition that leaves it.

It will be useful to recall that the code for a fetch transition will start with an "if" that tests whether the transition can occur; the "if" condition is determined by the end state of the "before" history, and the transition advances the end state. The code for a retire transition from a history will be used only when all outgoing fetch transitions (if any) have been eliminated, i.e. their "if" conditions have been evaluated to false. The "if" condition of the eliminated transitions is determined by the end state of the "after" history, and the transition preserves that state.

The concatenation of the two new transitions represents a specialized code sequence which we hope to use in place of a less specialized sequence of existing transitions that were compiled earlier. The old sequence starts with a retire transition and ends with a fetch transition. In the case where some "if" conditions are proved to be constant, the old sequence may contain additional fetch transitions, and for various reasons it may contain additional retire transitions.

Inputs

The optimizer has the following inputs:

The code generated for the existing transitions. This code has already been optimized as much as their history permits, and it would be foolish to repeat that work. Obviously we maintain a copy of the code as a high-level data structure; we don't have to disassemble the native code.
The calling convention chosen for the histories before and after the code sequence.

Outputs

The optimizer has the following outputs:

The code for the new transitions.
The calling convention for the new history.

Calling conventions

The calling convention of the history before the sequence summarizes all the information we have accumulated about the state of the virtual machine by fetching that history. This includes, for example:

which registers and memory locations hold which computed results
the outcomes of all the "if"s we've executed

The calling convention of the history after the sequence summarizes all the information that we need in order to retire that history. This includes, for example:

which registers need to hold which computed results
which computed results and memory locations are dead

Cache

Part of the calling convention of a history, including the register allocation, may be interpreted as a cache state. Bits of virtual machine state, including memory contents and computed values, are held in locations that are cheap to access, including registers, spill slots, and even constants. This mental picture provides useful intuition and terminology.

In hardware CPUs, the cache state is dynamic; it is computed on the fly. In contrast, the optimizer must compute the cache state statically; anything else would be too expensive. In other words, at each point in the compiled code, the same values are always cached in the same places.

Dataflow graph

In overview, the optimizer has the following steps:

Symbolically execute the input code to make a dataflow graph.
Optimize the dataflow graph.
Schedule the instructions, i.e. decide what order they should be in.
Allocate registers and spill slots.
Decide where to split the code into the fetch transition and the retire transition.
Compute the calling convention at the split.

Constructing

The input code probably uses the cache suboptimally. Fetch transitions generally cache additional values, and retire transitions generally flush values from the cache. Therefore, any retire transition that comes before a fetch transition risks flushing values from the cache and then reloading them. Removing these inefficiencies is a large part of what the optimizer does.

The optimizer computes the dataflow graph by symbolically executing the input code, starting from the given "before" cache state, and recording what happens. Instructions that access (or otherwise compute) cached values are removed, or replaced by cheaper instructions. Instructions that access (or otherwise compute) uncached data are preserved, and their results are added to the cache. At the end, instructions are added to the dataflow graph to flush values from the cache in order to match the given "after" cache state.

Optimizing

Some optimizations are performed automatically by the cache:

constant propagation
copy propagation
constant folding
common sub-expression elimination
peephole optimization of memory accesses

An additional pass over the code is needed for other optimizations:

peephole optimization of arithmetic
dead code and value elimination

Scheduling

The simplest idea is to leave the instructions in their original order.

A more ambitious idea is to model how long it takes to compute each value, and to sort the instructions accordingly. This is an "inifinite parallelism" approximation.

A yet higher ambition is to model the behaviour of a particular target CPU, including the limits on its decoding and execution parallelism. Instructions on the critical path to the next "if" condition are given the highest priority, and are scheduled as early as possible. Other instructions are scheduled in parallel with them provided doing so is free. Any left over instructions are scheduled at the end.

Split

The simplest idea is to split the code just before the instructions that flush the cache. Everything before that point becomes the fetch transition, and everything after becomes the retire transition.

A more ambitious idea is to split the code at the earliest point where the next "if" condition can be tested.

Complexity

The cost of generating code for a history is roughly proportional to the size of the dataflow graph. This is bounded by the total length of the code for the replaced transitions. Importantly, that code has already been optimized. The original number of VM instructions may be much larger, but it is pleasingly irrelevant.

We may also hope that as we compile more specialized transitions, we do more useful work per transition. Suppose each new transition replaces a sequence of on average k transitions. Then, after specializing each part of the flow graph n times, we could get transitions that do O(k^n) work. Of course, I don't expect exponential speed-up in most programs; it is possible in ideal situations, but there are a number of factors that prevent code sequences being optimized away to nothing.

Strategy

Recall:

Profiling counters are attached to retire transitions.
When a counter overflows, we compile new code.
We initialize the new counters by estimating what values they would have if the new code had existed all along.
We decrement the old counters by the same amount, to preserve the total.

Rewinding execution

In the prototype, the strategy was to compile code starting at the history before the transition whose counter overflowed. Instead, I think it might be better to rewind execution to an earlier history, and start there. The goal is to generate code for a long history, in spite of the tendency for short histories to reach their counter threshold first.

It is not necessary to rewind history perfectly, and in particular it is not necessary to record the execution trace. We can use the probability model to follow the execution trace backwards, stopping if the probability of being right drops too low, and backtracking (forwards) if our guess turns out to be logically impossible. We must start the new code just before a retire transition, so it may be necessary to backtrack (forwards) a little more.

Multiguessing

In the prototype, the strategy was to construct the language as if fetch transitions fetch only single VM instructions, and then post-process the language to find opportunities to fetch multiple instructions at a time. In the context of a JIT, there is no opportunity to do a post-processing pass. Instead, we must construct fetch transitions with more specific "if" conditions than the ones they replace.

We can do this by tracing execution forwards using the probability model, stopping if the probability of being right drops too low. On this path, we collect "if" conditions that can be merged with the first one on the path. For example, if the first "if" tests a bit of a value, we collect "if"s that test other bits of the same value. For another example, if the first "if" tests that a value is less than a constant, we collect "if"s that compare the same value to other constants. This is fairly straightforward, because the possible "if" conditions are restricted by the domain-specific language used to specify the VM.

We then construct a new fetch transition whose "if" condition is the conjunction of all the collected conditions. Perhaps it will not be used, because the conjunction of conditions is never true; in that case we wait for a different profiling counter to overflow and compile an alternative fetch transition. For the case when the conjunction is true, we generate code up to the next "if" whose condition remains uncertain.

There will sometimes be cases where the conjunction is a stronger condition than is necessary to reach the next uncertain "if". For example, suppose the first three conditions are "bit 0 of IR; something else; bit 1 of IR". Then, the conjunction is "bits 0 and 1 of IR" but the new code will stop before testing "something else". It is nonetheless worth testing the stronger condition and leaving the result in the optimization cache, so that we can assume it when compiling the following fetch transition. For example, the following fetch transition might test "something else", then assume "bit 1 of IR".

apt1002 / mijit

What code do we want to generate? #2