Execution Model for uDLang MVP

uDLang executes as follows:

Front End

A program is read from a file and parsed into an AST.

The AST is processed:

library dependencies are resolved and recursively compiled
- module values placed into top-level scope
bidirectional type-check pass is performed
- some type errors eagerly detected here
- check function call arity and argument types
- condition expressions must be boolean
- function return value must match its return type
- function arguments must match argument types
- union types require a corresponding match expression
- there should be a corresponding case pattern for each branch of the union.
- or there should be a wildcard match
- check that type at each out statement matches declared Output.
IR Lowering
- nested tree structure -> instruction sequences
stack-folding (inlining pass / futamura projection).
- catches remaining type errors
- inlining, const-folding, and dead code elimination (*)

Const folding

every sub-expression depending only on values known at compile time should fold to a single, constant value (which may be a list, map, or record value).
If a program does nothing with its input, in any code path, we should raise an error.
A const-folded expression should always produce the same value as the equivalent runtime operation.

Load Time

The compiled, optimized script kernel is loaded into the interpreter. The implementation waits for the first record to be received on the input pipe. Once received:

The first record is the shape record.
The interpreter checks the shape of the record against the input type declared by the script.
A corresponding shape record is sent downstream, corresponding to the output type.

The interpreter transitions to "runtime" mode.

Runtime

Interpreter loops in a cycle.

input record is decoded
script execution begins at runtime entry point
script exits when it encounters an out instruction, or reaches the end of the main block.

When execution encounters an out instruction:

the top of stack is written to stdout, and stdout is flushed.
end of current cycle..

When execution encounters a trap instruction:

pop the pattern and handler function from top of stack
sets the current "handler" for pattern as the given function on the current frame.

When execution encounters a throw instruction:

unwind the stack until we reach a frame which can handle the exception.
resume execution from the handler.

IR

The IR is geared towards the straight-forward Rust implementation. It represents a trade-off between simplicity of the virtual machine, the simplicity of code generation, and other concerns such as code density and locality.

As such, there is tons of head-room for future optimizations here.

Programs

A program consists of a structure with the following data:

an input type schema
an output type schema
a set constant values, which can be referenced by index
a list of blocks.

Blocks

Blocks are flat sequences of instructions in reverse-polish notation.

Static blocks are indexed numerically, with block index 0 and 1 have special meaning (see below).
Within a block the arg instruction transfers positional arguments from the given frame to the stack.
Reaching the end of a block is an implicit return.
The ret instruction can be used for early return from a block.
The throw instruction unwinds the call stack until a frame with an appropriate handler is reached.

Upon return from a block, if the call-stack is not empty, the value stack is contracted to contain the correct return values according to the callable, and execution resumes from the calling block, at the return position indicated by the call stack.

If the call stack is empty, the result depends on which block was finished:

0th or 1st blocks, control returns to the implementation.
any other block, control passes to an error handler

Block 0 (`init`)

The 0th block is interpreted as a top-level procedure which performs load-time initialization. This block will be called once during the life-cycle of the program, at load time. It roughly corresponds to the top-level statements in a script which occur before the input / output declarations, though I am considering a mechanism to allow explicitly placing user code in this section (see #16).

Code in this block may not use the in in instruction. Nor may it call another block which uses the in instruction.

Code in this block may use sys instruction. Code in this block may use the out instruction. out instructions in this block yield output only once during the lifetime of the program.

The arg instruction is interpreted as indexing the list of command-line parameters

Block 1 (`main`)

The 1st block is interpreted as the program entry point.

This block will be called once on each input record that the program is asked to process during its life. Code in this block may use the in, sys, and out instructions, and may call any other block besides 0 or 1.

The arg instruction is interpreted as indexing the list of command-line parameters.

Block n

Remaining block indices are reserved for future use.

Instructions

Instructions are encoded as a Rust enumeration. The instruction set is chosen to allow a fixed-sized instruction word of 64-bits or smaller, and so the use of immediate values is limited to u16 or smaller. All instructions operate on an implicit value stack.

`Instruction::Load(atom)` `Instruction::Store(atom)`

Load or store a local by name.

The item moves between the value stack and the locals within the current scope.

`Instruction::Const(Addr)`

Places a constant value onto the value stack.

`Instruction::LCons(n)`, `Instruction::MCons(n)`

Dynamically constructs a list (or map) from the top n (or top 2 * n) items on the stack.

`Instruction::In`

in: place the input record onto the stack

`Instruction::Call(CallType)`

Call types:

CallType::Always - consumes callable and args
CallType::If - <args...> <boolean> <callable>
CallType::IfElse - <args...> <boolean> <callable> <callable>
CallType::Map - <iterable> <callable>
CallType::Fold - <value> <iterable> <callable>

`Instruction::Out`

Sends top of stack to output stream or current capture, and ends processing of the current cycle.
Returns control to the implementation.

`Instruction::Debug`

Send a string representation of the top of stack, without consuming it, to stderr.

`Instruction::Placeholder(Single)`

Place the "identity thunk" onto the stack.

`IR::Index(IndexType)`

IndexType::Array
IndexType::Record

Retrieve the appropriate element from the given collection. Will trap with TypeMismatch if the collection type is not congruent.

`Instruction::Matches(TypeTag)`

Tests whether the given value matches the given type. TypeTag is discussed elsewhere.

`Instruction::Coerce(TypeTag)`

Tries to convert the value at the top of stack to the type given by the tag. Only the following coercions are defined, and some of them may fail at runtime.

I32 -> F32
U32 -> F32
I32 -> U32
F32 -> F64

`Instruction::Binary(BinOp)`, `Instruction::Unary(UnOp)`

Blanket instruction wrapper for the usual family of arithmetic and logic operations on int, float, and string. In addition, BinOp::Add is overloaded to mean concatenation for lists and strings, and union for maps.

Types and Values

Values are typed, and carry their type with them (Value is rust enum). Values flow from a source instruction, through zero or more operations, to a sink instruction:

Source Instructions:

Instruction::In - value depends on runtime data
Instruction::Load - lexical binding
Instruction::Const - load from constant table

Sink instructions:

Instruction::Out - value is part of program output
Instruction::Store - lexical binding

Primitive Types

These are "unboxed" in the implementation.

Unicode Strings (utf8)
Bytestrings (raw)
U8, U16, U32, U64
I8, I16, I32, I64
F32, F64
Int, UInt, Float

These types are heap-allocated:

TypeTag
- List
- Record

Type Values

uDLang needs runtime reflection to perform input validation

TBD.

We can either use msgpack extension fields.
Or we can define an "in-band" encoding for type values.

Need to be able to represent arbitrary shapes:

(Int, Float) -> [Str]
"three" | "four" | {five: Int}

Idea: postfix type notation...

TInt TList TInt TMap TInt
TInt TFloat TStr TFunc(2)
Const("three") TVal Const("four") Const("five") TVal TInt TField TRecord(1) TSum(3)
Now we can represent any type shape on the stack
patterns could be expressed in a similar fashion

Callable Types

The primitive callable a Block value:

A callable is a sequence of instructions plus meta-data:

named list of arguments
number of returns (must be 0 or 1).

Closure:

a tuple of (Block, captures)
These spring into existence when a block is returned or passed as a function argument.

Thunk:

special case of closure for partial evaluation
see #37

The implementation may make a type-level distinction between functions, closures, bound-methods and thunks. While closely related, each may need special handling for performance reasons.

Partial Evaluation (Thunks)

See Issue #37

Collections / Composite Types

list
record

Lists support numeric indexing. Lists support concatenation (+). An lcons(n) operation creates a new list from the top N stack elements. A map call over a list expects a function of one argument. A fold expects a function of two arguments.

Records support arbitrary key indexing. Records support update, union, and intersection operations. mapcons(n) creates a map from the top (2 * n) stack elements, interpreting them as key, value pairs. A map call over a record expects a function of two arguments. A fold call expects a function of 3 arguments.

emdash / udlang