inlining, const-folding, and dead code elimination (*)
Const folding
every sub-expression depending only on values known at compile time should fold to a single, constant value (which may be a list, map, or record value).
If a program does nothing with its input, in any code path, we should raise an error.
A const-folded expression should always produce the same value as the equivalent runtime operation.
Load Time
The compiled, optimized script kernel is loaded into the interpreter.
The implementation waits for the first record to be received on the input pipe. Once received:
The first record is the shape record.
The interpreter checks the shape of the record against the input type declared by the script.
A corresponding shape record is sent downstream, corresponding to the output type.
The interpreter transitions to "runtime" mode.
Runtime
Interpreter loops in a cycle.
input record is decoded
script execution begins at runtime entry point
script exits when it encounters an out instruction, or reaches the end of the main block.
When execution encounters an out instruction:
the top of stack is written to stdout, and stdout is flushed.
end of current cycle..
When execution encounters a trap instruction:
pop the pattern and handler function from top of stack
sets the current "handler" for pattern as the given function on the current frame.
When execution encounters a throw instruction:
unwind the stack until we reach a frame which can handle the exception.
resume execution from the handler.
IR
The IR is geared towards the straight-forward Rust implementation. It represents a trade-off between simplicity of the virtual machine, the simplicity of code generation, and other concerns such as code density and locality.
As such, there is tons of head-room for future optimizations here.
Programs
A program consists of a structure with the following data:
an input type schema
an output type schema
a set constant values, which can be referenced by index
a list of blocks.
Blocks
Blocks are flat sequences of instructions in reverse-polish notation.
Static blocks are indexed numerically, with block index 0 and 1 have special meaning (see below).
Within a block the arg instruction transfers positional arguments from the given frame to the stack.
Reaching the end of a block is an implicit return.
The ret instruction can be used for early return from a block.
The throw instruction unwinds the call stack until a frame with an appropriate handler is reached.
Upon return from a block, if the call-stack is not empty, the value stack is contracted to contain the correct return values according to the callable, and execution resumes from the calling block, at the return position indicated by the call stack.
If the call stack is empty, the result depends on which block was finished:
0th or 1st blocks, control returns to the implementation.
any other block, control passes to an error handler
Block 0 (init)
The 0th block is interpreted as a top-level procedure which performs load-time initialization. This block will be called once during the life-cycle of the program, at load time. It roughly corresponds to the top-level statements in a script which occur before the input / output declarations, though I am considering a mechanism to allow explicitly placing user code in this section (see #16).
Code in this block may not use the in in instruction. Nor may it call another block which uses the in instruction.
Code in this block may use sys instruction. Code in this block may use the out instruction. out instructions in this block yield output only once during the lifetime of the program.
The arg instruction is interpreted as indexing the list of command-line parameters
Block 1 (main)
The 1st block is interpreted as the program entry point.
This block will be called once on each input record that the program is asked to process during its life. Code in this block may use the in, sys, and out instructions, and may call any other block besides 0 or 1.
The arg instruction is interpreted as indexing the list of command-line parameters.
Block n
Remaining block indices are reserved for future use.
Instructions
Instructions are encoded as a Rust enumeration. The instruction set is chosen to allow a fixed-sized instruction word of 64-bits or smaller, and so the use of immediate values is limited to u16 or smaller. All instructions operate on an implicit value stack.
Instruction::Load(atom)Instruction::Store(atom)
Load or store a local by name.
The item moves between the value stack and the locals within the current scope.
Instruction::Const(Addr)
Places a constant value onto the value stack.
Instruction::LCons(n), Instruction::MCons(n)
Dynamically constructs a list (or map) from the top n (or top 2 * n) items on the stack.
Sends top of stack to output stream or current capture, and ends processing of the current cycle.
Returns control to the implementation.
Instruction::Debug
Send a string representation of the top of stack, without consuming it, to stderr.
Instruction::Placeholder(Single)
Place the "identity thunk" onto the stack.
IR::Index(IndexType)
IndexType::Array
IndexType::Record
Retrieve the appropriate element from the given collection. Will trap with TypeMismatch if the collection type is not congruent.
Instruction::Matches(TypeTag)
Tests whether the given value matches the given type. TypeTag is discussed elsewhere.
Instruction::Coerce(TypeTag)
Tries to convert the value at the top of stack to the type given by the tag.
Only the following coercions are defined, and some of them may fail at runtime.
Blanket instruction wrapper for the usual family of arithmetic and logic operations on int, float, and string. In addition, BinOp::Add is overloaded to mean concatenation for lists and strings, and union for maps.
Types and Values
Values are typed, and carry their type with them (Value is rust enum). Values flow from a source instruction, through zero or more operations, to a sink instruction:
Source Instructions:
Instruction::In - value depends on runtime data
Instruction::Load - lexical binding
Instruction::Const - load from constant table
Sink instructions:
Instruction::Out - value is part of program output
Instruction::Store - lexical binding
Primitive Types
These are "unboxed" in the implementation.
Unicode Strings (utf8)
Bytestrings (raw)
U8, U16, U32, U64
I8, I16, I32, I64
F32, F64
Int, UInt, Float
These types are heap-allocated:
TypeTag
List
Record
Type Values
uDLang needs runtime reflection to perform input validation
TBD.
We can either use msgpack extension fields.
Or we can define an "in-band" encoding for type values.
A callable is a sequence of instructions plus meta-data:
named list of arguments
number of returns (must be 0 or 1).
Closure:
a tuple of (Block, captures)
These spring into existence when a block is returned or passed as a function argument.
Thunk:
special case of closure for partial evaluation
see #37
The implementation may make a type-level distinction between functions, closures, bound-methods and thunks. While closely related, each may need special handling for performance reasons.
Partial Evaluation (Thunks)
See Issue #37
Collections / Composite Types
list
record
Lists support numeric indexing. Lists support concatenation (+). An lcons(n) operation creates a new list from the top N stack elements. A map call over a list expects a function of one argument. A fold expects a function of two arguments.
Records support arbitrary key indexing. Records support update, union, and intersection operations. mapcons(n) creates a map from the top (2 * n) stack elements, interpreting them as key, value pairs. A map call over a record expects a function of two arguments. A fold call expects a function of 3 arguments.
uDLang executes as follows:
Front End
A program is read from a file and parsed into an AST.
The AST is processed:
library dependencies are resolved and recursively compiled
bidirectional type-check pass is performed
out
statement matches declaredOutput
.IR Lowering
stack-folding (inlining pass / futamura projection).
Const folding
Load Time
The compiled, optimized script kernel is loaded into the interpreter. The implementation waits for the first record to be received on the input pipe. Once received:
shape
record.input
type declared by the script.shape
record is sent downstream, corresponding to the output type.The interpreter transitions to "runtime" mode.
Runtime
Interpreter loops in a cycle.
out
instruction, or reaches the end of the main block.When execution encounters an
out
instruction:stdout
, andstdout
is flushed.When execution encounters a
trap
instruction:When execution encounters a
throw
instruction:IR
The IR is geared towards the straight-forward Rust implementation. It represents a trade-off between simplicity of the virtual machine, the simplicity of code generation, and other concerns such as code density and locality.
As such, there is tons of head-room for future optimizations here.
Programs
A program consists of a structure with the following data:
Blocks
Blocks are flat sequences of instructions in reverse-polish notation.
arg
instruction transfers positional arguments from the given frame to the stack.ret
instruction can be used for early return from a block.throw
instruction unwinds the call stack until a frame with an appropriate handler is reached.Upon return from a block, if the call-stack is not empty, the value stack is contracted to contain the correct return values according to the callable, and execution resumes from the calling block, at the return position indicated by the call stack.
If the call stack is empty, the result depends on which block was finished:
Block 0 (
init
)The 0th block is interpreted as a top-level procedure which performs load-time initialization. This block will be called once during the life-cycle of the program, at load time. It roughly corresponds to the top-level statements in a script which occur before the input / output declarations, though I am considering a mechanism to allow explicitly placing user code in this section (see #16).
Code in this block may not use the
in
in instruction. Nor may it call another block which uses thein
instruction.Code in this block may use
sys
instruction. Code in this block may use theout
instruction.out
instructions in this block yield output only once during the lifetime of the program.The
arg
instruction is interpreted as indexing the list of command-line parametersBlock 1 (
main
)The 1st block is interpreted as the program entry point.
This block will be called once on each input record that the program is asked to process during its life. Code in this block may use the
in
,sys
, andout
instructions, and may call any other block besides 0 or 1.The
arg
instruction is interpreted as indexing the list of command-line parameters.Block n
Remaining block indices are reserved for future use.
Instructions
Instructions are encoded as a Rust enumeration. The instruction set is chosen to allow a fixed-sized instruction word of 64-bits or smaller, and so the use of immediate values is limited to u16 or smaller. All instructions operate on an implicit value stack.
Instruction::Load(atom)
Instruction::Store(atom)
Load or store a local by name.
The item moves between the value stack and the locals within the current scope.
Instruction::Const(Addr)
Places a constant value onto the value stack.
Instruction::LCons(n)
,Instruction::MCons(n)
Dynamically constructs a list (or map) from the top n (or top 2 * n) items on the stack.
Instruction::In
in
: place the input record onto the stackInstruction::Call(CallType)
Call types:
CallType::Always
- consumes callable and argsCallType::If
-<args...> <boolean> <callable>
CallType::IfElse
-<args...> <boolean> <callable> <callable>
CallType::Map
-<iterable> <callable>
CallType::Fold
-<value> <iterable> <callable>
Instruction::Out
Instruction::Debug
Send a string representation of the top of stack, without consuming it, to
stderr
.Instruction::Placeholder(Single)
Place the "identity thunk" onto the stack.
IR::Index(IndexType)
IndexType::Array
IndexType::Record
Retrieve the appropriate element from the given collection. Will trap with
TypeMismatch
if the collection type is not congruent.Instruction::Matches(TypeTag)
Tests whether the given value matches the given type.
TypeTag
is discussed elsewhere.Instruction::Coerce(TypeTag)
Tries to convert the value at the top of stack to the type given by the tag. Only the following coercions are defined, and some of them may fail at runtime.
I32 -> F32
U32 -> F32
I32 -> U32
F32 -> F64
Instruction::Binary(BinOp)
,Instruction::Unary(UnOp)
Blanket instruction wrapper for the usual family of arithmetic and logic operations on int, float, and string. In addition,
BinOp::Add
is overloaded to mean concatenation for lists and strings, and union for maps.Types and Values
Values are typed, and carry their type with them (Value is rust enum). Values flow from a source instruction, through zero or more operations, to a sink instruction:
Source Instructions:
Instruction::In
- value depends on runtime dataInstruction::Load
- lexical bindingInstruction::Const
- load from constant tableSink instructions:
Instruction::Out
- value is part of program outputInstruction::Store
- lexical bindingPrimitive Types
These are "unboxed" in the implementation.
These types are heap-allocated:
Type Values
uDLang needs runtime reflection to perform input validation
TBD.
Need to be able to represent arbitrary shapes:
Idea: postfix type notation...
TInt TList TInt TMap TInt
TInt TFloat TStr TFunc(2)
Const("three") TVal Const("four") Const("five") TVal TInt TField TRecord(1) TSum(3)
Callable Types
The primitive callable a
Block
value:A callable is a sequence of instructions plus meta-data:
Closure:
(Block, captures)
Thunk:
The implementation may make a type-level distinction between functions, closures, bound-methods and thunks. While closely related, each may need special handling for performance reasons.
Partial Evaluation (Thunks)
See Issue #37
Collections / Composite Types
Lists support numeric indexing. Lists support concatenation (+). An lcons(n) operation creates a new list from the top N stack elements. A map call over a list expects a function of one argument. A fold expects a function of two arguments.
Records support arbitrary key indexing. Records support update, union, and intersection operations.
mapcons(n)
creates a map from the top (2 * n) stack elements, interpreting them as key, value pairs. A map call over a record expects a function of two arguments. A fold call expects a function of 3 arguments.