emdash / udlang

A practical, functional language for stream processing.
GNU Lesser General Public License v3.0

Execution Model for uDLang MVP #36

Open emdash opened 3 years ago

emdash commented 3 years ago

uDLang executes as follows:

Front End

A program is read from a file and parsed into an AST.

The AST is processed:

Const folding

Load Time

The compiled, optimized script kernel is loaded into the interpreter. The implementation waits for the first record to be received on the input pipe. Once received:

The interpreter transitions to "runtime" mode.

Runtime

Interpreter loops in a cycle.

When execution encounters an out instruction:

When execution encounters a trap instruction:

When execution encounters a throw instruction:

IR

The IR is geared towards a straightforward Rust implementation. It represents a trade-off between the simplicity of the virtual machine, the simplicity of code generation, and other concerns such as code density and locality.

As such, there is tons of head-room for future optimizations here.

Programs

A program consists of a structure with the following data:

Blocks

Blocks are flat sequences of instructions in reverse-polish notation.

Upon return from a block, if the call-stack is not empty, the value stack is contracted to contain the correct return values according to the callable, and execution resumes from the calling block, at the return position indicated by the call stack.
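The return rule above could be sketched roughly as follows. The `Frame` and `Resume` types and their field names are hypothetical, as is the idea that each call-stack entry records the value-stack height at call time; nothing here is taken from the actual implementation.

```rust
#[derive(Clone, Debug, PartialEq)]
pub enum Value {
    Int(i64),
}

/// One call-stack entry. All field names are assumptions made for this sketch.
pub struct Frame {
    pub block: usize,      // index of the calling block
    pub return_pc: usize,  // position in the calling block at which to resume
    pub stack_base: usize, // value-stack height when the call was made
    pub returns: usize,    // number of return values the callable produces
}

#[derive(Debug, PartialEq)]
pub enum Resume {
    /// Continue in the calling block at the recorded position.
    At { block: usize, pc: usize },
    /// The call stack was empty: a top-level block has finished.
    Finished,
}

/// Handle the end of a block: contract the value stack so that only the
/// return values remain above the caller's portion, then report where
/// execution should continue.
pub fn ret(call_stack: &mut Vec<Frame>, values: &mut Vec<Value>) -> Resume {
    match call_stack.pop() {
        None => Resume::Finished,
        Some(frame) => {
            // Clamp so a misbehaving callee cannot make the ranges invalid.
            let base = frame.stack_base.min(values.len());
            let keep_from = values.len().saturating_sub(frame.returns).max(base);
            // Drop the callee's temporaries, keeping the top `returns` values.
            values.drain(base..keep_from);
            Resume::At { block: frame.block, pc: frame.return_pc }
        }
    }
}
```

With one caller value and three callee values on the stack and `returns == 1`, only the caller value and the topmost callee value survive the return.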

If the call stack is empty, the result depends on which block was finished:

Block 0 (init)

The 0th block is interpreted as a top-level procedure which performs load-time initialization. This block will be called once during the life-cycle of the program, at load time. It roughly corresponds to the top-level statements in a script which occur before the input / output declarations, though I am considering a mechanism to allow explicitly placing user code in this section (see #16).

Code in this block may not use the in instruction, nor may it call another block which uses the in instruction.

Code in this block may use the sys instruction. Code in this block may use the out instruction. out instructions in this block yield output only once during the lifetime of the program.

The arg instruction is interpreted as indexing the list of command-line parameters.

Block 1 (main)

The 1st block is interpreted as the program entry point.

This block will be called once on each input record that the program is asked to process during its life. Code in this block may use the in, sys, and out instructions, and may call any other block besides 0 or 1.

The arg instruction is interpreted as indexing the list of command-line parameters.
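The life-cycle of the two blocks can be illustrated with a small driver. This is only a sketch: `run`, `init`, and `main_block` are hypothetical names, closures stand in for compiled blocks, and records and output are modelled as plain strings.

```rust
/// Hypothetical driver. `init` stands in for block 0 and `main_block`
/// for block 1; neither name comes from the real implementation.
pub fn run<R, I, M>(records: R, mut init: I, mut main_block: M) -> Vec<String>
where
    R: IntoIterator<Item = String>,
    I: FnMut(&mut Vec<String>),
    M: FnMut(&str, &mut Vec<String>),
{
    let mut output = Vec::new();

    // Block 0 runs exactly once, at load time. It receives no input
    // record, which mirrors the rule that it may not use `in`.
    init(&mut output);

    // Block 1 runs once for each input record.
    for record in records {
        main_block(&record, &mut output);
    }
    output
}
```

Anything block 0 emits appears once at the start of the output, followed by whatever block 1 emits for each record.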

Block n

Remaining block indices are reserved for future use.

Instructions

Instructions are encoded as a Rust enumeration. The instruction set is chosen to allow a fixed-size instruction word of 64 bits or smaller, and so the use of immediate values is limited to u16 or smaller. All instructions operate on an implicit value stack.
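One way to keep the size constraint honest is a compile-time assertion. The sketch below uses a subset of the variant names from this issue; the payload types `Atom` and `Addr` are assumed to be u16 indices, which is a guess rather than the real definition.

```rust
use std::mem::size_of;

/// Assumed: an index into a table of interned names.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Atom(pub u16);

/// Assumed: an index into the constant pool.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Addr(pub u16);

/// A subset of the instruction set, with every immediate kept to u16.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Instruction {
    Load(Atom),
    Store(Atom),
    Const(Addr),
    LCons(u16),
    MCons(u16),
    In,
    Out,
    Debug,
}

/// The build fails if an instruction ever outgrows a 64-bit word.
const _: () = assert!(size_of::<Instruction>() <= 8);

/// Exposed so the size can also be checked at runtime.
pub fn instruction_size() -> usize {
    size_of::<Instruction>()
}
```

Because the check lives in a `const` item, adding a variant with an oversized payload is rejected at compile time rather than discovered later.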

Instruction::Load(atom), Instruction::Store(atom)

Load or store a local by name.

The item moves between the value stack and the locals within the current scope.

Instruction::Const(Addr)

Places a constant value onto the value stack.

Instruction::LCons(n), Instruction::MCons(n)

Dynamically constructs a list (or map) from the top n (or top 2 * n) items on the stack.
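A minimal sketch of the two constructors, under some assumptions not stated above: items keep the order in which they were pushed, map keys are strings pushed before their values, and running out of stack items is a trap named `StackUnderflow` (a hypothetical name). The `Value` enum here is a stand-in, not the real one.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Debug, PartialEq)]
pub enum Value {
    Int(i64),
    Str(String),
    List(Vec<Value>),
    Map(BTreeMap<String, Value>),
}

#[derive(Debug, PartialEq)]
pub enum Trap {
    StackUnderflow, // hypothetical trap name
    TypeMismatch,
}

/// LCons(n): replace the top n stack items with a single list.
pub fn lcons(stack: &mut Vec<Value>, n: usize) -> Result<(), Trap> {
    if stack.len() < n {
        return Err(Trap::StackUnderflow);
    }
    let items = stack.split_off(stack.len() - n);
    stack.push(Value::List(items));
    Ok(())
}

/// MCons(n): replace the top 2 * n stack items with a single map,
/// reading them as key, value, key, value, ...
pub fn mcons(stack: &mut Vec<Value>, n: usize) -> Result<(), Trap> {
    let count = n.checked_mul(2).ok_or(Trap::StackUnderflow)?;
    if stack.len() < count {
        return Err(Trap::StackUnderflow);
    }
    let mut items = stack.split_off(stack.len() - count).into_iter();
    let mut map = BTreeMap::new();
    while let (Some(key), Some(value)) = (items.next(), items.next()) {
        match key {
            Value::Str(k) => {
                map.insert(k, value);
            }
            // Assumption: only string keys are accepted in this sketch.
            _ => return Err(Trap::TypeMismatch),
        }
    }
    stack.push(Value::Map(map));
    Ok(())
}
```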

Instruction::In

in: place the input record onto the stack

Instruction::Call(CallType)

Call types:

Instruction::Out

Instruction::Debug

Send a string representation of the top of stack, without consuming it, to stderr.
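As a rough illustration, the key property is that the stack is only borrowed, never popped. The helper name `debug_top` is hypothetical, and using Rust's `Debug` formatting as the "string representation" is an assumption.

```rust
use std::fmt::Debug;

/// Write the top of the stack to stderr without consuming it.
/// The rendered text is also returned so callers can inspect it.
/// Returns None if the stack is empty.
pub fn debug_top<T: Debug>(stack: &[T]) -> Option<String> {
    let text = format!("{:?}", stack.last()?);
    eprintln!("{}", text);
    Some(text)
}
```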

Instruction::Placeholder(Single)

Place the "identity thunk" onto the stack.

Instruction::Index(IndexType)

Retrieve the appropriate element from the given collection. Will trap with TypeMismatch if the collection type is not congruent.
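A sketch of how indexing and the TypeMismatch trap might fit together, assuming lists are indexed by integers and maps by string keys. The `Missing` trap for an absent element is a hypothetical addition; this issue does not say what happens in that case.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Debug, PartialEq)]
pub enum Value {
    Int(i64),
    Str(String),
    List(Vec<Value>),
    Map(BTreeMap<String, Value>),
}

#[derive(Debug, PartialEq)]
pub enum Trap {
    TypeMismatch,
    Missing, // hypothetical: the element does not exist
}

/// Look up `key` in `collection`, trapping when the two do not fit together.
pub fn index(collection: &Value, key: &Value) -> Result<Value, Trap> {
    match (collection, key) {
        // Lists take a non-negative integer position.
        (Value::List(items), Value::Int(i)) => {
            let slot = if *i >= 0 { items.get(*i as usize) } else { None };
            slot.cloned().ok_or(Trap::Missing)
        }
        // Maps take a string key in this sketch.
        (Value::Map(map), Value::Str(k)) => map.get(k).cloned().ok_or(Trap::Missing),
        // Any other pairing is not congruent.
        _ => Err(Trap::TypeMismatch),
    }
}
```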

Instruction::Matches(TypeTag)

Tests whether the given value matches the given type. TypeTag is discussed elsewhere.

Instruction::Coerce(TypeTag)

Tries to convert the value at the top of stack to the type given by the tag. Only the following coercions are defined, and some of them may fail at runtime.

Instruction::Binary(BinOp), Instruction::Unary(UnOp)

Blanket instruction wrapper for the usual family of arithmetic and logic operations on int, float, and string. In addition, BinOp::Add is overloaded to mean concatenation for lists and strings, and union for maps.
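The overloading of BinOp::Add could look roughly like this. Several details are assumptions made for the sketch: integer addition wraps on overflow, mixed int/float operands trap, and on a key conflict the right-hand map wins the union.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Debug, PartialEq)]
pub enum Value {
    Int(i64),
    Float(f64),
    Str(String),
    List(Vec<Value>),
    Map(BTreeMap<String, Value>),
}

#[derive(Debug, PartialEq)]
pub enum Trap {
    TypeMismatch,
}

/// Add two values: arithmetic for numbers, concatenation for strings
/// and lists, union for maps.
pub fn add(lhs: Value, rhs: Value) -> Result<Value, Trap> {
    match (lhs, rhs) {
        // Assumption: wrap on overflow rather than trap.
        (Value::Int(a), Value::Int(b)) => Ok(Value::Int(a.wrapping_add(b))),
        (Value::Float(a), Value::Float(b)) => Ok(Value::Float(a + b)),
        (Value::Str(mut a), Value::Str(b)) => {
            a.push_str(&b);
            Ok(Value::Str(a))
        }
        (Value::List(mut a), Value::List(b)) => {
            a.extend(b);
            Ok(Value::List(a))
        }
        // Assumption: entries from the right operand replace duplicates.
        (Value::Map(mut a), Value::Map(b)) => {
            a.extend(b);
            Ok(Value::Map(a))
        }
        // Assumption: no implicit conversion between differing types.
        _ => Err(Trap::TypeMismatch),
    }
}
```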

Types and Values

Values are typed, and carry their type with them (Value is a Rust enum). Values flow from a source instruction, through zero or more operations, to a sink instruction:

Source Instructions:

Sink instructions:

Primitive Types

These are "unboxed" in the implementation.

These types are heap-allocated:

Type Values

uDLang needs runtime reflection to perform input validation.

TBD.

Need to be able to represent arbitrary shapes:

Idea: postfix type notation...

Callable Types

The primitive callable is a Block value:

A callable is a sequence of instructions plus meta-data:

Closure:

Thunk:

The implementation may make a type-level distinction between functions, closures, bound-methods and thunks. While closely related, each may need special handling for performance reasons.

Partial Evaluation (Thunks)

See Issue #37

Collections / Composite Types

Lists support numeric indexing. Lists support concatenation (+). An lcons(n) operation creates a new list from the top n stack elements. A map call over a list expects a function of one argument. A fold expects a function of two arguments.

Records support arbitrary key indexing. Records support update, union, and intersection operations. An mcons(n) operation creates a map from the top 2 * n stack elements, interpreting them as key, value pairs. A map call over a record expects a function of two arguments. A fold call over a record expects a function of three arguments.
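The arities above can be made concrete with ordinary Rust generics. The helper names are hypothetical, and two details are assumptions: the accumulator comes first in fold callbacks, and a map over a record keeps the keys and replaces the values.

```rust
use std::collections::BTreeMap;

/// map over a list: the callback takes one argument (the element).
pub fn map_list<T, U>(items: &[T], f: impl Fn(&T) -> U) -> Vec<U> {
    items.iter().map(f).collect()
}

/// fold over a list: the callback takes two arguments (accumulator, element).
pub fn fold_list<T, A>(items: &[T], init: A, f: impl Fn(A, &T) -> A) -> A {
    items.iter().fold(init, f)
}

/// map over a record: the callback takes two arguments (key, value).
pub fn map_record<V, U>(
    record: &BTreeMap<String, V>,
    f: impl Fn(&str, &V) -> U,
) -> BTreeMap<String, U> {
    record
        .iter()
        .map(|(k, v)| (k.clone(), f(k.as_str(), v)))
        .collect()
}

/// fold over a record: the callback takes three arguments
/// (accumulator, key, value).
pub fn fold_record<V, A>(
    record: &BTreeMap<String, V>,
    init: A,
    f: impl Fn(A, &str, &V) -> A,
) -> A {
    record.iter().fold(init, |acc, (k, v)| f(acc, k.as_str(), v))
}
```

The extra argument in the record versions is the key, which is why each record callback needs one more parameter than its list counterpart.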