Understanding of data structures relevant to fngi design

civboot / fngi

a readable language that grows from the silicon

The Unlicense

60 stars 3 forks source link

Understanding of data structures relevant to fngi design #21

Open orbitaldecay opened 1 year ago

orbitaldecay commented 1 year ago

@vitiral After spending some time parsing our conversations and reading over the compiler source, I have the beginning of a rudimentary understanding of the design of fngi. I will expound on it here, both to give you an opportunity to correct any misunderstandings I have and for the benefit of others who come after me and seek to understand.

Fngi is essentially stack-based. There are several run-time data structures and several compile-time data structures that are relevant to a fngi programmer.

Significant run-time data structures are:

The run-time working stack. This is explicit in the spor interpreter and isn't part of the fngi compiler proper. Say one compiles a fngi program, such as 1 + 3. When the resulting program is executed, the values 1 and 3 are pushed to the run-time working stack, then the + function pops two values from the run-time working stack and pushes their sum (4) back to the run-time working stack.
The run-time call stack. This is used for return addresses for functions declared with fn in the usual way.

Significant compile-time data structures are:

The type dictionaries. These are TyDb and TyDbImm in the fngi compiler source code. These are used for type checking regular functions and functions which are executed during compile-time, respectively. The details aren't particularly important to this discussion.
The compile-time working stack. This is explicit in the fngi compiler (WS_ADD and WS_POP are used to access it). Functions can be executed during compile-time by prepending the function token with imm#. In the fngi compiler source, this sets the asImm flag to true on single() which is passed recursively. In this case, they use the compile-time working stack rather than the run-time working stack (since obviously the run-time working stack doesn't exist yet). This allows for the creation of a sort of macro in fngi itself. Because the compiler machinery is exposed as functions in the fngi language, this basically allows you to rewrite the compiler via macros a la FORTH.
The compile-time call stack? Not clear on this, didn't dig deep enough.

A specific type of function is the syn function. These behave slightly differently than regular functions in a couple ways, but most significantly this is a compile-time function by default (that is, asImm is true by default- doesn't require the use of imm# to achieve this). Syn functions receive a single parameter on the compile-time working stack: a single boolean which indicates the current state of asImm in the compiler call chain. This flag is generally ignored and the syn function is simply executed (haven't looked to deep for exceptions). The compiler passes this parameter. I think syn functions are really more specific to compiler implementation than the actual design of the fngi language. That is, a fngi compiler could theoretically be written without the concept of a syn function, though it would dramatically complicate the grammar (correct me if I'm wrong about that).

This is the understanding I have at the moment (complete with inaccuracies I'm sure!). Please correct what I'm not getting. Thanks!

vitiral commented 1 year ago

I think there is some confusion of terminology. Part of this is my fault, the other part is that fngi doesn't do things in a "conventional way" (which is also my fault) and FORTH was never exactly known for easy to understand terminology.

Definitions

So let's define some terms:

Compiler: the program that runs on fngi source code which has a set runtime: it starts when the user invokes the compiler program and ends when all source-code tokens have been compiled.
Compile (verb: compiled, compiling, etc): the action of the compiler -- nothing else. We won't be using this word very much anymore, since it often makes things confusing :D
executed: when a function is executed, aka the spor VM runs it. The function could be defined either natively (in C) or in spor. Functions can be executed at any time after they have been fully defined.
assembled: the act of writing spor assembly to memory, typically to the currently assembling function. Something executed (in the future) in the currently assembling function would be executed at "assembly time" or "asmTime".
immediate: causes the compiler to execute a function instead of assembling it. Something executed at such a time is executed in "immediate time" or "immTime".

Data Structures

The Spor VM consists of:

A "Kern" object. Essentially this has the global state of the VM. This includes
- A fiber pointer which has a return stack and a working stack. Eventually there will be more than one fiber, and we will be able to do concurrency.
- Global variables, which includes the current compiler state -- such as the TyDb's and the current file/s being compiled (where tokens are coming from), as well as the current token. Most compiler state is in global variables.

Now as you say, fngi is essentially stack based. This even applies to fngi's linked lists: which add/pop from the start (instead of the end) and so are basically a stack. The only non-stack-like data structures I can think of are:

Bst: binary search tree, used for the fngi dictionaries.
Doubly linked lists: used for the block allocator (to track 4KiB blocks).

Just as many data structures in fngi are stacks, many algorithms use stacks. Recursion is a very common technique.

All of these types are (theoretically) present for any fngi function no matter whether it is labeled "SYN" or not. Any function can (in theory) access the TyDb's and parse tokens. From the perspective of a spor function, there is no difference between asmTime and immTime. The only distinction for any function in fngi is if (and only if) it takes the asImm variable. The only thing that variable does is signal to the function whether it should be assembling stuff or executing stuff.

Sidenote: we will probably mark some variables/functions to prevent them being compiled to native code, since most programs won't be trying to parse fngi tokens after they are compiled to x86. Fngi really needs dead-code analysis. For most programs, there will be a lot of "dead code" that was only used at imm-time but should not be assembled to x86.

So, this means:

There is no such thing as a "runtime working stack" and a "compile time working stack" -- there is only one working stack, which is the working stack of the current fiber (and so far there is only one fiber). The same is true for the return stack -- there is only one. The only thing that is different between asm/imm times is the TyDbs.
Pre-pending a function with imm# causes it be executed immediately -- but the function itself won't know it is being executed immediately (only syn functions know, because the compiler passes them asImm at the top of the working stack).

Syn Functions

Syn functions are the bread-and-butter of fngi. Practically all "syntax" is written in fngi, including (...), if / elif / else, blk / while, var, struct, fn, etc.

Syn functions are just regular functions that have a SYN bit set and have the type signature fn mySynFn cstate:U2 -> Void do (...). When the compiler encounters a SYN token (a token who's name points to a SYN function when looked up) it pushes the cstate to the working stack (as you say -- recursively passes it) and executes it. A syn function may in-turn execute non-syn functions. Those functions will use the same working stack and have access the same global variables as the syn function does.

Could you have fngi without syn functions? I suppose you could implement all these things as compiler intrinsics, but then you wouldn't be able to extend the language by implementing syn functions -- and that's kind of the point :D

orbitaldecay commented 1 year ago

I'm going to be a little pedantic because I don't feel much closer to understanding what the big picture is. Let's focus on definitions at the moment since this is certainly a foundational point.

Compiler: the program that runs on fngi source code which has a set runtime: it starts when the user invokes the compiler program and ends when all source-code tokens have been compiled.

What is the input of the compiler?

What is the output of the compiler? Compile (verb: compiled, compiling, etc): the action of the compiler -- nothing else. We won't be using this word very much anymore, since it often makes things confusing :D

Does "compile" refer to the act of transforming compiler inputs to compiler outputs (see "compiler")? I'm not sure how we can talk about a compiler without having a clear definition of what we mean by compile? assembled: the act of writing spor assembly to memory, typically to the currently assembling function. Something executed (in the future) in the currently assembling function would be executed at "assembly time" or "asmTime".

When "assembling", spor assembly is written to memory. Is some transformation performed?

Traditionally, the word assemble is reserved for the act of converting assembly (human readable) to some machine code (binary format). Is such a transformation performed when we "assemble"?

Where does spor bytecode fit into this? Is the compiler actually generating spor assembly, or spor bytecode?

What is the "currently assembling function"? immediate: causes the compiler to execute a function instead of assembling it. Something executed at such a time is executed in "immediate time" or "immTime".

Is the action of the compiler to "compile", to "assemble", or both?

Does the compiler "compile" immediate functions and then execute them?

How does "immediate time" differ from the more common phrase "compile-time"?

Hopefully clarifying some of this will get us started on the right path!

orbitaldecay commented 1 year ago

One other point that I am quite confused about:

There is no such thing as a "runtime working stack" and a "compile time working stack" -- there is only one working stack,

Can fngi (theoretically) be compiled to native code? If so, I do not understand how it is possible that the working stack used by the native code is the same working stack that was used by the compiler to produce the native code (which may not even exist on the same machine).

vitiral commented 1 year ago

compiler input? source files. Eventually also a build config (which specifies the source files)
compiler output? currently: the spor assembly and TyDicts. Also: successful execution of specific tests. Eventually: the above plus an executable elf file or linkable object file.

Does "compile" refer to the act of transforming compiler inputs to compiler outputs (see "compiler")? I'm not sure how we can talk about a compiler without having a clear definition of what we mean by compile?

"compile" means assembling or executing tokens.

When "assembling", spor assembly is written to memory. Is some transformation performed?

tokens are processed as we discussed (literals, variables, function calls, etc). These are transformed into spor bytecode and written to the current codeBuf -- which is reserved (among other places) by the fn syn function.

No other transformation happens to the bytecode itself (yet: eventually we would want to compile spor -> x86 or other hardware assembly).

Traditionally, the word assemble is reserved for the act of converting assembly (human readable) to some machine code (binary format). Is such a transformation performed when we "assemble"?

Perhaps a better term should be used then. We are creating spor bytecode -- so we could call it "encoding"?

Where does spor assembly exist prior to being in memory?

I'm confused by the question. Spor assembly is defined by the VM (which is mostly defined in executeInstr). It is also documented in src/spor.zoa.

What is the "currently assembling function"?

Functions are compiled one token at a time. When you define the function

\ returns 8 if input != 0, else returns 15.
fn foo stk:S -> S do (
  if(\stk) do 7
  else      do 14
  + 1
)

The compiler will do the following with scanning:

encounters fn, look it up and see that it is a syn so it will execute it with asImm=false
fn is a syn function. It is defined to get one token as the name of the function. It then does a function signature which is <input types> -> <output types> do (do ends the signature).
fn then sets a global variable to the "currently compiling function". It also reserves (default) 512 bytes from the allocator and sets codeBuf as the place where the compiler encodes/assembles spor bytecode. It also sets up the TyDbAsm, etc.
fn then compiles one token (calls Kern_compFn) as the "function body". That token is ( which is a syn function you are familiar with and ends at ).
After compiling the one token, it does cleanup: de-allocating any remaining memory, setting the "currently compiling function" back to the previous value (which may be NULL), etc.

So in the above example, foo is the "currently assembling function" until the end of ) (until fn has "assembled one token")

Is the action of the compiler to "compile", to "assemble", or both?

The compiler compiles, which is: scan a token and then either execute or assemble it depending on the token and asImm

Does the compiler "compile" immediate functions and then execute them?

This is the type of terminology I'd like to get away from. I'm not sure I know what you mean.

How does "immediate time" differ from the more common phrase "compile-time"?

I think they are probably the same, although the terminology gets very confusing. Functions that were literally just assembled/encoded can be immediately executed. Also from a function's point of view there is no difference.

Can fngi (theoretically) be compiled to native code? If so, I do not understand how it is possible that the working stack used by the native code is the same working stack that was used by the compiler to produce the native code (which may not even exist on the same machine).

I'd say spor can be compiled to native code -- but some functions/variables/etc would probably need to be marked as immediate-only (they would be dis-allowed from being converted to native code).

Spor is just bytecode: it does operations on the working stack, calls function, accesses globals, etc. It is only some of the globals that might not exist on the target machine -- for instance the source-code file obviously won't exist on your webapp. Neither will the TyDb's, etc.

Think of it this way: let's say you define a Stk data structure and methods in fngi/spor. Can you execute that at "compile time", at "run time" or both? Both obviously -- presuming you have a pointer to your Stk type then it is a valid data structure wherever/whenever it is run.

orbitaldecay commented 1 year ago

Thank you so much for your comprehensive response. I think it may be appropriate to reevaluate the language we are using to communicate.

So that you can better understand the source of my confusion, I will articulate my understanding of the traditional definitions of the words we are defining (or redefining). I will assume the definitions of the terms program and execute are well defined (I don't know how to define these in a short amount of space). Note that by program I do not necessarily mean an executable file (though it is often overloaded to also mean this). A subroutine is also a program. The important thing is that a program is an algorithm expressed concretely in a specific language.

compiler (noun). A compiler is a program that receives another program in a source language as input and produces a program in a target language as output while maintaining equality between the semantics of the input and output programs (e.g. a C compiler takes a program in the C programming language and produces an equivalent program in x86 machine code).
compile (verb). To compile is to execute a compiler.
assembly (noun). A human-readable language which is extremely "low level". This is classically ill defined, though I think you have a sense of what I mean.
machine code (noun). A non human-readable language which is intended to be executed directly by hardware. Shares many similarities to bytecode, though bytecode is typically executed by a virtual machine.
assembler (noun). A specific type of compiler for which the source language is an assembly language and the target language is a machine code (or sometimes a bytecode).
assemble (verb). The act of executing an assembler.

The process of converting from the fngi language to spor bytecode I would call compiling in the above sense. I would probably use the term evaluate or process (verb) to refer to the act of taking a fngi token and either compiling it (using my language) or executing it. I think the way we have been using the word execute is congruent with the above definition.

I'd say spor can be compiled to native code -- but some functions/variables/etc would probably need to be marked as immediate-only (they would be dis-allowed from being converted to native code).

Compilation (by the above definition at least) is transitive. That is, if fngi can be compiled to spor, and spor can be compiled to machine code, then fngi can be compiled to machine code. In that case, how does the working stack used by the fngi compiler relate to the working stack used by the machine code (in a semantic sense, at least)?

vitiral commented 1 year ago

I think all of the above looks good. The only change I would request is that the fngi compiler evaluates a token and then does one of two things: encodes it to spor (instead of "compiling" as you suggested) or executes it. "compile" becomes too overloaded, and what we are doing is "encoding" spor bytecode to codeBuf.

Compilation (by the above definition at least) is transitive. That is, if fngi can be compiled to spor, and spor can be compiled to machine code, then fngi can be compiled to machine code. In that case, how does the working stack used by the fngi compiler relate to the working stack used by the machine code (in a semantic sense, at least)?

Hmm... from the machine-code-compiler's point of view there is no working stack except conceptually. In particular, if the machine code is register-based, then it must convert what spor sees as a "working stack" into registers and temporary values. If the target machine's code is stack-based then it would keep a kind of conceptual "working stack" inside of it's basic-blocks.

orbitaldecay commented 1 year ago

I think all of the above looks good. The only change I would request is that the fngi compiler evaluates a token and then does one of two things: encodes it to spor (instead of "compiling" as you suggested) or executes it. "compile" becomes too overloaded, and what we are doing is "encoding" spor bytecode to codeBuf.

I think we should think very carefully about using language in a way that is not congruent with the way English speaking programmers use it. It may be that my understanding of some of this language is incorrect, which I am also happy to talk about. But, for the purposes of this conversation I am happy to use the word encode.

Hmm... from the machine-code-compiler's point of view there is no working stack except conceptually. In particular, if the machine code is register-based, then it must convert what spor sees as a "working stack" into registers and temporary values. If the target machine's code is stack-based then it would keep a kind of conceptual "working stack" inside of it's basic-blocks.

Let’s back up a bit. I think we mean two different things when we say “working stack”.

When I say “working stack”, I am talking about what you are calling “conceptual working stack”. I am not talking about something that is implemented in a programming language. I am talking about a semantic concept. That is, a piece of a mathematically well defined abstract machine. We’ll call this abstract machine a stack machine. It does not physically exist.

I am trying to describe the semantics of the fngi language precisely by describing what a fngi program does to a stack machine. We could describe the semantics of a fngi program in relation to a register machine, it would just be a more complex description.

When I say the run-time working stack, I am talking about the stack associated with a particular semantic interpretation of the program that a fngi compiler (or encoder) outputs. It may not explicitly “look like” a stack in it’s implementation. It’s just a concept I’m using to explain the behavior of the program.

orbitaldecay commented 1 year ago

One more comment for the night as I'm reading over our conversations, reading over the source code, and trying to figure out what is going on. Hopefully this conversation is useful to other people who are trying to understand what fngi is all about.

I think a considerable source of confusion stems from the term "fngi compiler". The "fngi compiler" isn't a compiler, it's a combo execution environment / VM. The compiler is one small piece of this. In the FORTH context these are usually called FORTH systems, or FORTH environments, or just FORTHs. I think a similar linguistic convention around fngi would make things a lot easier to understand.

Now that I understand this, I understand what you are trying to say about the compile-time working stack and the run-time working stack being the same stack. This statement is not generally true; you could theoretically compile code written in the fngi language to a native executable that runs independently without ever invoking a fngi environment. But, there is no actual distinction between "compile-time working stack" and "run-time working stack" in the fngi environment because there is no real distinction between compilation and execution in the fngi environment. Those two steps are deeply intertwined (as in a FORTH). Compilation and execution are occurring in the same environment at the same time (roughly speaking), so they share the same "working stack".

This isn't a property of the fngi language, or of a fngi compiler. It's a property of the fngi environment. This is all very similar to how FORTHs operate and is not a dramatic departure from FORTHs I've written. The biggest difference is that the fngi environment contains a compiler that compiles to spor bytecode and also contains a spor vm to run it, rather than compiling to machine code.

vitiral commented 1 year ago

I think you've nailed it. Sorry this has been so difficult: I had such a hard time explaining this, even though you have a large breadth of experience (having implemented multiple FORTHs). I'm very grateful to you for sticking through. Please let me know if the below makes more sense.

The fngi system

The fngi system, or you could say just fngi parses tokens and either executes them or compiles (formerly: encodes) them to spor bytecode. We will refer to it as just "fngi" unless we need to disambiguate between some other aspect of fngi (like "fngi syntax").

spor is not a system, but is rather a bytecode targeting an abstract stack machine. fngi includes a VM for spor's abstract stack machine: which we will call the fngiVM. Along with maintaining a working and return stack, the fngiVM has access to compiler intrinsic functions and global variables. This allows fngi to immediately execute functions which were only recently compiled into spor bytecode. It also allows those functions to access compiler intrinsics: in other words, functions written in fngi syntax can act as the fngi system: they scan and then either compile or execute tokens. syn functions are the bread and butter of this system.

The fngi system includes:

A VM for spor's stack machine, including a working and return stack, etc
compiler-intrinsic functions: scan, findTy, etc
compiler-intrinsic global variables: codeBuf, imm+comp TyDb's, TyDict's, etc
fngi/spor repl and debugger
etc

Assembling and bootstrapping

I think we might have the language to explain assembling machine code, as well as bootstrapping.

When we assemble spor to machine code, it will never include the fngi system. The final program will not be able to parse tokens or check the type stack once the spor bytecode has been fully assembled into machine code -- or anything else that is a compiler intrinsic.

Sidenote: we may allow you to inspect types in an assembled program, although I lean towards not allowing this. I do want to include debug symbols so that a debugger could inspect types though.

If we are bootstrapping fngi (aka using fngi to implement fngi), we are still not including the "current" fngi system. Instead, we are re-implementing the fngi system in fngi -- the output program will have it's own fngi system, which may be implemented entirely differently than the fngi system that compiled it.