Compile-time Evaluation/Expressions (CTE)

evincarofautumn commented 10 years ago

I would like to split compilation into two phases:

Evaluation of Kitten code to produce Kitten code
Compilation of the resultant code to produce an executable

My current thinking is to have a word or sigil that denotes CTE, and allow it to be applied to any term—or at least those terms using words for which source is available. I propose #(term) to denote phase 1 and '(term) to denote phase 2—ignoring character literals for the moment. For example, the term #(1 + 2) in phase 1 is exactly equivalent to the literal 3 in phase 2, and #('3) is another way to spell that.

Broadly, there are three salient issues in my mind: the generation of code, the generation of types, and the role of the compiler.

It should be possible to write a phase-1 definition times which accepts the symbol dup and the value 3, and produces the phase-2 expression dup dup dup. This implies that a phase-1 expression should be able to produce a vector of terms which are spliced into the phase-2 program.

#('dup 3 replicate)
#(['dup, 'dup, 'dup])
dup dup dup

Maybe #['3] denotes splicing only one term, as shorthand for #(['3]).

When generating code, you want the ability to quote and splice, i.e., to switch phases.

#('[1, 2, #(1 + 2)])
#('[1, 2, 3])
[1, 2, 3]

You want the ability to generate types, so that you can compute interesting dependent types, and produce type signatures for code that you generate. For instance, a function such as printf may take an arbitrary number of arguments determined by its format string. You should be able to express that #("%d" printf) denotes a function that accepts one integer argument.

Rather than splice at every call site, it may be desirable to write definitions where some of the arguments are passed at compile time, i.e., (1 + runtime junk) #"%d" printf. I don’t know exactly what type signatures for such functions should look like. #132 has some thoughts on this.

What’s less clear is whether the AST you generate should be typed. Users of Template Haskell tend to lament the fact that you can produce nonsensical ASTs, but Haskell’s ASTs are also significantly more complex than Kitten’s. I think it should be untyped, that is, every term you manipulate in phase 1 should have type term (or expr, or bikeshed) regardless of its phase 2 type. Phase 1 code would still be statically typed, so #(1 + "two") would not compile.

The defining words such as define and data could be modelled as phase-1 functions, which happen to have the side effect of manipulating definitions and types used during phase 2. That is, define would be a word in the compiler API, and a phase-1 program would be a Kitten program that happens to use the compiler API to produce a Kitten program compiled in phase 2. I’m still pretty fuzzy on the specifics of how to do this, and whether it’s worthwhile.

CC @jeaye

jeaye commented 10 years ago

Having spoken with you about this in irc, I'm left with a couple remaining questions.

How about mutability during the first phase?
- Can a function in a CTE have mutable state? What are the limits?
What about calling functions from a CTE which are not yet linked?
- The module system is not yet complete, but will linking be changed by CTEs referencing symbols outside the current file/module? What are the limits?
We will need ways of interacting with phase 2 entities. Which of these will be provided out of the box?
- How can we allow others to write libraries for performing different tasks with/on the entities?
- Rust macros allow for "instantiations" similar to templates. Could phase 2 entities allow for type replacement and other sorts of meta-refactoring?

I'm ok with not having strong types during the first phase. With that said I do worry that it'll make error reporting more complex (and unusable a la C++ templates) when we get to the second phase. This may not actually be the case, but we should strive to make error reporting equally good/bad in both stages.

Without knowing more about Kitten's generics, it's tough to envision this code in use. For example, I'm thinking about how to go about implementing the analogous std::numeric_limits<T> using phase 1 constructs. Alas, this requires extensive usage of specialization: something not yet supported. So what about a higher order function? Also tough for me to say.

If these two systems (generics and CTE) aren't designed together, they'll surely affect each other after the fact. I believe we won't get a good CTE system without know how we want to use it for generics.

evincarofautumn commented 10 years ago

How about mutability during the first phase?

Mutability is another topic. It will be available at compile time if/when Kitten has it generally.

What about calling functions from a CTE which are not yet linked?

At first, we should require that source be available for definitions used in a CTE, much like C++ templates. We can relax that restriction in the future by compiling the phase-1 program, rather than interpreting it. That could improve compile times, as well, since the interpreter is not fast. However, I’m not sure it should be the default, because it has a lot of moving parts. In particular, unless/until we have a native backend, it introduces an interphase dependency on an external C compiler.

We will need ways of interacting with phase 2 entities. Which of these will be provided out of the box?

Basically, we should provide a Kitten transliteration of the compiler’s internal representation of terms (the TrTerm data type) and allow people to pattern-match on that however they choose. Then there might be a set of convenience functions, such as:

define vectorElements (term -> [term]?):
  match:
  case MakeVector -> terms:
    terms some
  default:
    none

I will need to think about the interaction between generics and CTE more. C++ templates conflate them, and that turned out to be a mistake which they’re only now trying to rectify with constexpr. I think I may be able to find an elegant implementation of generics in terms of CTE.

jeaye commented 10 years ago

It will be available at compile time if/when Kitten has it generally.

That's an interesting idea. Imagine a compile-time snippet that downloads some code, loads it from a file, and puts it into the binary. The package manager could be a compile-time step. hah!

I'm certainly more interested in the types of refactoring we could do at compile-time. This opens the door for some interesting reflection-based serialization or similar techniques.

evincarofautumn commented 10 years ago

For security reasons, I/O can’t be allowed at compile time by default. But it would be nice to be able to say define version { #("git describe --long --tags --always" system) }. Automatic serialisation would also be a good use case.

evincarofautumn commented 7 years ago

One problem with this is that it makes cross-compilation harder: you need to run compile-time code on the target, not on the host.

jeaye commented 7 years ago

That's a good point. I'd imagine that emulating the target's environment (pointer size, for example) would be possible, but likely not worth it. My guess would be that most uses of CTE won't be platform-specific.

Another idea, which is crazy, is to, when cross-compiling, not generate the final binary. Instead, generate a binary which, which executed, runs the appropriate CTEs and then does codegen. Effectively, curry the compilation process to defer CTE and codegen to when the binary is first run (likely by the package installer) in the target environment.

evincarofautumn commented 7 years ago

That’s not so crazy—we could ship bytecode with a compiler stub. A simpler and more portable (albeit slower) approach would be a threaded code backend.

trans commented 7 years ago

Isn't this basically the same idea as Forth's compiler mode? Maybe you can take some ideas from that. https://www.forth.com/starting-forth/11-forth-compiler-defining-words/

evincarofautumn commented 7 years ago

They’re pretty different—both are metaprogramming systems, but Kitten’s is a macro system, in that compile-time code will operate on a representation of the AST, while Forth defining words operate on the compiler state.

trans commented 7 years ago

I see. But could it still have a similar feel? Is all the excess quoting syntax really necessary?

#('[1, 2, #(1 + 2)])

It seems Lispy and clunky IMO (as macro syntaxes often do).

Instead can whatever is left on the compile time stack within a given CTE block get spliced into the AST. So the above could simply be

#([1, 2, 1 + 2])

You would still need a way to quote function words though, as in your first example with dup. Perhaps you could put the quote mark on the backside so it would read like a prime symbol.

#(dup' 3 replicate)

evincarofautumn commented 7 years ago

I would still want to maintain a distinction between terms and their representations, to avoid too much magic, but that’s a good idea to avoid quoting/splicing hell. In the same vein, we could treat the foregoing terms as a stack, so for instance, (f call) #dup would give (f call) (f call), and you would invoke a typed printf macro as just "…" #printf.

evincarofautumn / kitten

Compile-time Evaluation/Expressions (CTE) #141