Gabriella439 / Haskell-Morte-Library

A bare-bones calculus-of-constructions
BSD 3-Clause "New" or "Revised" License

High-efficiency primitive operations in `morte` #49

Closed · Gabriella439 closed 8 years ago

Gabriella439 commented 8 years ago

Alright, so I finally have a concrete roadmap for how to handle high-efficiency operations in morte.

tl;dr: The main change I'm proposing is to extend morte's syntax tree to store literals of primitive types like Natural and Text.

First, I'll explain how primitive support will be implemented and then I'll explain the other alternatives I considered and why I rejected them.

The goal

I plan to provide a nordom command-line tool and library that is almost identical to the morte command-line compiler. nordom the command-line tool will be just like the morte command-line tool, plus built-in support for a large standard library of high-efficiency operations. nordom the library will provide a way to efficiently convert between morte values and their equivalent Haskell values, so that morte can interoperate with Haskell without a separate compile step.

Keeping with the Planescape: Torment references, this tool will be named nordom because it is "close to the machine" (and Nordom was a character from the plane of Mechanus, albeit "gone rogue" and not truly aligned with the plane).

I've already written a prototype of this nordom tool with a small standard library and I'm now in the process of cleaning it up, extending the standard library, and making the minimal changes to morte necessary to support the tool.

For example, using this tool you will be able to write:

$ ./nordom
Text/replicate (Natural/(*) 10 10) (Text/(++) "abc" "def")
"abcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdefabcdef"

... and it will run efficiently. On my machine the above program takes less than 4 milliseconds end-to-end, and most of that is just the overhead of starting up the Haskell runtime.

Changes to morte

I've identified the minimal changes needed in morte to support high-efficiency foreign operations: adding the Natural and Text primitive types, along with their literals, to morte's syntax tree.

Those would be the only two changes to morte necessary to implement nordom. Note that we will not be adding primitive functions to morte's syntax tree.

There might eventually be more than two primitive types in morte's syntax tree. For example, Haskell's Vector/ByteString/Integer/Scientific/Char/Word8 types are all strong contenders for primitive types worth supporting.
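
To make this concrete, here is a rough sketch of what the extended syntax tree might look like. The literal constructor names (NaturalLit, TextLit) are taken from the interpret excerpt below; the surrounding constructors are simplified stand-ins for morte's real Expr, so treat this as an illustration rather than the actual patch:

import Data.Text.Lazy  (Text)
import Numeric.Natural (Natural)

-- Simplified stand-in for morte's syntax tree; the real `Expr` is
-- parameterized and has more constructors (Const, Embed, ...)
data Var = V Text Int

data Expr
    = Var Var
    | Lam Text Expr Expr
    | Pi  Text Expr Expr
    | App Expr Expr
    -- Proposed additions: primitive types and their literals stored
    -- directly in the tree
    | Natural              -- the type of natural numbers
    | NaturalLit Natural   -- e.g. the literal `100`
    | Text                 -- the type of text
    | TextLit Text         -- e.g. the literal `"abcdef"`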

How nordom will work

Probably the easiest way to show how nordom is designed to work is to paste some code excerpts from my nordom prototype.

First, nordom creates a Context that has the names and types of all supported operations:

ffi :: Context (Expr a)
ffi =
        Context.empty
    &   Context.insert "Text/(++)"
            (Pi "_" Text (Pi "_" Text Text))
    &   Context.insert "Text/replicate"
            (Pi "_" Natural (Pi "_" Text Text))
    &   Context.insert "Text/length"
            (Pi "_" Text Natural)
    &   Context.insert "Text/drop"
            (Pi "_" Natural (Pi "_" Text Text))
    ...

This context is used to type-check all nordom code.

Then nordom defines an interpret function which is identical to morte's normalize function except with additional cases for each supported function:

interpret :: Expr X -> Expr X
interpret e = case e of
    -- This part is identical to `Morte.Core.normalize`
    Lam x _A b -> case b' of
        App f a -> case a of
            Var v' | v == v' && not (v `freeIn` f) ->
                shift (-1) x f  -- Eta reduce
                   | otherwise                     ->
                e'
              where
                v = V x 0
            _                                      -> e'
        _       -> e'
      where
        b' = interpret b
        e' = Lam x (interpret _A) b'
    Pi  x _A _B -> Pi x (interpret _A) (interpret _B)
    App f a     -> case f' of
        Lam x _A b -> interpret b'  -- Beta reduce
          where
            a'' = shift 1 x a'
            b' = shift (-1) x (subst x 0 a'' b)

        -- This part is where we interpret foreign functions
        Var (V "Text/length" 0) -> case a' of
            TextLit t -> NaturalLit (fromIntegral (Text.length t))
            _ -> App f' a'
        Var (V "Natural/show" 0) -> case a' of
            NaturalLit n ->
                ( TextLit
                . Builder.toLazyText
                . Builder.decimal
                ) (fromIntegral n :: Integer)
            _ -> App f' a'
        App g b -> case g' of
            Var (V "Text/(++)" 0) -> case b' of
                TextLit tL -> case a' of
                    TextLit tR -> TextLit (tL <> tR)
                    _          -> App f' a'
                _ -> App f' a'
            Var (V "Text/replicate" 0) -> case b' of
                NaturalLit n -> case a' of
                    TextLit t -> TextLit (Text.replicate (fromIntegral n) t)
                    _         -> App f' a'
                _ -> App f' a'
        ...

... and all nordom does is type-check code, run it through the interpreter, and then pretty-print it.

Alternative 1 - Encode everything in lambda calculus

I pursued this alternative for the longest time and played a lot of tricks with encodings but at the end of the day it's not realistic. Text/ByteString were the worst-behaved of all the primitive types to encode because they were expensive to type-check, normalize, and store in memory.

The benchmark I used to judge how realistic this approach would be was how fast I could concatenate two strings of size N. I could never go faster than 1000 characters per second using pure lambda calculus, which is less than a terminal window's worth of text.
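
For concreteness, here is a small Haskell rendering of the kind of encoding this alternative forces on byte strings (illustration only, not code from the issue): every byte costs a chain of bit constructors, and concatenation is composition of folds, so normalizing real text this way does a large amount of work per character.

{-# LANGUAGE RankNTypes #-}

-- A Boehm-Berarducci-style bit string: the string is its own fold over
-- `bit0`, `bit1`, and `end`.  Eight constructor applications per byte is
-- what makes encoded Text/ByteString expensive to normalize and store.
newtype Bits = Bits (forall r. (r -> r) -> (r -> r) -> r -> r)

end :: Bits
end = Bits (\_ _ e -> e)

bit0, bit1 :: Bits -> Bits
bit0 (Bits k) = Bits (\b0 b1 e -> b0 (k b0 b1 e))
bit1 (Bits k) = Bits (\b0 b1 e -> b1 (k b0 b1 e))

-- Concatenation is just composition of the two folds
cat :: Bits -> Bits -> Bits
cat (Bits x) (Bits y) = Bits (\b0 b1 e -> x b0 b1 (y b0 b1 e))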

Alternative 2 - Transmit primitive literals out-of-band

One approach I considered was to provide a separate data payload containing all primitive literals. I wasn't very happy with this approach because I wanted to avoid out-of-band information in general. morte was partly a reaction to compilers with very complex architectures and distribution models so I preferred to keep morte expressions as atomic and self-contained as possible.

The more moving parts morte code has the more difficult it is to embed morte in unusual places. For example, I would like it to be easy to embed morte expressions as fields in configuration files or query parameters in API requests and keeping literals compact and in-band greatly simplifies the distribution model.

Alternative 3 - Explicitly introduce foreign operations via lambda abstraction

... as opposed to providing them via a global default context.

There are a couple of ways to do this, but the front-runner was representing foreign operations as a syntax tree embedded in lambda calculus using the Boehm-Berarducci encoding.

The thing that killed the viability of Boehm-Berarducci encoding for this purpose was the inability to model polymorphic functions (like map) in the embedded foreign language.

A secondary issue was that the performance of this encoding degrades for either large syntax trees or a large number of foreign functions. Explicitly passing all the placeholders for foreign functions around is expensive, for almost exactly the same reason that encoding string operations in lambda calculus is expensive. If you squint, the Boehm-Berarducci encoding of a syntax tree with a large number of foreign functions is very similar to the encoding of a string with a large number of possible characters.
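
To illustrate that analogy (hypothetical Haskell, not code from the issue): in a Boehm-Berarducci-encoded foreign-call tree, every supported foreign operation becomes one more handler that every producer has to thread through and every consumer has to supply, much like every extra "character" in the bit-string encoding above.

{-# LANGUAGE RankNTypes #-}

import Data.Text.Lazy (Text)

-- A tiny foreign-call tree over Text; each foreign operation adds another
-- handler argument, and building or folding the tree touches all of them
-- at every node.
newtype FFI = FFI
    (forall r.
        (Text -> r)       -- handler for text literals
     -> (r -> r -> r)     -- handler for Text/(++)
     -> (Int -> r -> r)   -- handler for Text/replicate
     -> r)

lit :: Text -> FFI
lit t = FFI (\l _ _ -> l t)

append :: FFI -> FFI -> FFI
append (FFI x) (FFI y) = FFI (\l a p -> a (x l a p) (y l a p))

replicateT :: Int -> FFI -> FFI
replicateT n (FFI x) = FFI (\l a p -> p n (x l a p))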

Passing around a global context is much easier and more efficient by comparison and requires a very mild extension to the morte API: we just generalize the type of the load function. This also generates much cleaner-looking raw morte code since you no longer need to introduce foreign functions via abstraction.

Alternative 4 - Add foreign functions to the syntax tree, too

I didn't do this because there was no point. There is no efficiency benefit to adding them to the syntax tree, and it needlessly increases the surface area of the morte API. The more complex the morte implementation, the more difficult it is to fork or port to other languages and platforms.

It's also actively harmful to include foreign functions in the syntax tree because we don't want to standardize the set of foreign functions. We want to be able to support multiple backends, and each backend may want to support a different set of foreign functions.

Discussion

I'll give a two-week discussion period for this in case anybody objects to the above changes or wishes to propose an alternative approach (or revive an approach that I rejected).

For people who approve of this approach, it's also worth discussing which primitive types should and should not be supported. My only guideline is that we should not add types to the syntax tree if they can be reasonably encoded in lambda calculus (like Maybe). I picked ByteString/Scientific/Char/Word8/Vector as candidates for primitive support because they are expensive to encode otherwise.

VictorTaelin commented 8 years ago

OK, rewriting that: I'm not exactly against this approach to solving the problem, but I'm somewhat against the need to solve it. Just out of curiosity, why do you want Morte to produce efficient executables today? Lambda-calculus structures are only a constant factor (3~100x) away from Haskell's performance anyway.

I've always seen Morte as that project where we would just forget about immediate real-world efficiency and write the algorithms in our platonic language, in the most "natural" way. Higher performance would be achieved by other means, such as different computer architectures and better runtimes... which would, in turn, be possible because of the very simplicity of the core language.

Gabriella439 commented 8 years ago

Do you have benchmarks for Caramel? I'm mostly interested in the cost of concatenating and normalizing strings of bytes because that's the one data type and operation I could not get good constant factors on no matter what encoding I used. The benchmark I've been using is concatenating two 1 kB byte strings and then normalizing the result.
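
For reference, a native-Haskell baseline for that benchmark could look something like the criterion sketch below (illustrative only, not taken from the issue; the lambda-calculus encodings under discussion are what get compared against this constant factor):

import           Criterion.Main
import qualified Data.ByteString as BS

-- Concatenate two 1 kB byte strings natively; any encoding of the same
-- operation in pure lambda calculus can be measured against this.
main :: IO ()
main = do
    let a = BS.replicate 1000 0x61  -- 1 kB of 'a'
        b = BS.replicate 1000 0x62  -- 1 kB of 'b'
    defaultMain
        [ bench "append 1kB <> 1kB" (nf (uncurry BS.append) (a, b)) ]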

VictorTaelin commented 8 years ago

Give me a second.

VictorTaelin commented 8 years ago

@Gabriel439 I'm not sure I understand what benchmark you mean. Just concatenating two strings with 1000 bytes? I've made this type for bytestrings:

-- Creates a byte string with 1000 identical bytes
n_byte_string n = sample_list
    sample_list = (replicate n sample_byte)
    sample_byte = (B0 (B1 (B0 (B0 (B1 (B1 (B0 (B0 BEnd))))))))
    B0   = (x a b c -> (a (x a b c)))
    B1   = (x a b c -> (b (x a b c)))
    BEnd = (a b c -> c)

I also used concat:

concat a b cons nil = (a cons (b cons nil))

So, this program:

main = (concat (n_byte_string 1000) (n_byte_string 1000))

Runs instantly on the Caramel interpreter. To benchmark it in more detail, I used this Optlam script:

optlam = require("./optlam.js");
lambda = require("./lambda_calculus.js");
var A  = lambda.App, L  = lambda.Lam, V  = lambda.Var, n  = lambda.nat, n_ = lambda.nat_;
function n(x){ var t = V(0); for (var i=0; i<x; ++i) t = A(V(1),t); return L(L(t)); };

var n_byte_string   = L(L(L(A(A(V(2),A(V(1),L(L(L(A(V(2),A(V(1),A(V(2),A(V(2),A(V(1),A(V(1),A(V(2),A(V(2),V(0)))))))))))))),V(0)))));
var concat          = L(L(L(L(A(A(V(3),V(1)),A(A(V(2),V(1)),V(0)))))));
var n1000           = A(n(3), n(10));
var byte_string_1kb = A(n_byte_string, n1000);
var main            = A(A(concat, byte_string_1kb), byte_string_1kb);
var norm            = optlam.reduce(main);

console.log(lambda.pretty(norm));
console.log(JSON.stringify(optlam.stats));

And I got this result:

{"iterations":1504,"applications":224,"betas":86,"used_memory":1818}

224 applications means the normal form was reached with merely 224 graph rewrites. That is really low. I guess I understood your benchmark wrong?

Edit: in fact, even concatenating two googolbyte strings to normal form takes only 10k applications and runs instantly, lol. See this program.

int-index commented 8 years ago

Morte's minimalistic syntax tree is one of its key advantages. I am against cluttering it with non-extensible primitive types.

Did you consider adding them via Embed?

data Lit
    = TextLit Text
    | Text  -- The type of text literals
    | NaturalLit Natural
    | Natural  -- The type of natural numbers

nordom :: Expr Lit -> Expr Lit
nordom = ... -- This part is where we interpret foreign functions

interpret = normalize . nordom

You can have both Lit and Path using Either.

Gabriella439 commented 8 years ago

@MaiaVictor The main reason for the performance speedup there is that the graph rewriting algorithm is exploiting sharing (which is still neat!). However, we can't reasonably exploit sharing in real-world strings that we concatenate. I would like to see concatenation of two random 1 kB strings.

@int-index Yes, using Embed is also an option. That's a nice idea!

Gabriella439 commented 8 years ago

Also, I'm not completely sure that optlam.js is a correct implementation of Lamping's algorithm (or the lambda calculus). It's missing the control nodes.

VictorTaelin commented 8 years ago

Concatenating two completely random 1 kB strings (see it raw) actually gives better results:

{"iterations":213742,"applications":18,"betas":18,"used_memory":414666}

Just 18 graph rewrites and you get the normal form. Yes, Optlam implements only the abstract part of Lamping's algorithm. I say that is enough for anything that matters, but let's not get into that. You'll get similar performance on Anton's actually-correct implementations, and Haskell itself did it instantly too. Even compiling to JavaScript I got the normal form instantly, actually. Concatenating in general is really lightweight. In what scenario did it perform badly, exactly?

Gabriella439 commented 8 years ago

I think the number of beta-reductions is a misleading metric. The time the code takes to run is more important.

When I run the code it takes approximately 250 ms. I benchmarked your program and also benchmarked an empty program so I could subtract the overhead of running node:

$ git clone https://github.com/MaiaVictor/optlam.git
$ cd optlam
$ # create your example concat.js program
$ bench 'node concat.js'  # `bench` is from here: https://github.com/Gabriel439/bench
benchmarking node concat.js
time                 289.1 ms   (273.0 ms .. 307.9 ms)
                     0.996 R²   (0.978 R² .. 1.000 R²)
mean                 274.1 ms   (250.6 ms .. 283.4 ms)
std dev              16.67 ms   (2.112 ms .. 21.39 ms)
variance introduced by outliers: 17% (moderately inflated)

$ touch empty.js
$ bench 'node empty.js'
benchmarking node empty.js
time                 34.85 ms   (33.01 ms .. 36.87 ms)
                     0.978 R²   (0.938 R² .. 0.997 R²)
mean                 34.18 ms   (33.20 ms .. 35.89 ms)
std dev              2.669 ms   (1.456 ms .. 4.201 ms)
variance introduced by outliers: 30% (moderately inflated)

250 ms is about 10,000x-100,000x slower than using primitive operations from Haskell for this purpose (my original estimate of 10,000,000x-100,000,000x was wrong; I forgot to scale by the number of characters).

VictorTaelin commented 8 years ago

Counting beta reductions would be misleading, but I'm not using that; I'm counting all graph rewrite rules. The time the algorithm takes to run is proportional to the total number of graph rewrites and nothing else. Optlam is optimal but terribly inefficient, so there is no point in measuring its run time. It is a JavaScript interpreter that runs through the graph sequentially looking for active pairs, so it spends most of its time searching for active pairs by walking through the nodes. That is entirely unnecessary: you could just keep a pointer to them when you allocate the memory. Moreover, the readback to λ-calculus is almost always the bottleneck (as it needs to undo all the sharing that takes place!). A specialized interaction net machine wouldn't need such a readback.

But this isn't about Optlam at all. Let's just compile to Haskell and do the benchmarks. How does that concat program compare to the same thing using native data structures? That's what we should be measuring. I'm not able to replicate it being much slower than native data structures. I'd like to see a benchmark showing the opposite, though, so I could investigate it further.

Gabriella439 commented 8 years ago

So the conclusion I came to was that it's better to keep morte as is. For high-efficiency primitives, I think it's wiser to fork off a separate language and compiler using morte as a base template than to reuse morte directly. I'll explain my reasoning.

I spent a lot of time investigating the possibility of extending morte's syntax tree via the Embed constructor. The issue that you run into is that you can't easily model foreign types and terms that are polymorphic when you Embed them this way. For example, there is no way to Embed a Vector type constructor this way. You would have to change Embed to this shape:

data Expr f = ... | Embed (Expr (f (Expr f)))

... which I would like to avoid.
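
To illustrate the problem (hypothetical types, not morte's API): with the current shape `data Expr a = ... | Embed a`, the embedded payload is chosen once and cannot mention Expr itself, yet an applied foreign type constructor like Vector needs to wrap an arbitrary morte expression. The payload would therefore have to be parameterized by the expression type, which is exactly what the knot-tying `Embed (Expr (f (Expr f)))` shape above expresses:

-- Hypothetical payload for a foreign Vector type; `expr` has to be
-- instantiated to the expression type itself, hence the changed Embed shape.
data ForeignTy expr
    = VectorTy  expr          -- `Vector e`, where `e` is any expression
    | VectorLit expr [expr]   -- a literal vector: element type plus elements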

Also, even if you don't want to model a polymorphic FFI and you keep Embed the way it is, you still have to generalize every morte operation (i.e. parsing/normalization/type-checking/imports) to handle user extensions. I went through the exercise of generalizing morte like this and it greatly complicates the API and ruins morte's goal of being a clean and simple lambda calculus implementation.

I also agree with @int-index's sentiment that there are too many primitive types worth supporting to warrant extending morte's syntax tree.

So I'll keep morte true to the original spirit of being a minimalist lambda calculus, but I still plan on spinning off nordom as a separate compiler that forks morte's logic, because I have a specific application in mind for nordom that requires high-efficiency primitives. Also, all the derived compilers that I have in mind will be supersets of morte's syntax, so any code that doesn't use foreign operations can still be shared with pure morte code.

eikeon commented 8 years ago

It seems like a Combinatory Categorial Grammar (CCG) would be a good fit with morte / nordom, and in particular it might turn this issue into a matter of which terms are included in the lexicon. @Gabriel439 and all, thoughts?

http://yoavartzi.com/tutorial/ https://github.com/vlastachu/haskell-ccg

VictorTaelin commented 8 years ago

I'm so thankful for that choice. I honestly hope we will eventually figure out natural ways to evaluate λ-calculus programs, exploiting its informational simplicity in a way that we don't yet. Having those information-dense primitives might hinder our ability to fully exploit them.