Including a Gambit Scheme backend

abdelq commented 4 years ago

Gambit is a mature Scheme compiler and one of the fastest.

To give an idea, this is the time on average (after a few runs) for the execution of tests/chez/chez015/Numbers.idr (biased in favor of Chez).

~/W/Idris2 $ time ./idris2 --cg chez tests/chez/chez015/Numbers.idr -x main
[3518437212327232157160858338944, 1537557061787000452679295093927559, 3518437212327232157160858338070, 8051343735302590748651849744, 379]
[8650625671965379659, 5435549321212129090, 8650625671965378905, 365458446121836181, 357]

________________________________________________________
Executed in  464.05 millis    fish           external
   usr time  402.81 millis    0.00 millis  402.81 millis
   sys time   58.61 millis    3.25 millis   55.36 millis

~/W/Idris2 $ time ./idris2 --cg gambit tests/chez/chez015/Numbers.idr -x main
[3518437212327232157160858338944, 1537557061787000452679295093927559, 3518437212327232157160858338070, 8051343735302590748651849744, 379]
[8650625671965379659, 5435549321212129090, 8650625671965378905, 365458446121836181, 357]

________________________________________________________
Executed in  282.26 millis    fish           external
   usr time  237.35 millis    0.03 millis  237.33 millis
   sys time   43.03 millis    3.61 millis   39.43 millis

Another test with compilation involved (slower compilation time on Gambit):

~/W/Idris2 $ ./idris2 --cg chez tests/chez/chez007/TypeCase.idr -o chez-typecase
~/W/Idris2 $ ./idris2 --cg gambit tests/chez/chez007/TypeCase.idr -o gambit-typecase

~/W/I/b/exec $ time ./chez-typecase
"Function from Nat to Nat"
"Function from Nat to Vector of 0 Int"
"Function on Type"

________________________________________________________
Executed in  182.32 millis    fish           external
   usr time  151.08 millis    0.00 millis  151.08 millis
   sys time   30.13 millis    3.50 millis   26.63 millis

~/W/I/b/exec $ time ./gambit-typecase
"Function from Nat to Nat"
"Function from Nat to Vector of 0 Int"
"Function on Type"

________________________________________________________
Executed in   12.82 millis    fish           external
   usr time    3.11 millis    0.00 millis    3.11 millis
   sys time    8.81 millis    2.63 millis    6.18 millis

Gambit also has many backends, like JavaScript, which may be of interest for #276 (at least for early prototyping).

This is still a work in progress. I plan for parity with Chez.

edwinb commented 4 years ago

Interesting!

I'd be cautious of just the small benchmarks, but I'll be very interested to try this out on bigger samples. It'll also be interesting to see how compilation time compares - that gets quite important on programs the size of Idris, for example (the racket back end falls over on my self-hosting experiment, and chez takes about 30s, which is still a bit too much really).

I'm slightly wary of having to maintain lots of backends, even if they're all more or less the same, so fairly soon I'll try to make more progress on allowing pluggable back ends. But if this one turns out to perform better than chez, maybe it would be a better default.

(I should be clear that scheme as the default remains a temporary solution here, though!)

awson commented 4 years ago

self-hosting experiment

How far is Idris2 from it? Is there any extra information on self-hosting?

edwinb commented 4 years ago

@awson Idris 2 won't be self hosting until there's been enough work on compilation that it runs about as fast as the current version. I think that's not so much about the code generator, but rather what the code generator is given (Idris 1 does much more aggressive inlining and specialisation), although a good code generator will clearly help.

There's a few other bits and pieces necessary - for the experiment I had to take some shortcuts due to missing library functions, for example, mostly OS services. Possibly the most useful thing anyone can do to help is to contribute more of them, in fact. At least I now know that Idris 2 basically works well enough to build itself and run the result successfully, so that's a huge step forward.

I'm not going to make any promises or predictions on how long it'll take to get all the bits together, though, since other priorities come up, and also I want to use Idris 2 for other things than building itself for a bit!

edwinb commented 4 years ago

I've been playing around with this. There were a couple of fiddly things to sort out which I guess is why it's still a draft, but it does basically work. On performance, I've just tried this little example in Chez and this Gambit back end:

triples : Int -> List (Int, Int, Int)
triples top = [(x, y, z) | z <- [1..top], y <- [1..z], x <- [1..y],
                           x * x + y * y == z * z]

main : IO ()
main = do putStrLn "We're off"
          putStrLn (show (triples 200))

It's an interesting one because it runs for long enough that the startup time of Chez is less relevant, and because it has lots of closures, interfaces, and other common features. I get, for Chez:

real    0m0.246s
user    0m0.237s
sys 0m0.010s

and for Gambit:

real    0m0.846s
user    0m0.838s
sys 0m0.009s

I don't know why that's such a big difference, given the posted benchmarks. But then, the benchmarks refer to gambit/gerbil rather than gambit, and I wonder if that's relevant.

Even if it does turn out to be much slower for the Idris 2 use case (and it's more than it looks because of Chez's startup cost!) there are still two big advantages to supporting Gambit as well:

There is no startup cost (that's why I put the "We're off" in :).
In interpreter mode, it really is an interpreter, so no additional compilation cost, meaning you can try programs sooner and massively reduce the overhead in the edit/test/execute

So I think if this can do enough that it passes the same tests as the Chez generator, we should have it available.

abdelq commented 4 years ago

@edwinb I got somewhat similar results with your example (468.9 ms ± 6.0 ms with Chez, 676.0 ms ± 13.1 ms with Gambit). This is a bit surprising, and I will further look into it. Since it's generated code with a lot of nested closures and vector manipulation in the mix, it's not as easy to diagnose.

The benchmarks refer to gambit/gerbil, because Gerbil is built on top of Gambit, hence the performance is almost the same for both compilers.

It passes all non-FFI tests just fine. While trying tests, I noticed that in particular the array output of chez004 seems dubious in the case of Chez. Notice that setting the integer 94 on index 1 puts in on the second element of the byte vector, which doesn't make much sense (should be made into an issue?).

~/W/I/t/c/chez004 $ ../../../idris2 --cg chez Buffer.idr -x main                                                                                                                          (gambit)
100
94
94.42
"Hello"
"there!"
[0, 94, 0, 0, 0, 65, 65, 65, 65, 0, 123, 20, 174, 71, 225, 154, 87, 64, 0, 0, 72, 101, 108, 108, 111, 32, 116, 104, 101, 114, 101, 33, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 94, 0, 0, 0, 65, 65, 65, 65, 0, 123, 20, 174, 71, 225, 154, 87, 64, 0, 0, 72, 101, 108, 108, 111, 32, 116, 104, 101, 114, 101, 33, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

~/W/I/t/c/chez004 $ ../../../idris2 --cg gambit Buffer.idr -x main
100
1094795614
94.42
"Hello"
"there!"
[0, 0, 0, 0, 94, 65, 65, 65, 65, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 72, 101, 108, 108, 111, 32, 116, 104, 101, 114, 101, 33, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 123, 20, 174, 71, 225, 154, 87, 64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 94, 65, 65, 65, 65, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 72, 101, 108, 108, 111, 32, 116, 104, 101, 114, 101, 33, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 123, 20, 174, 71, 225, 154, 87, 64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

As for FFI tests, I was able to run chez013 by doing the following:

Naming the structs


-typedef struct {
+typedef struct point {

-typedef struct { +typedef struct namedpoint {


2. Manually adding a `(c-declare "#include \"struct.h\"")` to the generated Scheme file.

3. Compiling with the following command (with `libstruct.so` compiled beforehand):
`gsc-script -exe -ld-options "-L. -Wl,-rpath,. -lstruct"`

I've been able to run `chez010` too, but *with a hack*:
1. Manually modified generated code to look like this (`c-define` is top-level only):
```scheme
(c-define (int-fn cb0 cb1) (int int) int "intFn" "" (int-fn-func cb0 cb1))
(c-define (char-fn cb0 cb1) (char int) char "charFn" "" (char-fn-func cb0 cb1))
(c-define (int-fn-pure cb0 cb1) (int int) int "intFnPure" "" (int-fn-pure-func cb0 cb1))

(define Main-prim_applyIntFn
  (lambda (farg-0 farg-1 farg-2 farg-3)
    (set! int-fn-func (lambda (cb0 cb1) (((farg-2 cb0) cb1) #f)))
    ((c-lambda (int int (function (int int) int)) int "applyIntFn") farg-0 farg-1 int-fn)))
(define Main-prim_applyCharFn
  (lambda (farg-0 farg-1 farg-2 farg-3)
    (set! char-fn-func (lambda (cb0 cb1) (((farg-2 cb0) cb1) #f)))
    ((c-lambda (char int (function (char int) char)) char "applyCharFn") farg-0 farg-1 char-fn)))
(define Main-applyIntFnPure
  (lambda (farg-0 farg-1 farg-2)
    (set! int-fn-pure-func (lambda (cb0 cb1) ((farg-2 cb0) cb1)))
    ((c-lambda (int int (function (int int) int)) int "applyIntFnPure") farg-0 farg-1 int-fn-pure)))

Compiling with the following command (with libcb.so compiled beforehand): gsc-script -exe -ld-options "-L. -Wl,-rpath,. -lcb"

I spent a sizeable amount of time to generate the proper ld-options, c-define etc. without much success. If someone more proficient with Idris can step in to solve the issues above, I would very much appreciate it.

I will be very soon putting this up for review and merge as is, since it's pretty much on the same level as the Racket backend. If I have more time on my hands, I will get back and tackle the aforementioned issues.

edwinb commented 4 years ago

Great, nice to see this is making progress! On the performance thing, it's hard to draw any conclusions from just one test, but I like the triples one exactly because it is a bit of a challenge. I have one theory as to what might be going on (but I will test it before I talk rubbish here :)).

On the buffer thing, 94 is in the right place. I'd expect it at the second position because the buffer is zero based. I have some confidence that they work fine because I'm using them extensively while experimenting with self hosting Idris 2.

edwinb commented 4 years ago

I did some hand-inlining of the various bits of the triples example and found performance much closer to chez (though still a touch slower). That's something that Idris will do a bit more of itself in the not too distant future. I wonder if that suggests chez is better at dealing with lambdas? In any case, I think that gives a bit more information about how the performance will compare in general.

feeley commented 4 years ago

I did some hand-inlining of the various bits of the triples example and found performance much closer to chez (though still a touch slower). That's something that Idris will do a bit more of itself in the not too distant future. I wonder if that suggests chez is better at dealing with lambdas? In any case, I think that gives a bit more information about how the performance will compare in general.

Gambit uses various heuristics to decide if a particular function should be inlined. Perhaps the generated Scheme code is in a style that is not ideal for triggering the inlining. You can increase the level of inlining by adding a (declare (inlining-limit 500)) or a greater number, so that bigger functions will get inlined (500 means 500% code expansion). Also, it is useful to call the gsc compiler with the option -expansion to see how the inliner has transformed the code. You can also observe missed optimization opportunities that way (such as dynamic tests in the code). If in doubt send me the generated Scheme code and I will have a look.

edwinb commented 4 years ago

@feeley Thanks! I'd be astonished if any Scheme compiler spotted the inlinings I did by hand :). It's about removing layers of abstraction in interfaces and higher order functions, which Idris 1 knows how to do, but Idris 2 doesn't yet, though I am implementing it. The Scheme we generate is unconventional, to say the least, but once we have something up and running for Gambit it'd be interesting to know what you make of it, just in case there's things we could do to take better advantage of it.

feeley commented 4 years ago

Here are the suggestions I have so far...

It seems your use of the Scheme case expects it to be exhaustive. I expect the Idris 2 front-end to only have exhaustive cases. So when there is a single case you could generate the code

<body>

instead of

(case <expr> ((0) <body>))

and when there are N>=2 cases, then the last one should be an else.

I've had a factor of 2x speedup by replacing b+ by fx+ and I don't quite understand how integer arithmetic works in Idris 2. Are integers of a fixed 63 bits of precision? Would it be ok to use 62 bit integers (Gambit uses 2 bit low-end tagging for fixnums)? Currently this is the main performance bottleneck. Adding two integers is done with the very inefficient:

(define b+ (lambda (x y bits) (remainder (+ x y) (expt 2 bits))))

which computes 2^63 (a bignum) on every addition, and a (possibly bignum) addition of x and y, and a bignum remainder operation.

I also see a lot of currying and this causes a lot of allocation. Not sure if this is inherent in the way Idris 2 works, but clearly a possible source of inefficiency worth trying to remove.

edwinb commented 4 years ago

The reason for b+ was that chez threw an exception on overflow, but the intention is to wrap around (kind of ugly, but the run time intentionally doesn't have exceptions). I guess we could achieve the same behaviour with better performance with fx+ somehow though?

I'll try the other ideas - in particular, there are several things we can do to reduce the number of closures that I just haven't got around to doing yet, and that were very effective in Idris 1.

feeley commented 4 years ago

Another option would be to simply use (+ x y) instead of (b+ x y 63). This will be relatively fast when x and y are fixnums (and the result fits in a fixnum) and will automatically use the slower bignum operations otherwise. That way the result is actually mathematically correct and not truncated at 63 bits which seems arbitrary. With this change, on my machine, the triplets benchmark runs in 1.91 secs with Gambit and 1.95 secs with Chez Scheme (using a range of 500 instead of 200). I also tried with Gambit's JavaScript backend (gsc -exe -target js ...) and executed with nodejs and it took 30 secs (about 15x slower than compiling through C). Finally I tried the Python route (gsc -exe -target python ...) and it took 1062 secs (about 550x slower than compiling through C!). CPython is not exactly known for its speed!

feeley commented 4 years ago

Gambit detects fixnum overflow and divide by zero and raises an exception in those cases (unless you explicitly (declare (not safe)) which tells the compiler it can assume there are no errors in the code):

> (fx+ (expt 2 60) (expt 2 59))
1729382256910270464
> (fx+ (expt 2 60) (expt 2 60))
*** ERROR IN (stdin)@4.1 -- FIXNUM overflow
(fx+ 1152921504606846976 1152921504606846976)
1> (with-exception-catcher pretty-print (lambda () (fx+ (expt 2 60) (expt 2 60))
#<fixnum-overflow-exception #2>

If you actually want fixnum arithmetic to wrap, there are variants of the fixnum procedures that don't check for overflow:

> (fxwrap+ (expt 2 60) (expt 2 59))
1729382256910270464
> (fxwrap+ (expt 2 60) (expt 2 60))
-2305843009213693952

edwinb commented 4 years ago

Thanks for all the suggestions @feeley! I think the fxwrap versions are what we want for Int. But we do need to revisit what happens with the fixed precision integer types anyway.

Thanks for the contribution @abdelq! I'll merge this now, and I think it's probably time to drop Chicken (I don't want to have to maintain lots of similar but not quite identical targets after all).

edwinb commented 4 years ago

There are some small things that I have to hack on my local version to make this work, and I wonder if this is about the gambit scheme version (I'm on 4.8.8 from Ubuntu LTS):

gsc-script tries to run the script rather than just compiling it (the man page says this is what it'll do though)
It complains about #\null but #\nul is fine
#u8() doesn't work, but (u8vector) does
Some other things not working in support.scm mostly related to threads and system commands

Do I just need a newer gambit? I don't mind setting a version requirement since it's not the default back end, although supporting older versions might be nice if it doesn't limit what it can do.

feeley commented 4 years ago

There are simple fixes for all of these...

1) #\null -> #\nul or #\x0 2) #u8(...) -> '#u8(...)

I'll let @abdelq work out these glitches...

edwinb / Idris2-boot

Including a Gambit Scheme backend #317