Unexpected performance of coalton while running `(microbench1:run)`

coalton-lang / coalton

Coalton is an efficient, statically typed functional programming language that supercharges Common Lisp.

https://coalton-lang.github.io/

MIT License

1.12k stars 67 forks source link

Unexpected performance of coalton while running `(microbench1:run)` #985

Open digikar99 opened 1 year ago

digikar99 commented 1 year ago

Hello,

Something seems off regarding performance and inlining in the current master branch version of coalton, or may be I'm missing something.

I started SBCL with rlwrap sbcl --no-userinit --load quicklisp/setup.lisp:

* (asdf:load-system "coalton" :force t)
...
* (ql:quickload "small-coalton-programs")
To load "small-coalton-programs":
  Load 1 ASDF system:
    small-coalton-programs
; Loading "small-coalton-programs"
[package small-coalton-programs]..................
[package small-coalton-programs.primes-native]....
[package small-coalton-programs.primes-iterator]..
[package freecalc]................................
[package lisp-microbench1]........................
[package coalton-microbench1].....................
[package microbench1].
("small-coalton-programs")
* (microbench1:run)
Lisp took 37084 ms
Coalton took 25723 ms
Coalton was faster by 11361 ms (30.64%)
The hand-optimized function took 8020 ms
NIL

I expected coalton to take about the same time as the hand-optimized function, but it took significantly longer and seems closer to the lisp version than the hand-optimized version. (I also set the minimum and maximum clock speeds for my CPU at 0.8GHz.)

So, I checked the disassembly of the add-two-caller below, and it turns out to be not inlined:

(in-package :coalton-user)

(named-readtables:in-readtable coalton:coalton)

(coalton-toplevel
  (declare add-two (single-float -> single-float))
  (define (add-two x) (+ x 2.0)))

(cl:defun add-two-caller (x)
  (cl:declare (cl:optimize cl:speed (cl:safety 0) (cl:debug 0))
              (cl:type cl:single-float x))
  (add-two x))

(cl:disassemble 'add-two-caller)
; disassembly for ADD-TWO-CALLER
; Size: 13 bytes. Origin: #x54078956                          ; ADD-TWO-CALLER
; 56:       B902000000       MOV ECX, 2
; 5B:       FF7508           PUSH QWORD PTR [RBP+8]
; 5E:       E9DFCE36FC       JMP #x503E5842                   ; #<FDEFN ADD-TWO>

Am I missing something here - or is this a (known) bug?

stylewarning commented 1 year ago

The Lisp functions that Coalton makes aren't presently declared as inline.

Inlining in Coalton is a huge area of interest that we haven't tackled much at all, but it will ultimately be critical for high-speed, non-consing numerical code.

(Also, reminder about :coalton-release.)

stylewarning commented 1 year ago

Oh, and second, the performance of hand-optimized Lisp being better than Coalton is a known area of improvement. (I hesitate to call it a bug, because it's not a regression, and it still calculates the correct thing.)

digikar99 commented 1 year ago

Did you try building with :coalton-release on your *features*?

No, I hadn't - thanks for the pointer! Although, even with this, I don't see much performance improvement.

So, that leads me to two thoughts -

1. About the separation between development and release mode

Is there something that makes this separation necessary? Or is the merger something that is still to be thought and worked out?

Development Mode

Development mode allows most structures to be re-defined. This means:

Most types turn into CLOS classes.

Most types are not flattened or unwrapped in any way.

Certain optimizations that obscure debugging are disabled.

Release Mode

Release mode freezes most structures and optimizes them.

Most types are defined as frozen defstructs or flattened Lisp data types.

Many optimizations are performed.

If I understand the concern about type safety, I'm guessing that the frozen types during release mode aid type safety by ensuring that the types don't get redefined after their first definition. But on the other hand, this hinders the rapid prototyping capabilities of lisp which involve redefinitions upon redefinitions, which make the development mode a necessity too. Is there something else that makes the separation necessary?

But that said, type safety seems a bit independent to whether or not a function is available for inspection during run-time. Regarding the latter, it should be possible to (optimize debug) in both cases, so that the function calls (in coalton or in lisp) do not disappear from the call stack by inlining; or (optimize speed), so that the function calls may or may not remain on the call stack.

This should also be relevant to https://github.com/coalton-lang/coalton/discussions/978.

2. Inlining

The Lisp functions that Coalton makes aren't presently declared as inline.

Inlining in Coalton is a huge area of interest that we haven't tackled much at all, but it will ultimately be critical for high-speed, non-consing numerical code.

Yes, this would be a deal breaker for anyone who wants to provide coalton-based libraries to both coalton and non-coalton users :(. At the outset, I don't see anything inherent that might prevent inlining coalton code into lisp, so I might look into it in the upcoming weeks. I do suspect this is going to require compiler-macros - and perhaps cl-form-types too to optimize and inline the non-trivial cases, and thus the CLTL2 environment API. But okay, that's going too fast, and as a first step we do need to get inlining working in the trivial cases.

Oh, and second, the performance of hand-optimized Lisp being better than Coalton is a known area of improvement.

Hmm, interesting. Though I do expect it would be easier to optimize coalton than lisp. And perhaps the difference should vanish once the inlining gets fixed.

eliaslfox commented 1 year ago

Is there something that makes this separation necessary? Or is the merger something that is still to be thought and worked out?

The development/release mode divide isn't about type safety, it's about allowing both interactive development and performance. When you define a CLOS class, the accessors compile to something like:

(defclass point () 
  ((x :accessor point-x)
   (y :accessor point-y)))

(defmethod ((obj pointj))
  (slot-value point 'x))

Each field access pays the cost of a CLOS method lookup, and then the cost of slot-value. Structs on the other hand have a fixed layout which can make field access much faster.

CL-USER> (declare (optimize (speed 3) (safety 0)))
CL-USER> (defstruct point x y)

CL-USER> (disassemble 'point-x)
; disassembly for POINT-X
; Size: 20 bytes. Origin: #x700CFD6BF8                        ; POINT-X
; BF8:       4AD140F8         LDR R0, [R0, #13]

; BFC:       FB031AAA         MOV CSP, CFP
; C00:       5A7B40A9         LDP CFP, LR, [CFP]
; C04:       BF0300F1         CMP NULL, #0
; C08:       C0035FD6         RET

OTOH the fixed field layout of structs makes safely redefining them generally impossible. Compiled code that accuses fields needs to be recompiled if the location of a field changes.

I do suspect this is going to require compiler-macros - and perhaps cl-form-types too to optimize and inline the non-trivial cases

Toplevel Coalton functions compile to defuns. You can just inline them with (declare (inline function-name)).

eliaslfox commented 1 year ago

In this specific microbenchmark the problem appears that + and * for floats aren't being inlined in the Coalton version. We used to inline every directly called method but I removed it because it broke redefinition. Some of the base methods in the standard library should probably be explicitly marked inline.

digikar99 commented 1 year ago

The development/release mode divide isn't about type safety, it's about allowing both interactive development and performance.

Oh, I see. Or, erm, no, I don't.

I had mistook the structures and classes to be representing the compile time type information required for the compilation of coalton code. But on a careful reading, it seems that they also exist at runtime just like most things in the lisp world. So, your clarification helps!

The part I don't get then is this - and this might actually go on a tangent and so might deserve a separate issue/discussion: if coalton types are the same as lisp structures/classes at run time, and lisp structures (at least on SBCL) are non-zero cost abstractions, then does that mean coalton types are also non-zero cost abstractions? If so, have zero-cost data abstractions been in discussion in the context of coalton?

Regarding inlining,

Toplevel Coalton functions compile to defuns. You can just inline them with (declare (inline function-name)).

While this literally inlines them, it doesn't seem to actually inline them in a useful way. So, for example, in the below code -

(in-package :coalton-user)

(named-readtables:in-readtable coalton:coalton)

(cl:declaim (cl:inline add-two))

(coalton-toplevel  
  (monomorphize)
  (declare add-two (single-float -> single-float))
  (define (add-two x) (+ x 2.0)))

(cl:defun add-two-caller (x)
  (cl:declare (cl:optimize cl:speed (cl:safety 0) (cl:debug 0))
              (cl:type cl:single-float x))
  (add-two x))

The disassembly of add-two-caller (or even the add-two) actually includes a call to another function, and so the addition of two single floats is still a non-inline operation.

COALTON-USER> (disassemble 'add-two-caller)
; disassembly for ADD-TWO-CALLER
; Size: 31 bytes. Origin: #x5401A586                          ; ADD-TWO-CALLER
; 86:       4883EC10         SUB RSP, 16
; 8A:       488B3DC7FFFFFF   MOV RDI, [RIP-57]                ; 2.0
; 91:       B904000000       MOV ECX, 4
; 96:       48892C24         MOV [RSP], RBP
; 9A:       488BEC           MOV RBP, RSP
; 9D:       E820763CFC       CALL #x503E1BC2                  ; #<FDEFN COALTON-LIBRARY/MATH/NUM::|INSTANCE/NUM SINGLE-FLOAT-COALTON-LIBRARY/CLASSES:+|>
; A2:       C9               LEAVE
; A3:       F8               CLC
; A4:       C3               RET
COMMON-LISP:NIL

eliaslfox commented 1 year ago

The part I don't get then is this - and this might actually go on a tangent and so might deserve a separate issue/discussion: if coalton types are the same as lisp structures/classes at run time, and lisp structures (at least on SBCL) are non-zero cost abstractions, then does that mean coalton types are also non-zero cost abstractions?

Types don't need to have a runtime cost, but constructing values almost always will. All of the following have equivalent runtime cost:

(define-type (T :a) (T String))

(define x (the (T Unit) (T "x)))

(define y (the (T (List Integer)) (T "y")))

(define z (the (T (List (List Integer))) (T "z")))

If so, have zero-cost data abstractions been in discussion in the context of coalton?

How would this work?

While this literally inlines them, it doesn't seem to actually inline them in a useful way.

Inlining call sites recursively is a difficult problem. It's easy to write a compiler pass to do it, but figuring out heuristics for which functions should be inlined is hard. Recursive inlining would also break the redefinition in the REPL. Coalton's inlining is on par with common lisp's (inline when told to) but much worse than the state of the art (global heuristic based inlining w/ LTO.) I don't think it's an unsolvable problem, but there's a lot of details to work out.

stylewarning commented 1 year ago

Toplevel Coalton functions compile to defuns. You can just inline them with (declare (inline function-name)).

@eliaslfox A DEFUN can only be inlined if it was declared INLINE before the definition. The pattern to make a function inlineable is

(declaim (inline f))
(defun f ...)
(declaim (notinline f))

This tells CL to save inlining info for F so as to be explicitly inlined later.

digikar99 commented 1 year ago

Inlining call sites recursively is a difficult problem. It's easy to write a compiler pass to do it, but figuring out heuristics for which functions should be inlined is hard. Recursive inlining would also break the redefinition in the REPL. Coalton's inlining is on par with common lisp's (inline when told to) but much worse than the state of the art (global heuristic based inlining w/ LTO.) I don't think it's an unsolvable problem, but there's a lot of details to work out.

I'm not referring to the heuristics themselves. May be there's a better word than "inlining" to refer to what I have in mind: I am okay with adding certain declarations or anything, but the end result I'm looking for is that the disassemble of add-two-caller should contain the assembly instruction ADDSS when the code is compiled, using say SBCL, on an intel CPU. At the moment, the disassembly instead contains a call to COALTON-LIBRARY/MATH/NUM::|INSTANCE/NUM SINGLE-FLOAT-COALTON-LIBRARY/CLASSES:+ instead of a simple ADDSS instruction.

If so, have zero-cost data abstractions been in discussion in the context of coalton? How would this work?

I'm not sure how although I do have some ideas (EDIT: That idea is a dead-end.), but just to clarify - I should have written zero run-time cost data abstraction. For instance, when the following C code is compiled using gcc, then the disassembly of do_tf and do_raw are identical, no runtime cost is encoutered by using a twin_float structure/type:

struct twin_float{double a; double b;};

double do_tf(int n){
  double sum = 0;
  for(int i=0; i<n; i++){
    struct twin_float tf = {(double)i, (double)i};
    sum += tf.a;
    sum += tf.b;
  }
  return sum;
}

double do_raw(int n){
  double sum = 0;
  for(int i=0; i<n; i++){
    double a = (double)i;
    double b = (double)i;
    sum += a;
    sum += b;
  }
  return sum;
}

However, when the following code is compiled using SBCL 2.3.8 on a 64-bit intel processor, the disassembly of do-tf and do-raw are actually different - the code for do-tf actually contains instructions to create a twin-float value. And in this particular case, the runtime performance impact is non-existent to trivial at best, I still wonder if there are existing lisp/coalton efforts to address this.

(declaim (inline make-twin-float))
(defstruct (twin-float (:constructor make-twin-float (a b)))
  (a 0.0f0 :type double-float)
  (b 0.0f0 :type double-float))

(defun do-tf (n)
  (declare (optimize speed)
           (type double-float n))
  (let ((sum 0.0d0))
    (declare (type double-float sum))
    (loop :for i :from 0.0d0 :below n :by 1.0d0
          :do (let ((tf (make-twin-float i i)))
                (declare (dynamic-extent tf))
                (incf sum (twin-float-a tf))
                (incf sum (twin-float-b tf))))
    sum))

(defun do-raw (n)
  (declare (optimize speed)
           (type double-float n))
  (let ((sum 0.0d0))
    (declare (type double-float sum))
    (loop :for i :from 0.0d0 :below n :by 1.0d0
          :do (let ((a i)
                    (b i))
                (incf sum a)
                (incf sum b)))
    sum))

As I understand, it is the following assembly code in do-tf that indicates the allocation of twin-float :

; 4E:       C70039080000     MOV DWORD PTR [RAX], 2105
; 54:       C74004039B2D50   MOV DWORD PTR [RAX+4], 1345166083

digikar99 commented 1 year ago

As I understand, the issue above is that + itself gets inlined, but its instances do not get inlined.

I am looking at the coalton code, and I haven't yet been able to understand how a toplevel define-instance form gets macroexpanded. A proposal was this: If a function-name has been declaimed inline previously, then the corresponding function instances created by define-instance also get declaimed inline. Later on, the function-name may be declaimed notinline to prevent inlining by default.

This way, whenever a user writes (declare (inline function-name)), then, not only is function-name inlined, but its appropriate instance is inlined too. In the case of + above, the instance is COALTON-LIBRARY/MATH/NUM::|INSTANCE/NUM SINGLE-FLOAT-COALTON-LIBRARY/CLASSES:+|.

swapneils commented 9 months ago

Since coalton is our own interface layer and (I believe) we already walk the code when converting it to CL, can we track toplevel functions as objects and make a calling graph of function invocations, so that we can inline all coalton functions and just recompile already-compiled callers when the callee is redefined?

It's more orchestration overhead, but that should be regarding compilation time, not runtime.

eliaslfox commented 8 months ago

Yeah you could do that. If you changed the type signature of a function then you would need to re-typecheck all call sites.