Open digikar99 opened 1 year ago
The Lisp functions that Coalton makes aren't presently declared as inline.
Inlining in Coalton is a huge area of interest that we haven't tackled much at all, but it will ultimately be critical for high-speed, non-consing numerical code.
(Also, reminder about :coalton-release
.)
Oh, and second, the performance of hand-optimized Lisp being better than Coalton is a known area of improvement. (I hesitate to call it a bug, because it's not a regression, and it still calculates the correct thing.)
Did you try building with :coalton-release on your
*features*
?
No, I hadn't - thanks for the pointer! Although, even with this, I don't see much performance improvement.
So, that leads me to two thoughts -
Is there something that makes this separation necessary? Or is the merger something that is still to be thought and worked out?
Development Mode
Development mode allows most structures to be re-defined. This means:
- Most types turn into CLOS classes.
- Most types are not flattened or unwrapped in any way.
- Certain optimizations that obscure debugging are disabled.
Release Mode
Release mode freezes most structures and optimizes them.
- Most types are defined as frozen
defstruct
s or flattened Lisp data types.- Many optimizations are performed.
If I understand the concern about type safety, I'm guessing that the frozen types during release mode aid type safety by ensuring that the types don't get redefined after their first definition. But on the other hand, this hinders the rapid prototyping capabilities of lisp which involve redefinitions upon redefinitions, which make the development mode a necessity too. Is there something else that makes the separation necessary?
But that said, type safety seems a bit independent to whether or not a function is available for inspection during run-time. Regarding the latter, it should be possible to (optimize debug)
in both cases, so that the function calls (in coalton or in lisp) do not disappear from the call stack by inlining; or (optimize speed)
, so that the function calls may or may not remain on the call stack.
This should also be relevant to https://github.com/coalton-lang/coalton/discussions/978.
The Lisp functions that Coalton makes aren't presently declared as inline.
Inlining in Coalton is a huge area of interest that we haven't tackled much at all, but it will ultimately be critical for high-speed, non-consing numerical code.
Yes, this would be a deal breaker for anyone who wants to provide coalton-based libraries to both coalton and non-coalton users :(. At the outset, I don't see anything inherent that might prevent inlining coalton code into lisp, so I might look into it in the upcoming weeks. I do suspect this is going to require compiler-macros - and perhaps cl-form-types too to optimize and inline the non-trivial cases, and thus the CLTL2 environment API. But okay, that's going too fast, and as a first step we do need to get inlining working in the trivial cases.
Oh, and second, the performance of hand-optimized Lisp being better than Coalton is a known area of improvement.
Hmm, interesting. Though I do expect it would be easier to optimize coalton than lisp. And perhaps the difference should vanish once the inlining gets fixed.
Is there something that makes this separation necessary? Or is the merger something that is still to be thought and worked out?
The development/release mode divide isn't about type safety, it's about allowing both interactive development and performance. When you define a CLOS class, the accessors compile to something like:
(defclass point ()
((x :accessor point-x)
(y :accessor point-y)))
(defmethod ((obj pointj))
(slot-value point 'x))
Each field access pays the cost of a CLOS method lookup, and then the cost of slot-value. Structs on the other hand have a fixed layout which can make field access much faster.
CL-USER> (declare (optimize (speed 3) (safety 0)))
CL-USER> (defstruct point x y)
CL-USER> (disassemble 'point-x)
; disassembly for POINT-X
; Size: 20 bytes. Origin: #x700CFD6BF8 ; POINT-X
; BF8: 4AD140F8 LDR R0, [R0, #13]
; BFC: FB031AAA MOV CSP, CFP
; C00: 5A7B40A9 LDP CFP, LR, [CFP]
; C04: BF0300F1 CMP NULL, #0
; C08: C0035FD6 RET
OTOH the fixed field layout of structs makes safely redefining them generally impossible. Compiled code that accuses fields needs to be recompiled if the location of a field changes.
I do suspect this is going to require compiler-macros - and perhaps cl-form-types too to optimize and inline the non-trivial cases
Toplevel Coalton functions compile to defuns. You can just inline them with (declare (inline function-name))
.
In this specific microbenchmark the problem appears that + and * for floats aren't being inlined in the Coalton version. We used to inline every directly called method but I removed it because it broke redefinition. Some of the base methods in the standard library should probably be explicitly marked inline.
The development/release mode divide isn't about type safety, it's about allowing both interactive development and performance.
Oh, I see. Or, erm, no, I don't.
I had mistook the structures and classes to be representing the compile time type information required for the compilation of coalton code. But on a careful reading, it seems that they also exist at runtime just like most things in the lisp world. So, your clarification helps!
The part I don't get then is this - and this might actually go on a tangent and so might deserve a separate issue/discussion: if coalton types are the same as lisp structures/classes at run time, and lisp structures (at least on SBCL) are non-zero cost abstractions, then does that mean coalton types are also non-zero cost abstractions? If so, have zero-cost data abstractions been in discussion in the context of coalton?
Regarding inlining,
Toplevel Coalton functions compile to defuns. You can just inline them with
(declare (inline function-name))
.
While this literally inlines them, it doesn't seem to actually inline them in a useful way. So, for example, in the below code -
(in-package :coalton-user)
(named-readtables:in-readtable coalton:coalton)
(cl:declaim (cl:inline add-two))
(coalton-toplevel
(monomorphize)
(declare add-two (single-float -> single-float))
(define (add-two x) (+ x 2.0)))
(cl:defun add-two-caller (x)
(cl:declare (cl:optimize cl:speed (cl:safety 0) (cl:debug 0))
(cl:type cl:single-float x))
(add-two x))
The disassembly of add-two-caller
(or even the add-two
) actually includes a call to another function, and so the addition of two single floats is still a non-inline operation.
COALTON-USER> (disassemble 'add-two-caller)
; disassembly for ADD-TWO-CALLER
; Size: 31 bytes. Origin: #x5401A586 ; ADD-TWO-CALLER
; 86: 4883EC10 SUB RSP, 16
; 8A: 488B3DC7FFFFFF MOV RDI, [RIP-57] ; 2.0
; 91: B904000000 MOV ECX, 4
; 96: 48892C24 MOV [RSP], RBP
; 9A: 488BEC MOV RBP, RSP
; 9D: E820763CFC CALL #x503E1BC2 ; #<FDEFN COALTON-LIBRARY/MATH/NUM::|INSTANCE/NUM SINGLE-FLOAT-COALTON-LIBRARY/CLASSES:+|>
; A2: C9 LEAVE
; A3: F8 CLC
; A4: C3 RET
COMMON-LISP:NIL
The part I don't get then is this - and this might actually go on a tangent and so might deserve a separate issue/discussion: if coalton types are the same as lisp structures/classes at run time, and lisp structures (at least on SBCL) are non-zero cost abstractions, then does that mean coalton types are also non-zero cost abstractions?
Types don't need to have a runtime cost, but constructing values almost always will. All of the following have equivalent runtime cost:
(define-type (T :a) (T String))
(define x (the (T Unit) (T "x)))
(define y (the (T (List Integer)) (T "y")))
(define z (the (T (List (List Integer))) (T "z")))
If so, have zero-cost data abstractions been in discussion in the context of coalton?
How would this work?
While this literally inlines them, it doesn't seem to actually inline them in a useful way.
Inlining call sites recursively is a difficult problem. It's easy to write a compiler pass to do it, but figuring out heuristics for which functions should be inlined is hard. Recursive inlining would also break the redefinition in the REPL. Coalton's inlining is on par with common lisp's (inline when told to) but much worse than the state of the art (global heuristic based inlining w/ LTO.) I don't think it's an unsolvable problem, but there's a lot of details to work out.
Toplevel Coalton functions compile to defuns. You can just inline them with (declare (inline function-name)).
@eliaslfox A DEFUN can only be inlined if it was declared INLINE before the definition. The pattern to make a function inlineable is
(declaim (inline f))
(defun f ...)
(declaim (notinline f))
This tells CL to save inlining info for F so as to be explicitly inlined later.
Inlining call sites recursively is a difficult problem. It's easy to write a compiler pass to do it, but figuring out heuristics for which functions should be inlined is hard. Recursive inlining would also break the redefinition in the REPL. Coalton's inlining is on par with common lisp's (inline when told to) but much worse than the state of the art (global heuristic based inlining w/ LTO.) I don't think it's an unsolvable problem, but there's a lot of details to work out.
I'm not referring to the heuristics themselves. May be there's a better word than "inlining" to refer to what I have in mind: I am okay with adding certain declarations or anything, but the end result I'm looking for is that the disassemble
of add-two-caller
should contain the assembly instruction ADDSS
when the code is compiled, using say SBCL, on an intel CPU. At the moment, the disassembly instead contains a call to COALTON-LIBRARY/MATH/NUM::|INSTANCE/NUM SINGLE-FLOAT-COALTON-LIBRARY/CLASSES:+
instead of a simple ADDSS
instruction.
If so, have zero-cost data abstractions been in discussion in the context of coalton? How would this work?
I'm not sure how although I do have some ideas (EDIT: That idea is a dead-end.), but just to clarify - I should have written zero run-time cost data abstraction. For instance, when the following C code is compiled using gcc
, then the disassembly of do_tf
and do_raw
are identical, no runtime cost is encoutered by using a twin_float
structure/type:
struct twin_float{double a; double b;};
double do_tf(int n){
double sum = 0;
for(int i=0; i<n; i++){
struct twin_float tf = {(double)i, (double)i};
sum += tf.a;
sum += tf.b;
}
return sum;
}
double do_raw(int n){
double sum = 0;
for(int i=0; i<n; i++){
double a = (double)i;
double b = (double)i;
sum += a;
sum += b;
}
return sum;
}
However, when the following code is compiled using SBCL 2.3.8 on a 64-bit intel processor, the disassembly of do-tf
and do-raw
are actually different - the code for do-tf
actually contains instructions to create a twin-float
value. And in this particular case, the runtime performance impact is non-existent to trivial at best, I still wonder if there are existing lisp/coalton efforts to address this.
(declaim (inline make-twin-float))
(defstruct (twin-float (:constructor make-twin-float (a b)))
(a 0.0f0 :type double-float)
(b 0.0f0 :type double-float))
(defun do-tf (n)
(declare (optimize speed)
(type double-float n))
(let ((sum 0.0d0))
(declare (type double-float sum))
(loop :for i :from 0.0d0 :below n :by 1.0d0
:do (let ((tf (make-twin-float i i)))
(declare (dynamic-extent tf))
(incf sum (twin-float-a tf))
(incf sum (twin-float-b tf))))
sum))
(defun do-raw (n)
(declare (optimize speed)
(type double-float n))
(let ((sum 0.0d0))
(declare (type double-float sum))
(loop :for i :from 0.0d0 :below n :by 1.0d0
:do (let ((a i)
(b i))
(incf sum a)
(incf sum b)))
sum))
As I understand, it is the following assembly code in do-tf
that indicates the allocation of twin-float
:
; 4E: C70039080000 MOV DWORD PTR [RAX], 2105
; 54: C74004039B2D50 MOV DWORD PTR [RAX+4], 1345166083
As I understand, the issue above is that +
itself gets inlined, but its instances do not get inlined.
I am looking at the coalton code, and I haven't yet been able to understand how a toplevel define-instance
form gets macroexpanded. A proposal was this: If a function-name
has been declaimed inline previously, then the corresponding function instances created by define-instance
also get declaimed inline. Later on, the function-name
may be declaimed notinline
to prevent inlining by default.
This way, whenever a user writes (declare (inline function-name))
, then, not only is function-name
inlined, but its appropriate instance is inlined too. In the case of +
above, the instance is COALTON-LIBRARY/MATH/NUM::|INSTANCE/NUM SINGLE-FLOAT-COALTON-LIBRARY/CLASSES:+|
.
Since coalton
is our own interface layer and (I believe) we already walk the code when converting it to CL, can we track toplevel functions as objects and make a calling graph of function invocations, so that we can inline all coalton
functions and just recompile already-compiled callers when the callee is redefined?
It's more orchestration overhead, but that should be regarding compilation time, not runtime.
Yeah you could do that. If you changed the type signature of a function then you would need to re-typecheck all call sites.
Hello,
Something seems off regarding performance and inlining in the current master branch version of coalton, or may be I'm missing something.
I started SBCL with
rlwrap sbcl --no-userinit --load quicklisp/setup.lisp
:I expected coalton to take about the same time as the hand-optimized function, but it took significantly longer and seems closer to the lisp version than the hand-optimized version. (I also set the minimum and maximum clock speeds for my CPU at 0.8GHz.)
So, I checked the disassembly of the
add-two-caller
below, and it turns out to be not inlined:Am I missing something here - or is this a (known) bug?