Closed ChaosEternal closed 8 years ago
lrwxrwxrwx 1 chaos chaos 6 Apr 27 10:59 petite -> scheme
This is not strange. Run the same program under petite and you'll find the running time doesn't change much, so it's not a compiler issue but a runtime-library one. Chez Scheme's bignum operations are indeed not as optimized as Vicare's, which are powered by GMP.
@dybvig what's the reason for not using GMP? Maybe licensing?
@hyln9 Thanks for pointing it out.
I updated the test program:
(import (rnrs (6)))

(define (fib n)
  (define (iter a b c)
    (cond ((= c 0) b)
          (#t (iter (mod (+ a b) 10232123) a (- c 1)))))
  (iter 1 0 n))

(display (fib 1000000000))
and the output becomes:
$ time scheme --script fib.scm
8461706
real 0m15.524s
user 0m15.528s
sys 0m0.008s
$ time vicare --r6rs-script fib.scm
real 0m24.490s
user 0m24.480s
sys 0m0.028s
I'm actually happy to see issues related to performance. Good performance is, after all, one of our primary goals. It's true that a thorough comparison would involve a lot more and more careful benchmarking, but it's nice to have examples where we aren't doing as well as we could, especially if it motivates contributors to submit fixes.
Using gmp would indeed speed up programs that operate on large numbers, particularly programs that multiply and divide very large numbers. Licensing is the primary issue blocking us from using it. Another is that our own, different representation of bignums is baked into the compiler and run-time system in a couple of ways.
@dybvig Thanks for explaining! Now that Chez has its own bignum system, a GMP port may be an unnecessary contribution, and there's the license issue; maybe the existing code just needs polishing where possible. That's the reason why I asked.
My original motivation for submitting this issue was a suspicion that the code released here might not be the same as the Chez that used to be distributed; my mistake.
Thanks to @hyln9's help, I finally realized that the performance difference is introduced by GMP, so I changed the subject to the current one. If the subject is still inappropriate, I can change it to anything else.
Still, my opinion is that the issue is valid and awaits resolution.
I think gmplib is LGPLv3, hence compatible with the Apache 2.0 license.
libgmp is an extremely large and complex body of code that would weigh down the rest of the system, making it larger and slower to compile. I don't think it is necessarily a good idea to include it. Are you planning on implementing this, @ChaosEternal? If so, I would urge you to consider other bignum libraries.
Apache 2.0 licensed software can be used in LGPLv3 licensed projects, but not the other way around.
For what it's worth, the Scheme48 library (also used in the Chicken numbers egg) is written in Scheme.
Sorry about my comment. I have seen many biased and premature benchmark comparisons recently, so I tend to be wary of such things getting into the issue system and turning GitHub into a discussion forum ;)
It seems this one is a more or less valid question about bignums. I'd still suggest removing the word "poor" from the title, because the performance is not poor compared to similar languages with bignum implementations (such as Java).
Just for interest, I implemented the semantically same iterative fib function in three languages: Scheme, Java (using Java's BigInteger class), and Python.
(define fib
  (lambda (n)
    (define iter
      (lambda (a b n)
        (cond
          [(= n 0) b]
          [else (iter (+ a b) a (- n 1))])))
    (iter 1 0 n)))
The running results are quite interesting. I couldn't get Vicare built properly on the Mac after fixing several things in its build system, but it is essentially just Ikarus.
Here are the results of running (fib 1000000) on my MacBook Pro.
Ikarus 0.0.3: 7.69s
Chez Scheme: 16.69s
Petite Chez: 17.66s
Java 8: 15.71s
Python: 9.86s
So Chez's bignum performance is similar to that of Java. It's interesting how Python stands out in performance for bignums.
The performance of bignums in Ikarus is due to GMP. I took a look at GMP's code, and there are lots of processor-specific optimizations done in assembly language with fancy instructions! Given how rarely people want super-fast bignum arithmetic, I doubt Chez Scheme would want all those nasty details in the compiler ;)
But if you write a recursive implementation of fib:
(define fib
  (lambda (n)
    (cond
      [(< n 2) n]
      [else (+ (fib (- n 1)) (fib (- n 2)))])))
and run (fib 48), you will get very different results.
Ikarus 0.0.3: 52.10s
Chez Scheme: 36.55s
Petite Chez: 409.53s
Racket 6.5: 62.46s
Java 8: 20.39s
C#(.net core): 191.82s
C#(VS): 40.10s
C(clang): 30.52s
C(gcc): 18.56s
Go: 31.56s
Swift: 36.37s
Python(PyPy): 189.47s
Python(CPython): >20 mins and still not finished
Take a look at Python's performance and compare with its bignum times :)
Do you want super fast bignum arithmetic that normal programs rarely use, and sacrifice everyday performance on function calls, data structures, closures? After all, you can still call GMP and fast C code for such rare computations with FFI, so I don't think this is a big issue.
@yinwang0 Agreed. People often misunderstand where Chez's strengths lie, so toy benchmarks seem popular :)
However, efficient numeric-tower operations are also essential. For bignum addition, GMP mostly wins through hardware-specific optimizations, which are not generic and may not suit current Chez; but for multiplication and division, Chez's algorithms are already suboptimal and could be improved by a wide margin in plain C.
There might be a way...
I found that Ruby and Python both use GMP, which makes them quite fast in this test. Ruby is BSD-licensed, yet they seem to be able to call GMP, which is LGPL.
From this line, you can see that Ruby is using GMP:
But from doc/ChangeLog-1.9.3 (line 14778), you can see that they removed all LGPL code from the source:
Since Chez Scheme's Apache license is very similar to BSD, I wonder if Chez Scheme can do the same thing as Ruby: link to the GMP library but not include its code, and just ask users to install GMP themselves if they need it.
That might be an option. It would require conversions from Chez Scheme's representation to GMP's and back for each call into GMP and presumably used only in cases where that overhead is justified. Or Chez Scheme's representation could be changed to match GMP's if GMP's representation is part of the public interface and the change doesn't add undue overhead of some other sort.
Gambit Scheme's Bignums are pure scheme and quite fast (not as fast as GMP right now but it has been faster on certain operations for brief periods in the past). Maybe porting their algorithms would be better than adding a dependency on GMP.
I just tried the bignum fib program on Gambit, and it's slower than Chez. I'm not sure I ran it the right way, but both gsi fib.ss and gsc fib.ss; gsi fib.o1 got similar results. It takes at least twice as much time as Chez.
I ran my second program with Chicken Scheme; it took around 1 minute to finish as a compiled Chicken program.
You have to add some declarations to have Gambit actually enable most of its optimizations, at minimum (declare (block) (standard-bindings)). It also depends on how you built Gambit's own runtime.
A micro benchmark like this might be testing the allocator more than the bignum routines anyway.
I also wonder if I enabled Chez Scheme's optimizations properly. Any suggestions?
My only real suggestion is that, to properly test bignum implementations head to head, it makes sense to find some real bignum benchmarks. Like I said above, this fib function is probably more affected by the allocation and garbage-collection policies than by the implementation of bignum +.
Aubrey Jaffer's note "The Distribution of Integer Magnitudes in Polynomial Arithmetic" is worth a read: http://people.csail.mit.edu/jaffer/CNS/DIMPA
Using a symbolic algebra system he examines the sizes of the integers used in the program. The conclusion is that it is the performance of small bignums that is the most important. He adds:
The large reduction in frequency of occurrence versus bit-length means that small improvements in asymptotic running times for exotic bignum algorithms would bring negligible benefit in running this computer algebra program.
I have a similar opinion; the fib example is more like a GC stress test. Guile (which uses libgmp for bignums) runs fib faster with GC disabled. The note given by @soegaard reveals that Guile does more collection work when GC is enabled, which seems to be the main reason for the loss.
If gc is activated however, guile performs the calculation within 200KB and calls gc several times. The whole thing takes about 25 seconds (!),
I don't know the difference in GC strategy between Chez and Vicare yet. But obviously this fib issue has two parts:
IMO, it is too early to come to a conclusion about the bottleneck before deeper research.
Well, I have some results after a little research. I was planning to disable GC in Chez, but there seems to be no option for that. Fortunately, Chez provides useful parameters for tweaking GC. The original result is:
1749 collections
17.650081080s elapsed cpu time, including 0.096985402s collecting
17.651558134s elapsed real time, including 0.099052668s collecting
43407533616 bytes allocated, including 43366316784 bytes reclaimed
Then I tweaked these two:
(collect-trip-bytes (* 10 (collect-trip-bytes))) ; allocate 10x more between collection requests
(collect-generation-radix 1000000)               ; set very large to avoid collecting frequently
Then the result is:
439 collections
18.025043206s elapsed cpu time, including 0.081797687s collecting
18.026762123s elapsed real time, including 0.082575259s collecting
43406709264 bytes allocated, including 43316982624 bytes reclaimed
Well, the number of collections and the collecting time drop substantially, but the CPU time is almost unchanged. I don't know if this is the right tweaking, but if it is a correct way to decrease GC activity, then my assumption is wrong: GC has little effect on this issue, and bignum operations are what matter.
@NalaGinrut You can disable the collector using the collect-request-handler parameter (from the 8.4 version of CSUG, chapter 13: http://scheme.com/csug8/smgmt.html#./smgmt:s16):
Automatic collection may be disabled by setting collect-request-handler to a procedure that does nothing, e.g.:
(collect-request-handler void)
However, as the collection time is fairly small, I would hazard a guess that it is actually the bignum arithmetic that is killing this. I suspect we'll have to improve the bignum operations in number.c to improve these results.
The fib test will indeed stress the GC a little, but this only makes the bignum-calculation comparison worse for Chez Scheme, because Ikarus spends a lot more time in GC (1.5s Ikarus vs 0.1s Chez). Actually, I doubt you can do bignum benchmarks without using the GC at all, because bignums are allocated in the heap.
To be sure what we are talking about, this is the code:
(define fib-it
  (lambda (n)
    (define iter
      (lambda (a b n)
        (cond
          [(= n 0) b]
          [else (iter (+ a b) a (- n 1))])))
    (iter 1 0 n)))
It is iterative (tail recursive), so in principle we could allocate everything on the stack and wouldn't really need heap allocation or GC. But GC does happen because bignums are stored in the heap. I guess we could put them on the stack with the help of escape analysis.
Running (fib-it 1000000) cost Chez only 0.1 seconds of GC time (thanks to the generational collector?). So most of the time (17s) was spent in bignum calculations.
Running bignum test in Chez Scheme (0.1s GC time):
> (time (display (< 1 (fib-it 1000000))))
#t(time (display (< 1 ...)))
1750 collections
17.173180000s elapsed cpu time, including 0.100478000s collecting
17.178098000s elapsed real time, including 0.102999000s collecting
43407456912 bytes allocated, including 43407730416 bytes reclaimed
In comparison, Ikarus spends significantly more time (1.5s) in GC:
> (time (display (< 1 (fib-it 1000000))))
#trunning stats for (display (< 1 (fib-it 1000000))):
10421 collections
7171 ms elapsed cpu time, including 1512 ms collecting
7172 ms elapsed real time, including 1521 ms collecting
43402029744 bytes allocated
This suggests that Chez Scheme's garbage collector is superior to Ikarus's. To be sure, you can test Ikarus with a simple factorial program written in two different ways (recursive and iterative):
Recursive factorial in Ikarus (caused 8.9s GC time):
(define fact
  (lambda (n)
    (cond
      [(= n 0) 1]
      [else (* n (fact (- n 1)))])))
> (time (< 1 (fact 200000)))
running stats for (< 1 (fact 200000)):
9518 collections
15006 ms elapsed cpu time, including 8915 ms collecting
15013 ms elapsed real time, including 8930 ms collecting
38616435792 bytes allocated
Iterative factorial in Ikarus (caused only 3s GC time):
(define fact-it
  (lambda (n)
    (define fact1
      (lambda (n prod)
        (cond
          [(= n 0) prod]
          [else (fact1 (- n 1) (* n prod))])))
    (fact1 n 1)))
> (time (< 1 (fact-it 200000)))
running stats for (< 1 (fact-it 200000)):
10425 collections
9874 ms elapsed cpu time, including 3020 ms collecting
9883 ms elapsed real time, including 3035 ms collecting
42222965144 bytes allocated
Running both fact and fact-it in Chez, you won't notice much difference in GC time. Both are about 0.1s!
Recursive factorial in Chez (0.11s GC time):
> (time (< 1 (fact 200000)))
(time (< 1 ...))
195 collections
25.422836000s elapsed cpu time, including 0.110350000s collecting
25.465419000s elapsed real time, including 0.110865000s collecting
38622551408 bytes allocated, including 38222576752 bytes reclaimed
Iterative factorial in Chez (0.11s GC time):
> (time (< 1 (fact-it 200000)))
(time (< 1 ...))
390 collections
27.385098000s elapsed cpu time, including 0.115547000s collecting
27.442736000s elapsed real time, including 0.116543000s collecting
42224402672 bytes allocated, including 42222535536 bytes reclaimed
So indeed Ikarus's garbage collector is not as good as Chez's, which makes the bignum-calculation comparison even worse for Chez ;) Since factorial uses multiplication, we can see GMP does make a big difference.
I agree that we need a better bignum benchmark, but a simple example seems good enough to demonstrate what's going on.
@lemaster @NalaGinrut @akeep @yinwang0
Firstly, I think for multiplication and division there is no mystery.
For addition, I quickly wrote a benchmark for GMP:
#include <stdio.h>
#include <gmp.h>

int main()
{
    mpz_t a, b, c;
    mpz_init(a);
    mpz_init(b);
    mpz_init(c);
    mpz_ui_pow_ui(a, 2, 1000000);
    mpz_ui_pow_ui(b, 2, 1000000);
    printf("start!\n");
    for (int i = 0; i < 1000000; i++) {
        mpz_add(c, a, b);
    }
    return 0;
}
We can hardly do in-place bignum addition in Scheme, so allocation could be more of a problem than collection, since the GC can batch multiple collections. We do not know the allocation time in Chez Scheme, but it seems it can also be ignored, judging by the comparison between Ikarus and Chez (see below), which was inspired by @yinwang0's work.
For clarity, a Scheme version is also available:
; for Chez
(run-cp0 (lambda (cp0 x) x))
(optimize-level 0)

; for Ikarus
(cp0-effort-limit 0)
(optimize-level 0)

; for both
(let [(a (expt 2 1000000))
      (b (expt 2 1000000))]
  (time
    (do [(i 0 (+ i 1))
         (c 0 (+ a b))]
        [(= i 1000000)])))
On my machine, the results are as below (after warming up):
GMP (the time before "start!" is short enough to ignore, so it is included):
real 0m10.100s
user 0m10.096s
sys 0m0.000s
Chez:
2000 collections
55.476879137s elapsed cpu time, including 0.244292071s collecting
55.481766220s elapsed real time, including 0.247766474s collecting
125025280000 bytes allocated, including 125028249216 bytes reclaimed
Ikarus:
14925 collections
14660 ms elapsed cpu time, including 896 ms collecting
14663 ms elapsed real time, including 902 ms collecting
125024000032 bytes allocated
So addition performance is still low compared to GMP.
I wrote a similar Java program doing (fact 200000). It shows that Java's BigInteger has better multiplication performance than Chez's: it took Java 12.8s.
In summary:
Ikarus: 9.9s
Java: 12.8s
Chez: 27.4s
GMP is still the best among the three, for obvious reasons.
Another interesting thing I observed is that at a higher optimize-level, Ikarus can perform DCE on my benchmark code above, while Chez cannot, whether in the REPL or from the command line, with "--program" or without, with cp0 or without.
What does "DCE" stand for?
@yinwang0 "dead code elimination", my bad.
As we think about trying to improve the bignum implementation in Chez, it might be interesting to take a look at the Glasgow Haskell Compiler (GHC) work around the same issue: https://ghc.haskell.org/trac/ghc/wiki/ReplacingGMPNotes.
They started from a different point, in that GHC had been using the GNU MP, but there were uses of GHC for whom the LGPL licensing of GNU MP was problematic (see: https://ghc.haskell.org/trac/ghc/ticket/601). They experimented with a number of other fast arbitrary precision libraries with more permissive licenses, as well as creating a "fast enough" implementation in Haskell. One (or more) of these more permissively licensed libraries might be an interesting option for improving Chez's bignum performance. As of 7 years ago (when the GHC community went through this exercise), there were quite a few interesting trade offs in performance given the libraries available at the time. I found an additional library, bigz, listed on the wikipedia arbitrary precision page. Since it has been 7 years since the GHC community did these experiments, it is probably worth evaluating the options again.
@hyln9 I'll have to look at the DCE as a separate issue.
@akeep Nice information. Personally, I don't like GNU MP because of its license. If we want to keep Chez reliable and clean, I'd prefer directly improving the implementations in number.c instead of depending on third-party libraries. I can help with this if appropriate.
@yinwang0: This is Scheme, where fixnum overflow (using the term "fixnum" broadly) can return an exact non-fixnum, or an inexact approximation, or raise an implementation-restriction error, depending on the implementation. This is true of all standards from R4RS onwards at least, and of most implementations. See http://trac.sacrideo.us/wg/wiki/NumericTower for what many existing Schemes actually do.
@hyln9 @akeep I think this fib case is too simple to take advantage of DCE, so maybe raise another issue?
Personally, I'm not interested in porting libgmp to Chez. What I'm afraid of is that very-big-number operations get optimized to be faster while small-number operations get slower as a compromise. For Chez, I think the strategy could be less extreme. Maybe we can find a way to improve big-number operations within limited expectations (say, not as fast as libgmp), as long as it doesn't drag down small-number operations. Well, I confess this sounds idealistic. Anyway, if we can't find a way to improve big numbers without affecting small numbers, I'd prefer to keep things as they are.
@NalaGinrut
Chez obviously does DCE, so this is either a bug or a misunderstanding, rather than a performance issue that needs complex benchmarking.
As far as I know, there is no evidence showing that GMP has drawbacks on relatively small (i.e. mid-size) numbers. I agree with your opinions on performance, but fixnums are managed by the compiler itself without overhead, while there are numerous algorithms and strategies for mid-size bignum operations.
@hyln9 Alright, I've taken another look: GMP uses different algorithms for different operand sizes, so maybe it's not the case I was afraid of. I apologize for the confusion.
I looked at the code of Chez's number.c and wonder if there is a simple way to make addition faster on 64-bit machines.
I see that the bigit type is always defined as U32, and big_add operates on bigits, so it's not going to use the 64-bit addition instruction and is doing twice as many additions.
@yinwang0 In addition, since the adc instruction (or an intrinsic) is widely available, the EADDC macro is unnecessary. Though in the end we'd be getting farther and farther from standard C…
@hyln9 Indeed, it is the ADC instruction doing the trick in GMP. I played with GMP's configuration again and found that Ikarus can't use GMP's 64-bit ABI. I had to compile GMP with ABI=32, so Ikarus was not utilizing the 64-bit instructions. But it's still a lot faster because it's using x86's ADC instruction.
The code actually used by Ikarus on my machine is in mpn/x86/p6/aors_n.asm, which is symlinked to mpn/add_n.asm after running ./configure ABI=32.
The assembly code looks like this, where ADCSBB is defined to be adc earlier. It's just repeatedly calling adc.
define(ADCSBB, adc)
... ...
L(top):
jecxz L(end)
L(ent):
Zdisp( mov, 0,(up,n,4), %eax)
Zdisp( ADCSBB, 0,(vp,n,4), %eax)
Zdisp( mov, %eax, 0,(rp,n,4))
mov 4(up,n,4), %edx
ADCSBB 4(vp,n,4), %edx
mov %edx, 4(rp,n,4)
mov 8(up,n,4), %eax
ADCSBB 8(vp,n,4), %eax
mov %eax, 8(rp,n,4)
mov 12(up,n,4), %edx
ADCSBB 12(vp,n,4), %edx
mov %edx, 12(rp,n,4)
mov 16(up,n,4), %eax
ADCSBB 16(vp,n,4), %eax
mov %eax, 16(rp,n,4)
mov 20(up,n,4), %edx
ADCSBB 20(vp,n,4), %edx
mov %edx, 20(rp,n,4)
mov 24(up,n,4), %eax
ADCSBB 24(vp,n,4), %eax
mov %eax, 24(rp,n,4)
mov 28(up,n,4), %edx
ADCSBB 28(vp,n,4), %edx
mov %edx, 28(rp,n,4)
lea 8(n), n
jmp L(top)
@yinwang0
GMP's implementation of addition above combines adc with loop unrolling, which increases the ratio of arithmetic instructions.
On the other hand, 64-bit and 32-bit adc have the same throughput on x86_64, so the 64-bit form doubles the potential performance.
But all this might be off-topic.
It would be great for someone to add the requisite ifdefs and asm instructions to use more efficient operators, and for someone to implement fancier algorithms, but please be careful in the process not to copy (or even study) code from GNU-licensed systems like GMP and Ikarus. Though Aziz would probably be willing to put the portions of the code you want to use under the Apache 2.0 license.
@dybvig
Yeah, that's the reason why I haven't looked at GMP's code yet, except for the snippet above, so I don't know GMP's algorithms. My main references are Wikipedia pages and academic papers without code (being an undergraduate, I do have access). Considering the well-known nine-lines-of-code court case, we indeed need a clean-room implementation.
As for Ikarus, things might be different because the techniques are different.
@hyln9 GMP detected my processor as Haswell correctly and configured itself to use 64-bit instructions and registers. It's just Ikarus 0.0.3 that can't use 64-bit. So GMP uses adc with unrolling only for addition, and some MMX and SSE instructions for other operations. Addition seems fast enough with just ADC and is easy to implement with GCC's extended asm.
@yinwang0 It seems that there are no SIMD instructions for adding with carry on x86 yet.
I doubt there will be SIMD addition with carry, because carry propagation is hard to do in parallel. If you succeeded, you would effectively have made a 128-bit machine or wider.
http://stackoverflow.com/questions/27923192/practical-bignum-avx-sse-possible
It sounds like we would have to rewrite all bignum operations and maintain the related assembly code for all supported platforms, which libgmp has already done. Take addition as an example: if we don't care about x86, maybe just rewriting EADDC is enough (no?). For x86, or any future platform that lacks the needed AVX, it's necessary to keep the current EADDC. Well, maybe libgmp is still an option.
I wrote a small demo program using GCC's extended assembly and the ADC instruction.
https://gist.github.com/yinwang0/290f34bb567a896eada4745173aa4477
The main part of the demo follows exactly the names in big_add_pos, so it's easy to swap the code into Chez Scheme.
The demo itself seems to be correct, but after swapping the code into big_add_pos and rebuilding Chez Scheme after "make clean", I got an error saying "nonrecoverable invalid memory reference".
It looks like a good starting point. I'm going to look into it more, but to make the development "parallel", some of you may want to try it and figure out how to make it work.
For your convenience of offering help, I committed the changes to my fork:
https://github.com/yinwang0/ChezScheme/commits/improve-big-add
@NalaGinrut It looks like Scheme doesn't provide that many bignum operators, and the open-sourced Chez Scheme doesn't support that many architectures. I think it's worthwhile enough if we can just make those few operations fast on x86, because that's what most people use.
;; the fib iterative
(import (rnrs))

(define (fib n)
  (define (iter a b c)
    (cond ((= c 0) b)
          (#t (iter (+ a b) a (- c 1)))))
  (iter 1 0 n))

(display (fib 1000000))
Chez Scheme Version 9.4 Copyright 1984-2016 Cisco Systems, Inc.
$ time scheme --program fib.scm >/dev/null
real 0m23.007s
user 0m22.956s
sys 0m0.076s
Copyright (c) 2006-2010 Abdulaziz Ghuloum and contributors Copyright (c) 2011-2015 Marco Maggi and contributors
$ time vicare fib.scm >/dev/null
real 0m6.372s
user 0m5.900s
sys 0m0.468s