cisco / ChezScheme

Chez Scheme
Apache License 2.0

Poor performance when dealing with multi-precision numbers #14

Closed ChaosEternal closed 8 years ago

ChaosEternal commented 8 years ago

;; the iterative fib
(import (rnrs))

(define (fib n)
  (define (iter a b c)
    (cond ((= c 0) b)
          (#t (iter (+ a b) a (- c 1)))))
  (iter 1 0 n))

(display (fib 1000000))

Chez Scheme Version 9.4 Copyright 1984-2016 Cisco Systems, Inc.

$ time scheme --program fib.scm >/dev/null

real    0m23.007s
user    0m22.956s
sys     0m0.076s

Copyright (c) 2006-2010 Abdulaziz Ghuloum and contributors
Copyright (c) 2011-2015 Marco Maggi and contributors

$ time vicare fib.scm >/dev/null

real    0m6.372s
user    0m5.900s
sys     0m0.468s

ChaosEternal commented 8 years ago

lrwxrwxrwx 1 chaos chaos 6 Apr 27 10:59 petite -> scheme

hyln9 commented 8 years ago

This is not surprising. Run the same program under Petite Chez and you'll find the running time doesn't change much, so it's not a compiler issue but a runtime-library one. Chez Scheme's bignum operations are indeed not as optimized as Vicare's, which are powered by GMP.

NalaGinrut commented 8 years ago

@dybvig What's the reason for not using GMP? Maybe because of licensing?

ChaosEternal commented 8 years ago

@hyln9 Thanks for pointing it out.

I updated the test program:

(import (rnrs (6)))

(define (fib n)
  (define (iter a b c)
    (cond ((= c 0) b)
          (#t (iter (mod (+ a b) 10232123) a (- c 1)))))
  (iter 1 0 n)
)

(display (fib 1000000000))

and the output becomes:

$ time scheme --script fib.scm
8461706
real    0m15.524s
user    0m15.528s
sys     0m0.008s
$ time vicare --r6rs-script fib.scm

real    0m24.490s
user    0m24.480s
sys     0m0.028s
dybvig commented 8 years ago

I'm actually happy to see issues related to performance. Good performance is, after all, one of our primary goals. It's true that a thorough comparison would involve a lot more and more careful benchmarking, but it's nice to have examples where we aren't doing as well as we could, especially if it motivates contributors to submit fixes.

Using gmp would indeed speed up programs that operate on large numbers, particularly programs that multiply and divide very large numbers. Licensing is the primary issue blocking us from using it. Another is that our own, different representation of bignums is baked into the compiler and run-time system in a couple of ways.

NalaGinrut commented 8 years ago

@dybvig Thanks for explaining! Since Chez has its own bignum system, and there is a licensing issue with GMP, a porting contribution is probably unnecessary; maybe the existing code just needs polishing where possible. That's why I asked.

ChaosEternal commented 8 years ago

My original motivation for submitting this issue was a suspicion that the code released here might not be the same as the Chez Scheme that used to be distributed; my mistake.

Thanks to @hyln9's help, I now realize that the performance difference comes from GMP, so I changed the subject to the current one. If the subject is still inappropriate, I can change it to something else.

Still, my opinion is that the issue is valid and awaits resolution.

ChaosEternal commented 8 years ago

I think GMP is LGPLv3, hence compatible with the Apache 2.0 license.

rain-1 commented 8 years ago

libgmp is an extremely large and complex body of code that would weigh down the rest of the system: it would make it larger and slower to compile. I don't think it is necessarily a good idea to include it. Are you planning on implementing this, @ChaosEternal? If so, I would urge you to consider other bignum libraries.

dybvig commented 8 years ago

Apache 2.0 licensed software can be used in LGPLv3 licensed projects, but not the other way around.

johnwcowan commented 8 years ago

For what it's worth, the Scheme48 bignum library (also used in Chicken's numbers egg) is written in Scheme.

yinwang0 commented 8 years ago

Sorry about my comment. I have seen many biased and premature benchmark comparisons recently, so I tend to be sensitive when such things get into the issue system, turning GitHub into a discussion forum ;)

It seems this one is a sort-of valid question about bignums. I'd still suggest removing the word "poor" from the title, because the performance is not poor compared to similar languages with bignum implementations (such as Java).

Bignum results

Just for interest, I implemented the semantically identical iterative fib function in three languages: Scheme, Java (using its BigInteger class), and Python.

(define fib
  (lambda (n)
    (define iter
      (lambda (a b n)
        (cond
         [(= n 0) b]
         [else
          (iter (+ a b) a (- n 1))])))
    (iter 1 0 n)))

The running results are quite interesting. I couldn't get Vicare built properly on the Mac after fixing several things in its build system, but it is essentially just Ikarus.

Here are the results running (fib 1000000) on my Macbook Pro.

Ikarus 0.0.3:   7.69s
Chez Scheme:   16.69s
Petite Chez:   17.66s
Java 8:        15.71s
Python:         9.86s

So Chez's bignum performance is similar to that of Java. It's interesting how Python stands out in performance for bignums.

The performance of bignums in Ikarus is due to GMP. I took a look at the code of GMP, and there are lots of processor-specific optimizations done in assembly language with fancy instructions! Given how rarely people want super-fast bignum arithmetic, I doubt Chez Scheme would want all those nasty details in the compiler ;)

Recursive fib results

But if you write a recursive implementation of fib:

(define fib
  (lambda (n)
    (cond
     [(< n 2) n]
     [else
      (+ (fib (- n 1)) (fib (- n 2)))])))

and run (fib 48), you will get very different results.

Ikarus 0.0.3:      52.10s
Chez Scheme:       36.55s
Petite Chez:      409.53s
Racket 6.5:        62.46s
Java 8:            20.39s
C#(.net core):    191.82s
C#(VS):            40.10s
C(clang):          30.52s
C(gcc):            18.56s
Go:                31.56s
Swift:             36.37s
Python(PyPy):     189.47s
Python(CPython):   >20 mins and still not finished

Take a look at Python's performance and compare with its bignum times :)

Do you want super-fast bignum arithmetic that normal programs rarely use, at the cost of everyday performance on function calls, data structures, and closures? After all, you can still call GMP and fast C code for such rare computations through the FFI, so I don't think this is a big issue.

hyln9 commented 8 years ago

@yinwang0 Agreed. People often misunderstand where Chez's strength lies, so toy benchmarks seem popular :)

However, an efficient numeric tower is also essential. For bignum addition, GMP mostly wins on hardware-specific optimizations, which are not generic and may not suit current Chez; but for multiplication and division, Chez's algorithms are already suboptimal and could be improved by a wide margin in plain C.

yinwang0 commented 8 years ago

There might be a way...

I found that Ruby and Python both use GMP, which makes them quite fast in this test. Ruby is BSD-licensed, and it is still able to call GMP, which is LGPL.

From this line, you can see that Ruby is using GMP:

https://github.com/ruby/ruby/blob/trunk/doc/NEWS-2.1.0#L73

But from doc/ChangeLog-1.9.3 (line 14778), you can see that they removed all LGPL code from the source:

https://github.com/ruby/ruby/blob/trunk/doc/ChangeLog-1.9.3

Since Chez Scheme's Apache license is very similar to BSD, I wonder if Chez Scheme could do the same thing as Ruby: link to the GMP library without including its code, and just ask users to install GMP themselves if they need it.

dybvig commented 8 years ago

That might be an option. It would require conversions from Chez Scheme's representation to GMP's and back for each call into GMP, so it would presumably be used only in cases where that overhead is justified. Or Chez Scheme's representation could be changed to match GMP's, if GMP's representation is part of the public interface and the change doesn't add undue overhead of some other sort.

lemaster commented 8 years ago

Gambit Scheme's bignums are pure Scheme and quite fast (not as fast as GMP right now, though they have been faster on certain operations for brief periods in the past). Maybe porting their algorithms would be better than adding a dependency on GMP.

yinwang0 commented 8 years ago

I just tried the bignum fib program on Gambit, and it's slower than Chez. I'm not sure I ran it the right way, but both gsi fib.ss and gsc fib.ss followed by gsi fib.o1 gave similar results. It takes at least twice as much time as Chez.

ChaosEternal commented 8 years ago

I ran my second program with Chicken Scheme; it took around 1 minute to finish as a compiled Chicken program.

lemaster commented 8 years ago

You have to add some declarations to have Gambit actually enable most of its optimizations, at minimum (declare (block) (standard-bindings)). It also depends on how you built Gambit's own runtime.

A micro benchmark like this might be testing the allocator more than the bignum routines anyway.

yinwang0 commented 8 years ago

I also wonder if I enabled Chez Scheme's optimizations properly. Any suggestions?

lemaster commented 8 years ago

My only real suggestion is that, to properly test bignum implementations head to head, it makes sense to find some real bignum benchmarks. Like I said above, this fib function is probably more affected by the allocation and garbage collection policies than by the implementation of bignum addition.

soegaard commented 8 years ago

Aubrey Jaffer's note "The Distribution of Integer Magnitudes in Polynomial Arithmetic" is worth a read: http://people.csail.mit.edu/jaffer/CNS/DIMPA

Using a symbolic algebra system he examines the sizes of the integers used in the program. The conclusion is that it is the performance of small bignums that is the most important. He adds:

The large reduction in frequency of occurrence versus bit-length means that small improvements in asymptotic running times for exotic bignum algorithms would bring negligible benefit in running this computer algebra program.

NalaGinrut commented 8 years ago

I have a similar opinion: the fib example is more like a GC stress test. Guile (which uses libgmp for bignums) runs fib faster with GC disabled. The note given by @soegaard reveals that Guile does more collecting work with GC enabled, which seems to be the main reason for the loss:

If gc is activated however, guile performs the calculation within 200KB and calls gc several times. The whole thing takes about 25 seconds (!),

I don't know the difference in GC strategy between Chez and Vicare yet, but obviously this fib issue has two parts:

  1. operation efficiency of bignum (related to bignum system)
  2. alloc/collect efficiency of bignum (related to GC)

IMO, it is too early to come to a conclusion about the bottleneck before deeper research.

NalaGinrut commented 8 years ago

Well, I have some results after a little research. I was planning to disable GC in Chez, but there seems to be no option for it. Fortunately, Chez provides useful parameters for tweaking GC. The original result is:

    1749 collections
    17.650081080s elapsed cpu time, including 0.096985402s collecting
    17.651558134s elapsed real time, including 0.099052668s collecting
    43407533616 bytes allocated, including 43366316784 bytes reclaimed

Then I tweaked these two parameters:

(collect-trip-bytes (* 10 (collect-trip-bytes))) ; allocate 10x more bytes between collections
(collect-generation-radix 1000000) ; set very large to avoid collecting older generations frequently

Then the result is:

    439 collections
    18.025043206s elapsed cpu time, including 0.081797687s collecting
    18.026762123s elapsed real time, including 0.082575259s collecting
    43406709264 bytes allocated, including 43316982624 bytes reclaimed

The number of collections and the collecting time drop noticeably, but the CPU time is almost unchanged. I don't know if this is the right way to tweak it, but if it does decrease GC activity, then my assumption was wrong: GC has little effect on this issue, and the bignum operations are what matter.

akeep commented 8 years ago

@NalaGinrut You can disable the collector using the collect-request-handler parameter (from the 8.4 version of CSUG, chapter 13: http://scheme.com/csug8/smgmt.html#./smgmt:s16):

Automatic collection may be disabled by setting collect-request-handler to a procedure that does nothing, e.g.:

(collect-request-handler void)

However, as the collection time is fairly small, I would hazard a guess that it is actually the bignum arithmetic that is killing this. I suspect we'll have to improve the bignum operations in number.c to improve these results.

yinwang0 commented 8 years ago

The fib test will indeed stress the GC a little, but this only makes the bignum calculation time look worse for Chez Scheme, because Ikarus spends a lot more time in GC (1.5s for Ikarus vs 0.1s for Chez). Actually, I doubt you can do bignum benchmarks without involving GC at all, because bignums are allocated in the heap.

To be sure what we are talking about, this is the code:

(define fib-it
  (lambda (n)
    (define iter
      (lambda (a b n)
        (cond
         [(= n 0) b]
         [else
          (iter (+ a b) a (- n 1))])))
    (iter 1 0 n)))

It is iterative (tail recursive), so in principle we could allocate everything on the stack and wouldn't really need heap allocations or GC. But GC does happen because bignums are stored in the heap. I guess we could put them on the stack with the help of escape analysis.

Running (fib-it 1000000) costs Chez only 0.1 seconds in GC time (thanks to the generational collector?). So most of the time (17s) was spent in bignum calculations.

Running bignum test in Chez Scheme (0.1s GC time):

> (time (display (< 1 (fib-it 1000000))))
#t(time (display (< 1 ...)))
    1750 collections
    17.173180000s elapsed cpu time, including 0.100478000s collecting
    17.178098000s elapsed real time, including 0.102999000s collecting
    43407456912 bytes allocated, including 43407730416 bytes reclaimed

In comparison, Ikarus spends significantly more time (1.5s) in GC:

> (time (display (< 1 (fib-it 1000000))))
#trunning stats for (display (< 1 (fib-it 1000000))):
    10421 collections
    7171 ms elapsed cpu time, including 1512 ms collecting
    7172 ms elapsed real time, including 1521 ms collecting
    43402029744 bytes allocated

This means that Chez Scheme has a garbage collector superior to Ikarus's. To convince yourself, you can test Ikarus with a simple factorial program written in two different ways (recursive and iterative):

Recursive factorial in Ikarus (caused 8.9s GC time):

(define fact
  (lambda (n)
    (cond
     [(= n 0) 1]
     [else (* n (fact (- n 1)))])))
> (time (< 1 (fact 200000)))
running stats for (< 1 (fact 200000)):
    9518 collections
    15006 ms elapsed cpu time, including 8915 ms collecting
    15013 ms elapsed real time, including 8930 ms collecting
    38616435792 bytes allocated

Iterative factorial in Ikarus (caused only 3s GC time):

(define fact-it
  (lambda (n)
    (define fact1
      (lambda (n prod)
        (cond
         [(= n 0) prod]
         [else (fact1 (- n 1) (* n prod))])))
    (fact1 n 1)))
> (time (< 1 (fact-it 200000)))
running stats for (< 1 (fact-it 200000)):
    10425 collections
    9874 ms elapsed cpu time, including 3020 ms collecting
    9883 ms elapsed real time, including 3035 ms collecting
    42222965144 bytes allocated

Running both fact and fact-it in Chez, you won't notice much difference in GC time. Both are about 0.1s!

Recursive factorial in Chez (0.11s GC time):

> (time (< 1 (fact 200000)))
(time (< 1 ...))
    195 collections
    25.422836000s elapsed cpu time, including 0.110350000s collecting
    25.465419000s elapsed real time, including 0.110865000s collecting
    38622551408 bytes allocated, including 38222576752 bytes reclaimed

Iterative factorial in Chez (0.11s GC time):

> (time (< 1 (fact-it 200000)))
(time (< 1 ...))
    390 collections
    27.385098000s elapsed cpu time, including 0.115547000s collecting
    27.442736000s elapsed real time, including 0.116543000s collecting
    42224402672 bytes allocated, including 42222535536 bytes reclaimed

So indeed Ikarus's garbage collector is not as good as Chez's, which makes the bignum calculation time look even worse for Chez ;) Since factorial uses multiplication, we can see that GMP does make a big difference.

I agree that we need a better bignum benchmark, but a simple example seems good enough to demonstrate what's going on.

hyln9 commented 8 years ago

@lemaster @NalaGinrut @akeep @yinwang0

Firstly, I think for multiplication and division there is no mystery.

For addition I quickly wrote a benchmark for GMP:

#include <stdio.h>
#include <gmp.h>

int main()
{
  mpz_t a, b, c;
  mpz_init(a);
  mpz_init(b);
  mpz_init(c);
  mpz_ui_pow_ui(a, 2, 1000000);   /* a = 2^1000000 */
  mpz_ui_pow_ui(b, 2, 1000000);   /* b = 2^1000000 */
  printf("start!\n");
  for(int i = 0; i < 1000000; i++)
  {
    mpz_add(c, a, b);
  }
  mpz_clears(a, b, c, NULL);
  return 0;
}

We can hardly do in-place bignum addition in Scheme, so allocation could be more of a problem than collection, since the GC can batch work across multiple collections. We don't have a direct measurement of allocation time in Chez Scheme, but it seems it can also be ignored, judging from the comparison between Ikarus and Chez below, which is inspired by @yinwang0's work.

For clarification, scheme version is also available:

; for Chez
(run-cp0 (lambda (cp0 x) x))
(optimize-level 0)

; for Ikarus
(cp0-effort-limit 0)
(optimize-level 0)

; for both
(let [(a (expt 2 1000000))
      (b (expt 2 1000000))]
  (time
    (do [(i 0 (+ i 1))
         (c 0 (+ a b))]
        [(= i 1000000)])))

On my machine, the results are as follows (after warm-up):

GMP (the time before "start!" is printed is negligible, so it's included):

real    0m10.100s
user    0m10.096s
sys     0m0.000s

Chez:

2000 collections
55.476879137s elapsed cpu time, including 0.244292071s collecting
55.481766220s elapsed real time, including 0.247766474s collecting
125025280000 bytes allocated, including 125028249216 bytes reclaimed

Ikarus:

14925 collections
14660 ms elapsed cpu time, including 896 ms collecting
14663 ms elapsed real time, including 902 ms collecting
125024000032 bytes allocated

So addition performance is still low compared to GMP.

yinwang0 commented 8 years ago

I wrote a similar Java program doing (fact 200000). It shows that Java's BigInteger has better multiplication performance than Chez's. It took Java 12.8s.

In summary:

Ikarus: 9.9s
Java: 12.8s
Chez: 27.4s

GMP is still the best among the three, for obvious reasons.

hyln9 commented 8 years ago

Another interesting thing I observed is that at higher optimize-levels, Ikarus can perform DCE on my benchmark code above, while Chez cannot, whether in the REPL or on the command line, with --program or without, with cp0 or without.

yinwang0 commented 8 years ago

What does "DCE" stand for?

hyln9 commented 8 years ago

@yinwang0 "dead code elimination", my bad.

akeep commented 8 years ago

As we think about trying to improve the bignum implementation in Chez, it might be interesting to take a look at the Glasgow Haskell Compiler (GHC) work around the same issue: https://ghc.haskell.org/trac/ghc/wiki/ReplacingGMPNotes.

They started from a different point, in that GHC had been using GNU MP, but there were users of GHC for whom the LGPL licensing of GNU MP was problematic (see: https://ghc.haskell.org/trac/ghc/ticket/601). They experimented with a number of other fast arbitrary-precision libraries with more permissive licenses, as well as creating a "fast enough" implementation in Haskell. One (or more) of these more permissively licensed libraries might be an interesting option for improving Chez's bignum performance. As of 7 years ago (when the GHC community went through this exercise), there were quite a few interesting trade-offs among the libraries available at the time. I found an additional library, bigz, listed on the Wikipedia arbitrary-precision arithmetic page. Since it has been 7 years since the GHC community did these experiments, it is probably worth evaluating the options again.

@hyln9 I'll have to look at the DCE as a separate issue.

hyln9 commented 8 years ago

@akeep Good information. Personally, I don't like GNU MP because of its license. If we want to keep Chez reliable and clean, I'd prefer directly improving the implementations in number.c instead of depending on third-party libraries. I can help with it if appropriate.

johnwcowan commented 8 years ago

@yinwang0: This is Scheme, where fixnum overflow (using the term "fixnum" broadly) can return an exact non-fixnum, or an inexact approximation, or raise an implementation-restriction error, depending on the implementation. This is true of all standards from R4RS onwards, at least. See http://trac.sacrideo.us/wg/wiki/NumericTower for what many existing Schemes actually do.

NalaGinrut commented 8 years ago

@hyln9 @akeep I think this fib case is too simple to take advantage of DCE, so maybe raise another issue?

NalaGinrut commented 8 years ago

Personally, I'm not interested in porting libgmp to Chez. What I'm afraid of is that very-big-number operations get optimized to be faster while small-number operations get slower as a compromise. For Chez, I think the strategy could be less extreme: maybe we can find a way to improve big-number operations within limited expectations (say, not as fast as libgmp), as long as it doesn't drag down small-number operations. Well, I confess it sounds idealistic. Anyway, if we can't find a way to improve big numbers without affecting small numbers, I'd prefer to keep things as they are.

hyln9 commented 8 years ago

@NalaGinrut

Chez obviously does DCE, so this is either a bug or a misunderstanding, rather than a performance issue that needs complex benchmarking.

As far as I know, there is no evidence that GMP has drawbacks on relatively small (i.e. mid-size) numbers. I agree with your opinions on performance, but fixnums are managed by the compiler itself without overhead, while there are a number of algorithms and strategies for mid-size bignum operations.

NalaGinrut commented 8 years ago

@hyln9 Alright, I've taken another look: GMP uses different algorithms for different operand sizes, so maybe it's not the case I was afraid of. I apologize for the confusion.

yinwang0 commented 8 years ago

I looked at the code of Chez's number.c and wonder if there is a simple way of making addition faster on 64-bit machines.

I see that the bigit type is always defined as U32, and big_add operates on bigits, so it's not going to use the 64-bit addition instruction, and thus it does twice as many additions as necessary.

hyln9 commented 8 years ago

@yinwang0 In addition, since the adc instruction (or an intrinsic for it) is widely available, the EADDC macro is unnecessary. Then again, we'd be getting farther and farther from standard C…

yinwang0 commented 8 years ago

@hyln9 Indeed, it is the adc instruction doing the trick in GMP. I played with GMP's configuration again and found that Ikarus can't use GMP's 64-bit ABI; I had to compile GMP with ABI=32, so Ikarus was not utilizing 64-bit instructions. But it's still a lot faster because it uses x86's adc instruction.

The code that is in actual use by Ikarus on my machine is in this file:

mpn/x86/p6/aors_n.asm

It's symbol linked to mpn/add_n.asm after running ./configure ABI=32.

The assembly code looks like this, where ADCSBB is defined earlier to be adc. It's just executing adc repeatedly.

define(ADCSBB,        adc)
... ...

L(top):
    jecxz   L(end)
L(ent):
Zdisp(  mov,    0,(up,n,4), %eax)
Zdisp(  ADCSBB, 0,(vp,n,4), %eax)
Zdisp(  mov,    %eax, 0,(rp,n,4))

    mov 4(up,n,4), %edx
    ADCSBB  4(vp,n,4), %edx
    mov %edx, 4(rp,n,4)

    mov 8(up,n,4), %eax
    ADCSBB  8(vp,n,4), %eax
    mov %eax, 8(rp,n,4)

    mov 12(up,n,4), %edx
    ADCSBB  12(vp,n,4), %edx
    mov %edx, 12(rp,n,4)

    mov 16(up,n,4), %eax
    ADCSBB  16(vp,n,4), %eax
    mov %eax, 16(rp,n,4)

    mov 20(up,n,4), %edx
    ADCSBB  20(vp,n,4), %edx
    mov %edx, 20(rp,n,4)

    mov 24(up,n,4), %eax
    ADCSBB  24(vp,n,4), %eax
    mov %eax, 24(rp,n,4)

    mov 28(up,n,4), %edx
    ADCSBB  28(vp,n,4), %edx
    mov %edx, 28(rp,n,4)

    lea 8(n), n
    jmp L(top)
hyln9 commented 8 years ago

@yinwang0

GMP's implementation of addition above combines adc with loop unrolling, which increases the ratio of arithmetic instructions.

On the other hand, 64-bit and 32-bit adc have the same throughput on x86_64, so 64-bit limbs would double the potential performance.

But all these might be off-topic.

dybvig commented 8 years ago

It would be great for someone to add the requisite ifdefs and asm instructions to use more efficient operators, and for someone to implement fancier algorithms, but please be careful in the process not to copy (or even study) code from GNU-licensed systems like GMP and Ikarus. Though Aziz would probably be willing to put the portions of the code you want to use under the Apache 2.0 license.

hyln9 commented 8 years ago

@dybvig

Yeah, that's the reason I haven't looked at GMP's code yet, except for the snippet above, so I don't know GMP's algorithms. My main references are Wikipedia pages and academic papers without code (being an undergraduate, I do have access). Considering the well-known nine-lines-of-code court case, we indeed need a clean-room implementation.

As for Ikarus, things might be different because the techniques are different.

yinwang0 commented 8 years ago

@hyln9 GMP detected my processor as Haswell correctly and configured itself to use 64-bit instructions and registers. It's just Ikarus 0.0.3 that can't use 64-bit. So GMP uses adc with unrolling only for addition; it uses some MMX and SSE instructions for other operations. Addition seems fast enough with just adc and is easy to implement with GCC's extended asm.

hyln9 commented 8 years ago

@yinwang0 It seems that there are no SIMD instructions for adding with carry on x86 yet.

yinwang0 commented 8 years ago

I doubt there will ever be SIMD addition with carry, because carry propagation is hard to do in parallel. If you succeeded, you would effectively have built a 128-bit (or wider) machine.

http://stackoverflow.com/questions/27923192/practical-bignum-avx-sse-possible

NalaGinrut commented 8 years ago

It sounds like we'd have to rewrite all the bignum operations and maintain the related assembly code for all supported platforms, which libgmp has already done. Take addition as an example: if we don't care about x86-specific tuning, maybe just rewriting EADDC is enough (no?). For x86, or any platform that lacks the needed AVX instructions in the future, it would be necessary to keep the current EADDC as a fallback. Well, maybe libgmp is still an option.

yinwang0 commented 8 years ago

I wrote a small demo program using GCC's extended assembly and the ADC instruction.

https://gist.github.com/yinwang0/290f34bb567a896eada4745173aa4477

The main part of the demo exactly follows the names in big_add_pos, so it's easy to swap the code into Chez Scheme.

The demo itself seems to be correct, but after swapping the code into big_add_pos and rebuilding Chez Scheme after "make clean", I got an error saying "nonrecoverable invalid memory reference".

It looks like a good starting point. I'm going to look into it more, but to make the development "parallel", some of you may want to try it and find out how to make it work.

For your convenience in offering help, I committed the changes to my fork:

https://github.com/yinwang0/ChezScheme/commits/improve-big-add

yinwang0 commented 8 years ago

@NalaGinrut It looks like Scheme doesn't provide that many bignum operators, and the open-sourced Chez Scheme doesn't support that many architectures. I think it's worthwhile if we can just make those few operations fast on x86, because that's what most people use.