Poor performance when dealing with multi-precision numbers

ChaosEternal commented 8 years ago

The test Program:

;; the fib iterative (import (rnrs))

(define (fib n) (define (iter a b c) (cond ((= c 0) b) (#t (iter (+ a b) a (- c 1))))) (iter 1 0 n) )

(display (fib 1000000))

Run with Chez:

$ time scheme --program fib.scm >/dev/null

real 0m23.007s user 0m22.956s sys 0m0.076s

Run with vicare Vicare Scheme version 0.4d0, 64-bit Build 2015-10-19

$ time vicare fib.scm >/dev/null

real 0m6.372s user 0m5.900s sys 0m0.468s

johnwcowan commented 8 years ago

The operations on bignums that are needed are those listed in section 11.7.4 of R6RS that are meaningful when applied to exact integers, namely inexact = < > <= >= zero? positive? negative? odd? even? max min + - * / div mod div-and-mod div0 mod0 div0-and-mod0 gcd lcm expt number->string string->number.

NalaGinrut commented 8 years ago

@yinwang0 I haven't read the code carefully, but the inlined assembly code contains side-effects, it is generally to add volatile or __volatile__ to avoid the compiler breaking assembly code block while optimizing. https://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html#ss5.4 Or it may not safe:

[nalaginrut@debian:tmp] cc ba.c -o ba
[nalaginrut@debian:tmp] ./ba
z[0] = 1
z[1] = 0
z[2] = 0
z[3] = 0
z[4] = 0
[nalaginrut@debian:tmp] cc ba.c -o ba -O3
[nalaginrut@debian:tmp] ./ba
z[0] = 0
z[1] = 0
z[2] = 0
z[3] = 0
z[4] = 0

@johnwcowan Maybe we don't have to optimize them all, and it is necessary to do more benchmarks first.

yinwang0 commented 8 years ago

@NalaGinrut Indeed not having __volatile__ is one of the reasons that it's not right. Adding __volatile__ and I see other kinds of issue. But it seems I'm getting closer to get it work.

alexshpilkin commented 8 years ago

@NalaGinrut @yinwang0 On an inline assembly block, volatile basically means that the block has effects other than those explicitly indicated to GCC using the extended asm syntax input and output specifiers; indeed one of the reasons for the extended syntax is to allow GCC to do CSE, DCE and similar optimizations on inline assembly blocks. Writing asm volatile is basically giving up on explaining to the compiler what the assembly block does (think mov cr3); I don’t see how this should be required here. A smaller but still probably excessive hammer is the memory output.

On the other hand, it looks like the function signature may need some restrict qualifiers (i.e. the code doesn’t work correctly unless the pointers don’t alias each other), but I’ll leave that to the person that wrote it. (If you require GCC/ICC/clang you might as well assume support for restrict I think.)

yinwang0 commented 8 years ago

@alex-shpilkin I don't see why it's needed there but I got wrong results with gcc optimizations (-O1 and above) if there is no __volatile__. Maybe that's a hint why the code is still not correct.

yinwang0 commented 8 years ago

I fixed bugs in my code and now it can be built with Chez and produce correct results. The __volatile__ can be removed and still produce correct results with -O3.

The code is here in my fork:

https://github.com/yinwang0/ChezScheme/commits/improve-big-add

It's basically a translation of Chez's original code, but using ADC instruction to handle the carry.

Unfortunately the performance of the assembly code is not even as good as Chez's original C code. The timing is about 28s (my change) vs 17s (chez's original). So I'm no longer sure about ADC instruction is the real reason that GMP is fast. We could do some loop unrolling and see if that helps.

yinwang0 commented 8 years ago

Slowness fixed. That's because I used x86's loop instruction. Changed loop control to use jecxz and lea, and now it's faster than Chez's original code, although not much: 14s vs 18s.

I have no idea why loop instruction is so slow, maybe just old circuit leftover in the processor ;)

alexshpilkin commented 8 years ago

@yinwang0 Indeed, as Agner Fog’s tables show, LOOP (which existed in 8086 already) is a whopping 11 cycles on Sandy Bridge compared to two cycles for JECXZ (introduced in 80486 IIRC). In turn, JECXZ is not needed, because (recent Intel) processors are able to fuse the CMP/JNE pair into one microinstruction internally.

Most probably the loop would benefit from unrolling, but the degree of it is to be determined empirically, preferably on benchmarks with realistic number magnitudes. (Would the tail dominate? Can we just declare that bignums are allocated in 4x32/64 chunks? Should we align to 128-bit memory fetch lines [C11 alignas]? &c.)

yinwang0 commented 8 years ago

@alex-shpilkin I unrolled the loops and adjusted the way for indexing into the arrays. Now the timing is reduced to about 10.5s (as compared to its original 18s). It's still a little long as compared to Ikarus's 6.75s.

I can't use CMP / JNE because CMP may change the CF flag, which ADC will need. It's quite some tricks that it does by using LEA for addition into ECX, and JECXZ for looping.

The unrolling is done only 2x. I tried to unroll the loops using 4x32 chunks but it didn't improve over 2x32, so I'm currently keeping it at 2x just for clarity of the code. Some other things can be done for example using 64-bit addition. Also it needs some #if's for targeting correct architecture.

yinwang0 commented 8 years ago

@dybvig I got some endianness problems when trying to use 64-bit instructions. Since the 32-bit chunks of bignums are stored in a big-endian way, I got wrong byte order when using movq instruction of x86 which assumes little-endian. I could do my own byte order swapping back and forth, but this may not be efficient.

I wonder if Chez already conditionally handles the endianness regarding bignums? Is there an easy variable to change so can I switch it to 64-bit mode?

hyln9 commented 8 years ago

@NalaGinrut ADC has nothing to do with AVX, but indeed EADDC should be preserved for fallback. @yinwang0 Currently bigit_bits is always 32 regardless of machine type.

hyln9 commented 8 years ago

I'd like to discuss some design decisions here before going any further.

I'd prefer to use assembly source file directly instead of inline assembly or intrinsics which is not portable (e.g. x86_64 inline assembly is not available in msvc), stable (depends on compiler behavior) or optimal (e.g. adc intrinsics are not well-optimized). We can create a folder called "a" and put assembly code into different folders distinguished by machine types.

In c sources, we can use ifdefs to tell whether a function is implemented in asm or c.

yinwang0 commented 8 years ago

@hyln9 Indeed inline assembly is not as good as separate assembly files. My implementation can be considered a proof of concept that ADC can run faster. GCC's inline assembly is also quirky to get right, and is thus just for small snippets. It's better to use a portable assembly language line NASM to write those.

NalaGinrut commented 8 years ago

@hyln9 I thought we may choose AVX for bignum, but seems it's more inconvenient than ADC. And libgmp proves ADC is enough. My bad English. I agree with you to create folder for assembly code. But maybe we could find better way to optimize the current prove of concept first. It is not as fast as expected.

yinwang0 commented 8 years ago

@NalaGinrut You are right. My code is still not as fast as GMP's although they look very similar, so you may want to play with it and see if there is easy way to improve. I also unrolled it 4 and 8 times, but it didn't improve the performance. I don't know yet what other tricks are in GMP for addition.

hyln9 commented 8 years ago

@yinwang0 After a quick look at number.c, I found that maybe the copy operation of copy_normalize is not optimized.

yinwang0 commented 8 years ago

Tried to use ROL and ROR for switching endians, but unfortunately they change the carry flag CF. Keep in mind the trick with ADC lies in that you can't use instructions that can change the CF flag between the ADCs.

So it seems we really need to store the bignum in different order with 64-bit machines.

hyln9 commented 8 years ago

@akeep After some investigations into Ikarus, I think the DCE issue is actually a loop-invariant hoisting one.

yinwang0 commented 8 years ago

Actually I don't see where this code is related to DCE.

hyln9 commented 8 years ago

@yinwang0 It has no return values or side effects.

yinwang0 commented 8 years ago

@hyln9 Are we talking about the same piece of code?

(define fib-it
  (lambda (n)
    (define iter
      (lambda (a b n)
        (cond
         [(= n 0) b]
         [else
          (iter (+ a b) a (- n 1))])))
    (iter 1 0 n)))

(time (display (< 1 (fib-it 1000000))))

I don't see where the dead code is.

hyln9 commented 8 years ago

@yinwang0 @akeep Sorry for the confusion I made. I meant this: https://github.com/cisco/ChezScheme/issues/14#issuecomment-215851686

dybvig commented 8 years ago

I'm going to take the opportunity in the lull between posts to close this issue and suggest that further discussion either be held privately amongst the thread contributors or on the mailing list.

cisco / ChezScheme

Poor performance when dealing with multi-precision numbers #14