openlibm performance on ARM server is very poor

jmather-sesi commented 1 year ago

I see very poor results on a modern ARM server. Some openlibm implementations are up to 48.68x slower than their system libm counterparts (hypot in the table below). This was a checkout of the main branch at 12f5ffc. This is also related to #234, but the performance difference seems to be even more dramatic.

For interest and potential usefulness to #203, I also compared it against an optimized build of musl 1.2.4:

bench-syslibm              | bench-openlibm             | bench-musl
  pow     :   78.6387 MPS  |   pow     :   17.3955 MPS  |   pow     :   57.9493 MPS
  hypot   :  232.7852 MPS  |   hypot   :    4.7823 MPS  |   hypot   :  139.4793 MPS
  exp     :  317.8124 MPS  |   exp     :  119.9932 MPS  |   exp     :  215.2262 MPS
  log     :  228.3188 MPS  |   log     :   97.0294 MPS  |   log     :  181.7701 MPS
  log10   :  118.6787 MPS  |   log10   :   73.0237 MPS  |   log10   :   76.8402 MPS
  sin     :  133.0101 MPS  |   sin     :  135.6112 MPS  |   sin     :  165.5926 MPS
  cos     :  144.4003 MPS  |   cos     :  127.8527 MPS  |   cos     :  150.5435 MPS
  tan     :  105.8875 MPS  |   tan     :   68.8512 MPS  |   tan     :   78.9428 MPS
  asin    :  178.2302 MPS  |   asin    :    9.6621 MPS  |   asin    :   88.3722 MPS
  acos    :  154.1304 MPS  |   acos    :    9.9192 MPS  |   acos    :   98.5818 MPS
  atan    :  190.8853 MPS  |   atan    :   91.6229 MPS  |   atan    :   97.0451 MPS
  atan2   :   56.6821 MPS  |   atan2   :   42.4876 MPS  |   atan2   :   47.6644 MPS
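
For reference, the slowdown factors implied by the table can be recomputed directly from the MPS columns. The following is only a sketch: the struct, the function names and the choice of rows are illustrative and not part of openlibm or its benchmark, and the numbers are copied from the table above.

```c
#include <stddef.h>
#include <stdio.h>

/* One row of the table: throughput of the system libm and of openlibm. */
struct bench_row {
    const char *name;
    double syslibm_mps;
    double openlibm_mps;
};

/* A few rows copied from the table above (the worst offenders). */
static const struct bench_row rows[] = {
    { "pow",    78.6387, 17.3955 },
    { "hypot", 232.7852,  4.7823 },
    { "asin",  178.2302,  9.6621 },
    { "acos",  154.1304,  9.9192 },
};

/* How many times slower the candidate is than the reference. */
double slowdown(double reference_mps, double candidate_mps)
{
    return reference_mps / candidate_mps;
}

/* Print one "name factor" line per row. */
void print_slowdowns(void)
{
    for (size_t i = 0; i < sizeof rows / sizeof rows[0]; i++)
        printf("%-6s %6.2fx\n", rows[i].name,
               slowdown(rows[i].syslibm_mps, rows[i].openlibm_mps));
}
```

The hypot row is the one that yields the 48.68x figure quoted at the top of this comment.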

GNU libc version: 2.35 GNU libc release: stable

The openlibm compilation line looks like:

cc   -fno-gnu89-inline -fno-builtin -O3 -fPIC  -std=c99 -Wall -I/home/user/openlibm -I/home/user/openlibm/include -I/home/user/openlibm/aarch64 -I/home/user/openlibm/src -DASSEMBLER -D__BSD_VISIBLE -Wno-implicit-function-declaration -I/home/user/openlibm/ld128 -c src/e_j0.c -o src/e_j0.c.o

I have tried compiling openlibm with just bare make, and also specifying the architecture directly with make ARCH=aarch64 to identical results.

Is there something we can do about this?

kargl commented 1 year ago

Some important information is missing. What operating system? What compiler/toolchain? What happens if -fno-gnu89-inline is removed from the command line? If you're using gcc, what happens if you use -march=native -mtune=native?

Speed isn't everything. Have you checked accuracy?

jmather-sesi commented 1 year ago

Ubuntu 22.04, gcc 11.4.0, running on a Neoverse V1 server. Removing -fno-gnu89-inline made no difference, and specifying -mtune and -march also had no effect. I have run the test suite and everything passed. I also went into the test makefile and re-enabled building of test-float-system and test-double-system, and the following was produced:

$ ./test-double-system
testing double (without inline functions)
Failure: Test: cbrt (-27.0) == -3.0
Result:
 is:         -3.00000000000000044409e+00  -0x1.80000000000010000000p+1
 should be:  -3.00000000000000000000e+00  -0x1.80000000000000000000p+1
 difference:  4.44089209850062616169e-16   0x1.00000000000000000000p-51
 ulp       :  1.0000
 max.ulp   :  0.0000
Failure: Test: cbrt (0.970299) == 0.99
Result:
 is:          9.90000000000000102141e-01   0x1.fae147ae147af0000000p-1
 should be:   9.89999999999999991118e-01   0x1.fae147ae147ae0000000p-1
 difference:  1.11022302462515654042e-16   0x1.00000000000000000000p-53
 ulp       :  1.0000
 max.ulp   :  0.0000
Failure: Test: y0 (1.5) == 0.38244892379775884396
Result:
 is:          3.82448923797758966181e-01   0x1.87a0b0d06836a0000000p-2
 should be:   3.82448923797758855159e-01   0x1.87a0b0d0683680000000p-2
 difference:  1.11022302462515654042e-16   0x1.00000000000000000000p-53
 ulp       :  2.0000
 max.ulp   :  1.0000
Failure: Test: yn (0, 1.5) == 0.38244892379775884396
Result:
 is:          3.82448923797758966181e-01   0x1.87a0b0d06836a0000000p-2
 should be:   3.82448923797758855159e-01   0x1.87a0b0d0683680000000p-2
 difference:  1.11022302462515654042e-16   0x1.00000000000000000000p-53
 ulp       :  2.0000
 max.ulp   :  1.0000

Test suite completed:
  1118 test cases plus 932 tests for exception flags executed.
  4 errors occurred.

$ ./test-float-system
testing float (without inline functions)
Failure: Test: log10 (0.7) == -0.15490195998574316929
Result:
 is:         -1.54901981353759765625e-01  -0x1.3d3d4000000000000000p-3
 should be:  -1.54901966452598571777e-01  -0x1.3d3d3e00000000000000p-3
 difference:  1.49011611938476562500e-08   0x1.00000000000000000000p-26
 ulp       :  1.0000
 max.ulp   :  0.0000
Failure: Test: tgamma (4) == 6
Result:
 is:          6.00000047683715820312e+00   0x1.80000200000000000000p+2
 should be:   6.00000000000000000000e+00   0x1.80000000000000000000p+2
 difference:  4.76837158203125000000e-07   0x1.00000000000000000000p-21
 ulp       :  1.0000
 max.ulp   :  0.0000

Test suite completed:
  1101 test cases plus 923 tests for exception flags executed.
  2 errors occurred.
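
To make the "ulp" figures above easier to check by hand, here is a minimal sketch that counts how many representable doubles separate two values. It assumes IEEE 754 binary64 doubles; the function names are hypothetical, and the test harness itself may compute its ulp figure differently (for instance from the difference divided by the ulp of the expected value).

```c
#include <stdint.h>
#include <string.h>

/* Map a double onto an integer line that increases with the value.
   IEEE 754 doubles are sign-magnitude, so negative values are mirrored. */
static int64_t ordered_bits(double x)
{
    int64_t i;
    memcpy(&i, &x, sizeof i);          /* reinterpret the bits safely */
    return i < 0 ? INT64_MIN - i : i;  /* -0.0 and +0.0 both map to 0 */
}

/* Number of representable steps between a and b.
   Not meaningful if either argument is NaN. */
uint64_t ulp_distance(double a, double b)
{
    int64_t ia = ordered_bits(a);
    int64_t ib = ordered_bits(b);
    /* Subtract in unsigned arithmetic so extreme inputs cannot overflow. */
    return ia > ib ? (uint64_t)ia - (uint64_t)ib
                   : (uint64_t)ib - (uint64_t)ia;
}
```

Fed the hexadecimal values printed above, this gives a distance of 1 for the cbrt (-27.0) case and 2 for the y0 (1.5) case, matching the reported ulp lines.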

I was however able to get slightly better results when compiling with clang 14.0:

openlibm-gcc               | openlibm-clang
  pow     :   17.3955 MPS  |   pow     :   17.0689 MPS
  hypot   :    4.7823 MPS  |   hypot   :    5.4384 MPS
  exp     :  119.9932 MPS  |   exp     :  123.1021 MPS
  log     :   97.0294 MPS  |   log     :  110.2028 MPS
  log10   :   73.0237 MPS  |   log10   :   81.5303 MPS
  sin     :  135.6112 MPS  |   sin     :  149.5867 MPS
  cos     :  127.8527 MPS  |   cos     :  138.9140 MPS
  tan     :   68.8512 MPS  |   tan     :   86.8775 MPS
  asin    :    9.6621 MPS  |   asin    :   12.8771 MPS
  acos    :    9.9192 MPS  |   acos    :   13.0346 MPS
  atan    :   91.6229 MPS  |   atan    :  101.9421 MPS
  atan2   :   42.4876 MPS  |   atan2   :   47.1442 MPS
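
For anyone who wants to cross-check numbers like these without the bundled benchmark, a throughput loop can be sketched as below. This is not the openlibm benchmark code: it assumes MPS means millions of calls per second, measures CPU time with the standard clock() function, and uses a hypothetical stand-in kernel so that the sketch does not depend on any particular libm.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for a libm function; swap in sin, exp, etc. */
double sample_kernel(double x)
{
    return x * x + 1.0;
}

/* Time n calls of f and return throughput in millions of calls per second.
   Returns 0.0 if the run was too short for clock() to measure. */
double bench_mps(double (*f)(double), size_t n)
{
    volatile double sink = 0.0;  /* keeps the calls from being optimized away */
    double x = 0.5;

    clock_t start = clock();
    for (size_t i = 0; i < n; i++) {
        sink = sink + f(x);
        x += 1e-9;               /* vary the argument slightly */
    }
    clock_t stop = clock();

    double secs = (double)(stop - start) / CLOCKS_PER_SEC;
    if (secs <= 0.0)
        return 0.0;
    return (double)n / secs / 1e6;
}
```

Calling bench_mps(sin, n) once per library build, with n large enough to run for a noticeable fraction of a second, should give figures that are at least comparable in spirit to the tables in this thread.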

If you would like me to run further tests, I would be more than happy to do so.

Thanks!

zimmermann6 commented 1 year ago

> Is there something we can do about this?

Yes: use the CORE-MATH code, which has efficiency comparable to GNU libc (see https://core-math.gitlabpages.inria.fr/64.pdf) and delivers correct rounding.

kargl commented 1 year ago

Unfortunately, I cannot help at the source code level as I do not have access to an ARM server. There are newer versions of GCC, and the release notes show that new ARM processors have been added and that code generation has changed since 11.4.0 was released. You may want to try an updated GCC.

In your results, I would look at exp and log to see where the time is spent.

jmather-sesi commented 1 year ago

@zimmermann6 Unfortunately core-math does not compile cleanly on ARM, as it uses x86 intrinsics. We may be able to get around this by using a library to convert the calls into their corresponding NEON instructions, but I would worry that there could be some subtle differences that would throw off the results. Some tests also fail on ARM, for example:

$ ./check.sh pow
Running worst cases check in --rndn mode...
FAIL x=-0x1p-1 y=0x1p+64 ref=0x0p+0 z=-0x0p+0
Running worst cases check in --rndz mode...
FAIL x=-0x1p-1 y=0x1p+64 ref=0x0p+0 z=-0x0p+0
Running worst cases check in --rndu mode...
FAIL x=-0x1p-1 y=0x1p+64 ref=0x0.0000000000001p-1022 z=-0x0p+0
Running worst cases check in --rndd mode...
FAIL x=-0x1p-1 y=0x1p+64 ref=0x0p+0 z=-0x0.0000000000001p-1022

Building with GCC 12.3.0 (which is the latest version in my package manager), the performance is either the same as it was in 11.4.0, or even slightly slower.

zimmermann6 commented 1 year ago

Hi @jmather-sesi I can reproduce on cfarm117, I will investigate.

zimmermann6 commented 12 months ago

This issue is fixed. For the record, it was due to a different conversion from the double value 0x1p64 to int64_t. I suggest we follow up on core-math issues on the core-math mailing list.
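
As background on why such a conversion can differ between machines: 0x1p64 does not fit in int64_t, and in C converting an out-of-range floating-point value to an integer type is undefined behavior. In practice the x86-64 convert instruction returns 0x8000000000000000 (INT64_MIN, an even number) while the AArch64 one saturates to INT64_MAX (an odd number), which would be consistent with the sign flips in the pow failures above if the converted value was used to decide whether y is odd. The sketch below is not the actual core-math fix, and the function name is hypothetical; it only shows one way to test for an odd integer without ever performing an out-of-range conversion.

```c
#include <stdint.h>

/* Return 1 if y is an odd integer, 0 otherwise (including NaN and infinity).
   The cast to int64_t only happens when |y| < 2^53, so it is always in range. */
int is_odd_integer(double y)
{
    double a = y < 0 ? -y : y;

    /* Every finite double with |y| >= 2^53 is an even integer.
       Written as !(a < ...) so that NaN also lands here. */
    if (!(a < 0x1p53))
        return 0;

    int64_t n = (int64_t)a;      /* in range: 0 <= a < 2^53 */
    if ((double)n != a)
        return 0;                /* y has a fractional part */
    return (int)(n & 1);
}
```

With this check, y = 0x1p64 is classified as even on any platform, so pow(-0x1p-1, 0x1p+64) keeps a positive sign.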

JuliaMath/openlibm, issue #282