Open jmather-sesi opened 1 year ago
Some important information is missing. What operating system? What compiler/toolchain? What happens if -fno-gnu89-inline is removed from the command line? If you're using gcc, what happens if you use -march=native -mtune=native?
Speed isn't everything. Have you checked accuracy?
Ubuntu 22.04, gcc 11.4.0, running on a Neoverse V1 server. Removing -fno-gnu89-inline made no difference, and specifying -mtune and -march also had no effect. I have run the test suite and everything passed. I also went into the test makefile and re-enabled building of test-float-system and test-double-system, which produced the following:
$ ./test-double-system
testing double (without inline functions)
Failure: Test: cbrt (-27.0) == -3.0
Result:
is: -3.00000000000000044409e+00 -0x1.80000000000010000000p+1
should be: -3.00000000000000000000e+00 -0x1.80000000000000000000p+1
difference: 4.44089209850062616169e-16 0x1.00000000000000000000p-51
ulp : 1.0000
max.ulp : 0.0000
Failure: Test: cbrt (0.970299) == 0.99
Result:
is: 9.90000000000000102141e-01 0x1.fae147ae147af0000000p-1
should be: 9.89999999999999991118e-01 0x1.fae147ae147ae0000000p-1
difference: 1.11022302462515654042e-16 0x1.00000000000000000000p-53
ulp : 1.0000
max.ulp : 0.0000
Failure: Test: y0 (1.5) == 0.38244892379775884396
Result:
is: 3.82448923797758966181e-01 0x1.87a0b0d06836a0000000p-2
should be: 3.82448923797758855159e-01 0x1.87a0b0d0683680000000p-2
difference: 1.11022302462515654042e-16 0x1.00000000000000000000p-53
ulp : 2.0000
max.ulp : 1.0000
Failure: Test: yn (0, 1.5) == 0.38244892379775884396
Result:
is: 3.82448923797758966181e-01 0x1.87a0b0d06836a0000000p-2
should be: 3.82448923797758855159e-01 0x1.87a0b0d0683680000000p-2
difference: 1.11022302462515654042e-16 0x1.00000000000000000000p-53
ulp : 2.0000
max.ulp : 1.0000
Test suite completed:
1118 test cases plus 932 tests for exception flags executed.
4 errors occurred.
$ ./test-float-system
testing float (without inline functions)
Failure: Test: log10 (0.7) == -0.15490195998574316929
Result:
is: -1.54901981353759765625e-01 -0x1.3d3d4000000000000000p-3
should be: -1.54901966452598571777e-01 -0x1.3d3d3e00000000000000p-3
difference: 1.49011611938476562500e-08 0x1.00000000000000000000p-26
ulp : 1.0000
max.ulp : 0.0000
Failure: Test: tgamma (4) == 6
Result:
is: 6.00000047683715820312e+00 0x1.80000200000000000000p+2
should be: 6.00000000000000000000e+00 0x1.80000000000000000000p+2
difference: 4.76837158203125000000e-07 0x1.00000000000000000000p-21
ulp : 1.0000
max.ulp : 0.0000
Test suite completed:
1101 test cases plus 923 tests for exception flags executed.
2 errors occurred.
I was, however, able to get slightly better results when compiling with clang 14.0:
openlibm-gcc | openlibm-clang
pow : 17.3955 MPS | pow : 17.0689 MPS
hypot : 4.7823 MPS | hypot : 5.4384 MPS
exp : 119.9932 MPS | exp : 123.1021 MPS
log : 97.0294 MPS | log : 110.2028 MPS
log10 : 73.0237 MPS | log10 : 81.5303 MPS
sin : 135.6112 MPS | sin : 149.5867 MPS
cos : 127.8527 MPS | cos : 138.9140 MPS
tan : 68.8512 MPS | tan : 86.8775 MPS
asin : 9.6621 MPS | asin : 12.8771 MPS
acos : 9.9192 MPS | acos : 13.0346 MPS
atan : 91.6229 MPS | atan : 101.9421 MPS
atan2 : 42.4876 MPS | atan2 : 47.1442 MPS
If you would like me to run further tests, I would be more than happy to do so.
Thanks!
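For readers interpreting the ulp figures in the test output above: an error of 1 ulp roughly means the computed result is the representable value adjacent to the expected one. Below is a minimal sketch of that measurement for doubles. The names `ordered_bits` and `ulp_distance` are hypothetical helpers, not part of openlibm or its test suite, and the sketch assumes finite inputs that are close enough together for the integer subtraction not to overflow.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: map a double's bit pattern onto a signed integer
 * line that increases monotonically with the value, so that adjacent
 * doubles map to adjacent integers (+0.0 and -0.0 both map to 0). */
static int64_t ordered_bits(double x)
{
    int64_t i;
    memcpy(&i, &x, sizeof i);          /* well-defined way to read the bits */
    return i < 0 ? INT64_MIN - i : i;  /* mirror the negative half */
}

/* Hypothetical helper: number of representable doubles separating a and b.
 * Assumes finite inputs that are not extremely far apart. */
static int64_t ulp_distance(double a, double b)
{
    int64_t d = ordered_bits(a) - ordered_bits(b);
    return d < 0 ? -d : d;
}
```

Applied to the hex values in the log, `ulp_distance(-0x1.8000000000001p+1, -3.0)` gives 1 for the cbrt (-27.0) case, and `ulp_distance(0x1.87a0b0d06836ap-2, 0x1.87a0b0d068368p-2)` gives 2 for the y0 (1.5) case, matching the reported ulp values.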
Is there something we can do about this?
yes: use the CORE-MATH code, which has efficiency comparable to GNU libc (see https://core-math.gitlabpages.inria.fr/64.pdf) and delivers correct rounding.
Unfortunately, I cannot help at the source code level as I do not have access to an ARM server. There are newer versions of GCC, and the release notes show that new ARM processors have been added, along with changes to code generation, since 11.4.0 was released. You may want to try an updated GCC.
In your results, I would look at exp and log to see where the time is spent.
@zimmermann6 Unfortunately core-math compiles with errors on arm as it uses x86 intrinsics. We may be able to get around this by using a library to convert the calls into their corresponding NEON instructions, but I would worry that there could be some subtle differences that would throw off the results. Some tests also fail on arm, for example:
$ ./check.sh pow
Running worst cases check in --rndn mode...
FAIL x=-0x1p-1 y=0x1p+64 ref=0x0p+0 z=-0x0p+0
Running worst cases check in --rndz mode...
FAIL x=-0x1p-1 y=0x1p+64 ref=0x0p+0 z=-0x0p+0
Running worst cases check in --rndu mode...
FAIL x=-0x1p-1 y=0x1p+64 ref=0x0.0000000000001p-1022 z=-0x0p+0
Running worst cases check in --rndd mode...
FAIL x=-0x1p-1 y=0x1p+64 ref=0x0p+0 z=-0x0.0000000000001p-1022
Building with GCC 12.3.0 (which is the latest version in my package manager), the performance is either the same as it was in 11.4.0, or even slightly slower.
Hi @jmather-sesi I can reproduce on cfarm117, I will investigate.
This issue is fixed. For the record, it was due to a different conversion from the double value 0x1p64 to int64_t. I suggest we follow up on core-math issues on the core-math mailing list.
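To illustrate the kind of conversion problem described here: 0x1p64 is outside the range of int64_t, so in C the cast has undefined behavior, and in practice the hardware conversion instructions on x86-64 and AArch64 typically return different values for it (INT64_MIN versus a saturated INT64_MAX). The following is a minimal sketch of a conversion whose out-of-range behavior is spelled out explicitly. `to_int64_saturating` is a hypothetical helper for illustration only, and is not the actual fix that was applied to CORE-MATH.

```c
#include <stdint.h>

/* Hypothetical helper: convert a double to int64_t with the same result
 * on every platform, instead of relying on an out-of-range cast. */
static int64_t to_int64_saturating(double x)
{
    if (x != x)       return 0;          /* NaN: choose a fixed result */
    if (x >= 0x1p63)  return INT64_MAX;  /* too large, covers 0x1p64 */
    if (x <= -0x1p63) return INT64_MIN;  /* -2^63 itself or anything below */
    return (int64_t)x;                   /* in range: truncates toward zero */
}
```

With this helper, `to_int64_saturating(0x1p64)` returns INT64_MAX regardless of the target architecture. Whether saturation is the right policy depends on the caller; the point is only that the choice is made in the source rather than left to the hardware.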
I see very poor results on a modern ARM server. Some openlibm implementations are up to 48.68x slower than their libm counterparts. This was a checkout of the main branch at 12f5ffc. This is also related to #234, but the performance difference seems to be even more dramatic.
For interest and potential usefulness to #203, I also compared it against an optimized build of musl 1.2.4:
GNU libc version: 2.35 GNU libc release: stable
The openlibm compilation line looks like:
I have tried compiling openlibm with just bare make, and also specifying the architecture directly with make ARCH=aarch64, to identical results.
Is there something we can do about this?