JuliaMath / Libm.jl

A pure Julia math library

On a strategy for porting SLEEF #10

Closed oxinabox closed 8 years ago

oxinabox commented 8 years ago

So one of the options would be to port SLEEF, by Naoki Shibata (repo) and Hal Finkel (repo).

ViralBShah said:

My preference is that we should seriously look into porting something like sleef. The approach there should yield code that lends itself to auto-vectorization.

I dug through the SLEEF code on the weekend, thinking about this.

SLEEF has two implementations, each in a folder in its repo. purec which does not feature SIMD magic, and simd which does. In general they are very similar in implementation, it just comes down to whether the lowest level operations use SIMD (AFAICTICBWT). In both case, the library is Branch-Free, and has no constant tables -- which is very different from OpenLibm.

So really we could implement port the simd, or the purec implementation. Or if I was correct and it is just the low level operation we can have both by just swapping out one set of definitions for the other -- but it is probably not that simple as otherwise it would be done in Naoki Shibata's C code/makefiles.

If we trusted LLVM's autovectorization was the go, then porting the purec would be the go. Related on that note: https://groups.google.com/d/msg/llvm-dev/rvLlViuu2Aw/qotzAVlQDQAJ I'm not sure how closely related, in general sounds like SIMD is sensitive.

So thinking mostly about porting the simd implementation. I suggest that the various SIMD helperfiles (helperavx.h, helperavx2.h, helperfma4.h, helperneon.h, helperqpx.h, helpersse2.h) be kept as is in C for now, with a julia wrapper. They deal with all the different processor specific intrinsics for targeting different architectures, and that is a lot easier with Makefiles and C processor stuff than in Julia. And further to that, it seems a lot easier, to ccall the helpers, than to write ccalls for every different archecture's SIMD intrinsics ourselves. Later, perhaps they can can be reimplemented with something like SIMD.jl, SIMDVectors.jl, or llvm_calls or with enhancement's to julia's Base/codegen.
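The Julia-wrapper-over-C-helpers idea could look like the following sketch. The real SLEEF binding would use something like `(:xsin, "libsleef")` -- those names are assumptions here, so the system libm's `sin` (already loaded into the Julia process) stands in as a runnable example.

```julia
# Sketch of the proposed wrapping approach: keep the C side as-is and expose
# it through ccall. `:xsin` / "libsleef" are assumed names for an actual
# SLEEF binding; the system libm's `sin` stands in so this runs as-is.
c_sin(x::Float64) = ccall(:sin, Float64, (Float64,), x)

c_sin(1.0)  # matches Base.sin(1.0) to within rounding
```

The same pattern extends to the SIMD entry points once the vector ABI (e.g. `NTuple{4,VecElement{Float64}}`) is pinned down.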

And thus the focus would be in porting: the main C files, for Float64: sleefsimddp.c and for Float32: sleefsimdsp.c Now because this actually fairly simple C code, due to it being branch free -- it is just a sequence of function calls. Its a good candidate for writing some code to assist translation (#5). (GCC -e -d $ARCH to is probably worth running)

On the otherhand we could just wait, and see if it is put into LLVM.. If it is then it will play nicer with LLVM's constant folding and other vectorisation infrastructure. Then it is just a matter of llvm_calls like in #8

See also:

simonbyrne commented 8 years ago

I haven't looked at SLEEF sufficiently closely, but my experience with vectorised math libraries is that they tend to sacrifice some accuracy for performance, for example the log function in openlibm is guaranteed to be accurate to < 1 ulp (in practice around 0.85 ulps) whereas the Apple Accelerate library seems to have errors up to about 1.2 ulps or so.

Things can be even more problematic in the extremes, e.g. for trig functions: to compute sin and cos for very large values (i.e. > 1e20 or so) you need to use a Payne-Hanek range reduction, which involves a very large table lookup (which is basically unvectorizable). Of course, you very rarely need to compute such quantities (and if you do, the results are probably not what you want since the epsilon is larger than the period of the function).

One option would be to only expose such reduced-accuracy vectorised ops via @fastmath or @simd macros.

ViralBShah commented 8 years ago

The benefit of having julia implementations is that we can have nice interfaces to trade off accuracy for speed, and doing so transparently in user code.
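One shape such an interface could take (all names hypothetical): let the caller state a ulp budget and dispatch on it, the way SLEEF pairs each function with a tighter `_u1` variant.

```julia
# Hypothetical accuracy/speed trade-off interface: the caller declares a ulp
# budget and the library picks an implementation. Both branches below are
# placeholders, standing in for e.g. a SLEEF sin vs. sin_u1 pair.
function accurate_sin(x::Float64; maxulp::Float64 = 1.0)
    maxulp <= 1.0 ? sin(x) :  # tight budget: the < 1 ulp path
                    sin(x)    # loose budget: a faster, looser kernel
end

accurate_sin(0.3; maxulp = 2.0)
```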

oxinabox commented 8 years ago

> I haven't looked at SLEEF sufficiently closely, but my experience with vectorised math libraries is that they tend to sacrifice some accuracy for performance. For example, the log function in openlibm is guaranteed to be accurate to < 1 ulp (in practice around 0.85 ulps), whereas the Apple Accelerate library seems to have errors up to about 1.2 ulps or so.

Related to that, one of the things that really concerns me about SLEEF is that it is not documented (at all) -- in particular, neither the sources of its algorithms nor their accuracy are documented. Whereas OpenLibm (and its sources) are meticulously documented with their accuracies and references to the papers/textbooks where the algorithms were sourced.

I know if you run the test code it will spit out some accuracy information in ulps.


I ran that; results are here. The short version is that everything is within 1 ulp, except for sin, cos, sincos, tan, asin, acos, atan, atan2, log, and cbrt (I think).

And for these there are sin_u1, cos_u1, sincos_u1, tan_u1, asin_u1, acos_u1, atan_u1, atan2_u1, log_u1, and cbrt_u1, which do come within 1 ulp.

The ranges it is tested on are pretty solid, I think. They vary depending on the operation, but ranges like [-10:0.0002:10; -10_000_000:200.1:10_000_000; 1:10_000] appear throughout the test files used for finding ulp accuracies.

musm commented 8 years ago

Hi @oxinabox, good ideas! The ccall idea should be a good way to test the waters, so we can benchmark the benefits and potential downsides.

oxinabox commented 8 years ago

I now have SLEEF tied in via ccalls: https://github.com/oxinabox/SLEEF.jl (not 100% sure it is correct, but it seems to be working and passing tests). Once that has some benchmarking, we can benchmark a (your) Julia implementation against it, and against OpenLibm.
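A rough shape for that benchmark, using only naive `@elapsed` timing (BenchmarkTools.jl would give more reliable numbers); `sin` below stands in for the wrapped SLEEF call:

```julia
xs = range(-10.0, 10.0; length = 1_000_000)

# Sum f over xs so the calls can't be optimised away.
function total(f, xs)
    s = 0.0
    @inbounds for x in xs
        s += f(x)
    end
    s
end

total(sin, xs)               # warm up (compile)
t = @elapsed total(sin, xs)  # swap in the SLEEF ccall wrapper to compare
```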

musm commented 8 years ago

Nice. Do you plan on exposing ccalls into the intrinsics for the vectorized stuff, especially for the functions that use double2?

oxinabox commented 8 years ago

> Nice. Do you plan on exposing ccalls into the intrinsics for the vectorized stuff, especially for the functions that use double2?

Not right now for sure, and probably not for a few weeks -- I should really get my actual research/work done. Also, if I did, I would probably do so in a Libm branch rather than in SLEEF.jl.

If you need them though, I can take a crack at it (probably not for a few days), or talk you through what little I've worked out so far of the interactions between SIMD and ccall (others certainly know more). Hit me up on the gitter.

musm commented 8 years ago

Sounds good, no rush on my end for the vectorized version yet!

musm commented 8 years ago

Actually @oxinabox, all the double (Float64) functions in Julia are benefiting from LLVM's autovectorization, so it's unlikely we need to perform any manual SIMD vectorization, which is great.
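That claim is easy to spot-check: dump the LLVM IR for a simple loop and look for vector lane types like `<4 x double>` (whether they appear depends on the host CPU target).

```julia
using InteractiveUtils  # for code_llvm

# Inspect generated IR for a @simd loop; vectorized code shows lane types
# such as "<4 x double>". Presence depends on the target.
function axpy!(y::Vector{Float64}, a::Float64, x::Vector{Float64})
    @inbounds @simd for i in eachindex(x, y)
        y[i] += a * x[i]
    end
    y
end

ir = sprint(code_llvm, axpy!, (Vector{Float64}, Float64, Vector{Float64}))
occursin("x double", ir)
```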