If you can give me a simple way of doing benchmarks, I can compare openlibm and system libm on OS X 10.6, to see whether the performance improvement was already present on that version. But it's not my machine, so I can't install development tools.
I also found the Apple log to be faster in tests I had done a while back when I was putting openlibm together.
Have you compared with Intel's log in MKL? They may also have fast vectorized versions. Perhaps @ArchRobison knows something about this.
@nalimilan I have a very, very rough test here: https://github.com/simonbyrne/libm.jl/blob/master/test/perf/log.jl But we probably should use something more rigorous (e.g. testing different ranges).
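A slightly more systematic harness might look like this sketch (the names and the set of ranges here are illustrative, not the linked script, and the `"libm"` library name may need adjusting per platform, e.g. `"libm.so.6"` on some Linux systems):

```julia
# Time several log implementations over several input ranges, accumulating
# the results into a sum so the calls are not optimized away.
function bench_log(impls, ranges; n = 10^6)
    for (lo, hi) in ranges
        x = lo .+ (hi - lo) .* rand(n)
        for (name, f) in impls
            s = 0.0
            t = @elapsed for xi in x
                s += f(xi)
            end
            println(rpad(name, 10), "[", lo, ", ", hi, "): ", t, " s")
        end
    end
end

bench_log(Dict("julia"  => log,
               "system" => x -> ccall((:log, "libm"), Float64, (Float64,), x)),
          [(0.5, 1.0), (1.0, 2.0), (2.0, 1e10)])
```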
@ViralBShah I don't have access to MKL, so have not compared it.
@simonbyrne It is intalled on julia.mit.edu, if you have an account there. If you would like it, I can create one for you - just drop me an email. Pretty beefy machine for julia development - 80 cores / 1TB RAM.
Here's the timings using the above test script on julia.mit.edu:
Openlibm:
elapsed time: 0.366460477 seconds (13824 bytes allocated)
System:
elapsed time: 0.752183874 seconds (96 bytes allocated)
Intel math:
elapsed time: 0.597063228 seconds (96 bytes allocated)
Julia fdlibm:
elapsed time: 0.368376388 seconds (96 bytes allocated)
Julia tang:
elapsed time: 0.284002105 seconds (96 bytes allocated)
(here Intel math is the ICC libimf.so, not the MKL routines: I'm still figuring those out).
So that's a 25% boost over openlibm (much more than on my machine).
@ViralBShah I haven't had any luck figuring out how to call the MKL vector math routines from within julia: any tips?
Sorry, none of those are hooked up. It has been a while since I looked at the Intel manual. IMF has the libm stuff, which you can get at with USE_INTEL_LIBM.
Have a look here: https://github.com/simonster/VML.jl
Perhaps worth also looking at Yeppp.
Thanks, I'll try out the package. I had a look at Yeppp's code: their log function is pretty crazy. As they outline in this talk, they intentionally avoid table lookups and divisions, which means they need a 20-term polynomial expansion. This is pretty much the opposite of Tang's approach, which uses the table as a way to get a shorter polynomial.
That implementation is crazy! I wonder if we can leverage Yeppp or integrate parts of it into openlibm. The build process is completely crazy too.
@simonbyrne On Mac OS X 10.6 with Julia 0.2.1, I get these results (with an old 2GHz Core 2 Duo). Unfortunately, I could not test your implementation, as cloning the package fails.
julia> @testsum "System:" syslog X
@testsum "Openlibm:" log X
Warning: Possible conflict in library symbol log
System:
elapsed time: 0.463031635 seconds (64 bytes allocated)
Openlibm:
elapsed time: 0.455063121 seconds (64 bytes allocated)
Is the warning expected or does it indicate that the same library was used for both calls? If not, it would seem to indicate that Apple's libm improved since 10.6, or that it got better than openlibm on recent CPUs.
@nalimilan Thanks for looking at it. I'm not sure what the warning is about, but it could be a mix-up between Base.log and libm.log (from my package: I really should rename it...).
Perhaps the reason they refuse to release the source is that they've licensed a proprietary library?
It turns out that julia.mit.edu doesn't support AVX. @simonster Any ideas on which library I should link to?
@nalimilan That warning is about a symbol imported from a shared library; i.e. it's warning that a log function is already defined in the Julia process, and you're loading a library that redefines that function. I don't remember what we changed in the last year, but the warning might mean that the same log is being called in both cases.
Actually, the warning only appears after calling log once, which I think loads openlibm. But the timings are the same before and after that, so the benchmark is probably fine. (Libm.jl isn't involved, since I could not even install it on 0.2.1.)
It looks as though Apple have changed their algorithm, as they get more accurate results than a standard implementation of Tang's algorithm would provide. On 10 million points:
Algorithm | Max relative error (units in the last place)
--- | ---
Openlibm | 0.863
OS X 10.9 libm | 0.508
Tang's algorithm | 0.555 (theoretical bound: 0.566)
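A check along these lines can be made against a BigFloat reference; this is a sketch of the measurement, not necessarily the script used for the table above:

```julia
# Maximum error, in units in the last place, of `f` relative to a
# high-precision log computed with BigFloat.
function max_ulp_error(f, xs)
    worst = 0.0
    for x in xs
        exact = log(big(x))
        err = Float64(abs(big(f(x)) - exact) / eps(Float64(exact)))
        worst = max(worst, err)
    end
    return worst
end

# e.g. sample 10^5 points spread across many binades:
max_ulp_error(log, 2.0 .^ (60 .* rand(10^5) .- 30))
```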
I have an application that does a lot of raw number crunching. Redefining the basic special functions log, exp, ... to use the system libm as per @simonbyrne's example immediately cut down the runtime by ~20%. So I would be very much interested in (1) either a faster implementation of these basic functions, or (2) a way to tell Julia to use the system libm automatically.
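For reference, the redefinition mentioned above can be a one-line ccall into the system library (a sketch; `syslog`/`sysexp` are just illustrative names, and the `"libm"` spelling may differ by platform):

```julia
# Route log/exp through the system libm rather than openlibm.
syslog(x::Float64) = ccall((:log, "libm"), Float64, (Float64,), x)
sysexp(x::Float64) = ccall((:exp, "libm"), Float64, (Float64,), x)
```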
For (2), compile julia with USE_SYSTEM_LIBM=1.
@nolta Ah, interesting! Thanks. I was using Homebrew, where such options cannot be passed directly. I will try that.
If we implement log in Base, will we eventually abandon openlibm in favor of a pure julia implementation? I'm ok with that.
That is right. We can replace openlibm with a pure julia implementation, which might be easier to maintain and to optimize for vectorization, etc. Otherwise we just have to pile on more libraries.
Prompted by a different thread, I began to port a SIMD math library to Julia. Currently, exp and log are available as a proof of concept. See https://github.com/eschnett/Vecmathlib.jl .
This looks really cool!
@ntessore if you're computing log on arrays, and you're on OS X, you might want to try out the Accelerate library that comes with the OS. It is blazingly fast compared with the standard log (from what I understand, it uses SIMD operations and a bunch of extra threads), though it does appear to sacrifice some accuracy. I have started putting together a package here:
https://github.com/simonbyrne/Accelerate.jl
It lacks documentation, but if you just call Accelerate.log you should get what you want (if not, please file an issue). Once I have wrapped it up I'll put it on METADATA, but I'd be grateful for any feedback.
@simonbyrne Thanks for creating this, I will try and let you know if anything comes up.
Maybe we could put together a general Fast.jl package that offers fast but potentially less accurate math functions on all systems, choosing whatever options are best for a given platform. That way, code is less platform dependent, even though the performance might vary.
I tried looking at the Apple log function with a debugger, but they seem to have some basic reverse engineering protection in place (interestingly, this post is by one of their floating point gurus). However I did take a look at a hex dump of the library, and they have a table of pairs of values:
[(log(F),1/F) for F = 1.0:0x1p-8:2.0-0x1p-8]
which suggests that they are using a table lookup approach, similar to (but different from) Tang's algorithm. I'll try playing around with it and see how it goes.
I think we should be careful about this, since we may not be able to use the resulting code.
The Yeppp talk discusses avoiding table lookups for SIMD performance. Unfortunately, there is no paper on the details of their implementation - only the talk. The source is available, and is all assembly.
Well, that's basically the extent of my reverse engineering skills, and I don't think numbers are copyrightable, so we should be okay. Their approach intuitively makes sense: they use a slightly finer grained table, but can avoid the division (which messes up pipelining), and look up both points as a pair, which can be done in one step by exploiting the SIMD registers.
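To make the idea concrete, here is a guess at the reduction such a table enables (illustrative only, not Apple's actual code): write x = 2^k · m with m in [1, 2), pick the nearest grid point F = 1 + j/256, form r = (m − F) · (1/F) with the stored reciprocal instead of a division, and sum k·log(2) + log(F) + log1p(r), where |r| ≤ 2^−9 keeps the polynomial short:

```julia
# Table of (log(F), 1/F) pairs on the 1/256 grid, as found in the dump.
const LOGF = [Float64(log(big(F))) for F in 1.0:0x1p-8:2.0-0x1p-8]
const INVF = [1/F for F in 1.0:0x1p-8:2.0-0x1p-8]

# Assumes finite x > 0; ignores subnormals and special values.
function log_table(x::Float64)
    k = exponent(x)
    m = significand(x)              # m in [1, 2)
    j = round(Int, (m - 1) * 256)   # nearest grid index
    if j == 256                     # m close to 2: fold into the exponent
        k += 1; m *= 0.5; j = 0
    end
    F = 1 + j / 256
    r = (m - F) * INVF[j + 1]       # |r| <= 2^-9, no division needed
    # low-order log1p terms; a real implementation needs more terms and
    # careful error compensation to reach sub-ulp accuracy
    p = r - r^2 / 2 + r^3 / 3
    return k * log(2.0) + LOGF[j + 1] + p
end
```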
I tried translating the Yeppp code to Julia, but didn't see any real performance advantages: granted, I was using Float64 (which is harder to vectorise), and don't have fma support on my machine, so it's tough to tell.
If you still have that code, I would love to try it out.
The Yeppp! approach with table-free implementations specifically targets vector elementary functions, where only throughput matters. For scalar and short-SIMD versions the most important performance characteristic is latency, because even Haswell doesn't have enough entries in its re-order buffer to extract enough parallelism from sequential calls to saturate the execution units. I think that scalar/short-SIMD versions should use a variant of Tang's method. If you want to port Yeppp! functions to Julia, I'd suggest looking at the code in this file, which should be easier to port.
This was my attempt at a translation: https://github.com/simonbyrne/libm.jl/blob/master/src/log_yeppp.jl
This is about 10x slower than the built-in log. Just had to change @horner to @evalpoly.
Proper loop unrolling and SIMD are a must to get good performance from Yeppp! algorithms. Without loop unrolling the execution is severely latency-bound, resulting in low ILP.
@Maratyszcza I admit I just translated the code to check out the accuracy, I wasn't really trying to match performance.
I have to concede that I'm stumped as to how Apple implemented their algorithm. They only look up one value of log(F) (as opposed to the pair that Tang's algorithm uses), and the error from this alone should exceed what they get in practice. They must have some other tricks going on there. Clearly they employ some very good numerical programmers.
One value + Newton iteration, or something like that, maybe?
That was my guess, but to do a Newton step on y = log(x), you need to compute an exp(y), so I don't really see how that helps...
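To spell out the step being discussed: Newton's method on f(y) = exp(y) − x gives the update y ← y − (exp(y) − x)/exp(y) = y − 1 + x·exp(−y), so each refinement indeed costs a full exp evaluation.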
We could disassemble their code...
I'd guess they use a variant of Gal's accurate tables: the pivot points log(x) and 1/x are not on a grid; they are chosen in such a way that log(x) is accurate to more than double precision (i.e. several bits after its double-precision value are zeros).
No, they're certainly evenly spaced, as they're exactly:
[(Float64(log(big(F))),1/F) for F = 1.0:0x1p-8:2.0-0x1p-8]
and the errors of some of those values are up to 0.48 ulps (e.g. at 1.765625).
It could also be that the tables are for another function, but they also have log2 and log10 versions.
I also did some timing analysis, and they are a bit quicker on the range [0.74805, 1.49605], which suggests that's the domain of their range reduction.
I was trying to check if the polynomial evaluation in Yeppp benefits from SIMD and can give faster log computations on vectors. I extracted this from @simonbyrne's implementation of Yeppp's log. So far, the simd version is no faster than the regular one.
# Evaluate the polynomial kernel element-wise, without @simd.
function loghorner!(b::Vector{Float64}, a::Vector{Float64})
    n = length(a)
    for i = 1:n
        @inbounds b[i] = loghorner(a[i])
    end
    return b
end

# Same loop, but annotated with @simd.
function loghorner_simd!(b::Vector{Float64}, a::Vector{Float64})
    n = length(a)
    @simd for i = 1:n
        @inbounds b[i] = loghorner(a[i])
    end
    return b
end

# The 19-term polynomial from Yeppp's log, with hex float coefficients.
@inline function loghorner(x::Float64)
    Base.Math.@horner(x,
        -0x1.FFFFFFFFFFFF2p-2,
        0x1.5555555555103p-2,
        -0x1.00000000013C7p-2,
        0x1.9999999A43E4Fp-3,
        -0x1.55555554A6A2Bp-3,
        0x1.249248DAE4B2Ap-3,
        -0x1.FFFFFFBD8606Dp-4,
        0x1.C71C90DB06248p-4,
        -0x1.9999C5BE751E3p-4,
        0x1.745980F3FB889p-4,
        -0x1.554D5ACD502ABp-4,
        0x1.3B4ED39194B87p-4,
        -0x1.25480A82633AFp-4,
        0x1.0F23916A44515p-4,
        -0x1.EED2E2BB64B2Ep-5,
        0x1.EA17E14773369p-5,
        -0x1.1654764F478ECp-4,
        0x1.0266CD08DB2F2p-4,
        -0x1.CC4EC078138E3p-6)
end
Cc: @ArchRobison @aviks
Unrolling is necessary to extract sufficient parallelism in polynomial evaluation. The optimal unroll factor depends on the microarchitecture, e.g. 8 SIMD registers/iteration for Nehalem/Sandy Bridge/Ivy Bridge, 10 SIMD registers for Haswell, and 5 AVX registers for Bulldozer/Piledriver/Steamroller. Maybe Julia needs a vector polynomial evaluation function? Note: LLVM does not detect/unroll latency-bound loops automatically.
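As an illustration of such a function (a sketch; the unroll factor of 4 and all names here are made up), running several independent Horner recurrences per iteration hides the latency of the multiply-add chain:

```julia
# Evaluate the polynomial with coefficients c (c[1] = constant term) at
# each element of x, four independent Horner chains per iteration.
function evalpoly_unrolled4!(y::Vector{Float64}, x::Vector{Float64},
                             c::Vector{Float64})
    n = length(x)
    @assert n % 4 == 0 && length(y) == n
    @inbounds for i in 1:4:n
        s1 = s2 = s3 = s4 = c[end]
        for k in length(c)-1:-1:1
            # muladd lets the compiler emit FMA instructions where available
            s1 = muladd(s1, x[i],     c[k])
            s2 = muladd(s2, x[i + 1], c[k])
            s3 = muladd(s3, x[i + 2], c[k])
            s4 = muladd(s4, x[i + 3], c[k])
        end
        y[i] = s1; y[i + 1] = s2; y[i + 2] = s3; y[i + 3] = s4
    end
    return y
end
```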
@ViralBShah From the IR, it looks like LLVM already vectorizes (but doesn't partially unroll) loghorner! without @simd. The only difference is that the version with @simd elides a check for aliasing before the loop executes.
@simonbyrne I did notice the vectorized operations in the IR, but was hoping that @simd would do a few more things.
@Maratyszcza It is a good idea to have a vector polynomial evaluation, especially if we get all the unrolling and stuff implemented right.
Also, it would be nice to leverage the FMAs here.
@simonbyrne Could you update the openlibm log implementation or provide a julia implementation that is faster? I am looking for the fastest log I can get right now, and even a 10% improvement is worthwhile.
I'll try to get a PR together this week. It would be nice to have the FMA stuff sorted before we merge it.
I just benchmarked Yeppp, and it is about 8x faster for log and exp on vectors of size 10,000. Between 1,000 and 100,000 elements, one gets anywhere between 2x and 8x. At smaller vector lengths, sometimes julia is faster and sometimes Yeppp - too much measurement noise in my command-line tests.
The log function can often be the dominant cost in certain algorithms, particularly those related to sampling and fitting statistical models, e.g. see this discussion.
I've implemented the Tang (1990) table-based algorithm here: https://github.com/simonbyrne/libm.jl/blob/master/src/log_tang.jl
It seems to be about 10% faster than the openlibm function, but there is probably still room for improvement. Does anyone have any suggestions?
For reference:
Once we have native fused multiply-add operations (#8112), we can probably speed it up further (on processors that support them).