If you can give me a simple way of doing benchmarks, I can compare openlibm and system libm on OS X 10.6, to see whether the performance improvement was already present on that version. But it's not my machine, so I can't install development tools.
I also found the Apple log to be faster in tests I had done a while back when I was putting openlibm together.
Have you compared with Intel's log in MKL? They may also have fast vectorized versions. Perhaps @ArchRobison knows something about this.
@nalimilan I have a very, very rough test here: https://github.com/simonbyrne/libm.jl/blob/master/test/perf/log.jl But we probably should use something more rigorous (e.g. testing different ranges).
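A slightly more systematic harness might look like this sketch (the names and the set of ranges here are illustrative, not the linked script, and the `"libm"` library name may need adjusting per platform, e.g. `"libm.so.6"` on some Linux systems):

```julia
# Time several log implementations over several input ranges, accumulating
# the results into a sum so the calls are not optimized away.
function bench_log(impls, ranges; n = 10^6)
    for (lo, hi) in ranges
        x = lo .+ (hi - lo) .* rand(n)
        for (name, f) in impls
            s = 0.0
            t = @elapsed for xi in x
                s += f(xi)
            end
            println(rpad(name, 10), "[", lo, ", ", hi, "): ", t, " s")
        end
    end
end

bench_log(Dict("julia"  => log,
               "system" => x -> ccall((:log, "libm"), Float64, (Float64,), x)),
          [(0.5, 1.0), (1.0, 2.0), (2.0, 1e10)])
```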
@ViralBShah I don't have access to MKL, so have not compared it.
@simonbyrne It is intalled on julia.mit.edu, if you have an account there. If you would like it, I can create one for you - just drop me an email. Pretty beefy machine for julia development - 80 cores / 1TB RAM.
Here's the timings using the above test script on julia.mit.edu:
Openlibm:
elapsed time: 0.366460477 seconds (13824 bytes allocated)
System:
elapsed time: 0.752183874 seconds (96 bytes allocated)
Intel math:
elapsed time: 0.597063228 seconds (96 bytes allocated)
Julia fdlibm:
elapsed time: 0.368376388 seconds (96 bytes allocated)
Julia tang:
elapsed time: 0.284002105 seconds (96 bytes allocated)
(here Intel math is the ICC libimf.so, not the MKL routines: I'm still figuring those out).
So that's a 25% boost over openlibm (much more than on my machine).
@ViralBShah I haven't had any luck figuring out how to call the MKL vector math routines from within julia: any tips?
Sorry, none of those are hooked up. It has been a while since I looked at the Intel manual. IMF has the libm stuff, which you can get at with USE_INTEL_LIBM.
Have a look here: https://github.com/simonster/VML.jl
Perhaps worth also looking at Yeppp.
Thanks, I'll try out the package. I had a look at Yeppp's code: their log function is pretty crazy. As they outline in this talk, they intentionally avoid table lookups and divisions, which means they need a 20-term polynomial expansion. This is pretty much the opposite of Tang's approach, which uses the table as a way to get a shorter polynomial.
That implementation is crazy! I wonder if we can leverage Yeppp or integrate parts of it into openlibm. The build process is completely crazy too.
@simonbyrne On Mac OS X 10.6 with Julia 0.2.1, I get these results (with an old 2GHz Core 2 Duo). Unfortunately, I could not test your implementation, as cloning the package fails.
julia> @testsum "System:" syslog X
@testsum "Openlibm:" log X
Warning: Possible conflict in library symbol log
System:
elapsed time: 0.463031635 seconds (64 bytes allocated)
Openlibm:
elapsed time: 0.455063121 seconds (64 bytes allocated)
Is the warning expected or does it indicate that the same library was used for both calls? If not, it would seem to indicate that Apple's libm improved since 10.6, or that it got better than openlibm on recent CPUs.
@nalimilan Thanks for looking at it. I'm not sure what the warning is about, but it could be a mix-up between Base.log and libm.log (from my package: I really should rename it...).
Perhaps the reason they refuse to release the source is that they've licensed a proprietary library?
It turns out that julia.mit.edu doesn't support AVX. @simonster Any ideas on which library I should link to?
@nalimilan That warning is about a symbol imported from a shared library; i.e. it's warning that a log function is already defined in the Julia process, and you're loading a library that redefines that function. I don't remember what we changed in the last year, but the warning might mean that the same log is being called in both cases.
Actually, the warning only appears after calling log once, which I think loads openlibm. But the timings are the same before and after that, so the benchmark is probably fine. (Libm.jl isn't involved, since I could not even install it on 0.2.1.)
It looks as though Apple have changed their algorithm, as they get more accurate results than a standard implementation of Tang's algorithm would provide. On 10 million points:
Algorithm | Max relative error (units in the last place)
--- | ---
Openlibm | 0.863
OS X 10.9 libm | 0.508
Tang's algorithm | 0.555 (theoretical bound: 0.566)
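A check along these lines can be made against a BigFloat reference; this is a sketch of the measurement, not necessarily the script used for the table above:

```julia
# Maximum error, in units in the last place, of `f` relative to a
# high-precision log computed with BigFloat.
function max_ulp_error(f, xs)
    worst = 0.0
    for x in xs
        exact = log(big(x))
        err = Float64(abs(big(f(x)) - exact) / eps(Float64(exact)))
        worst = max(worst, err)
    end
    return worst
end

# e.g. sample 10^5 points spread across many binades:
max_ulp_error(log, 2.0 .^ (60 .* rand(10^5) .- 30))
```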
I have an application that does a lot of raw number crunching. Redefining the basic special functions log, exp, ... to use the system libm as per @simonbyrne's example immediately cut down the runtime by ~20%. So I would be very much interested in (1) either a faster implementation of these basic functions, or (2) a way to tell Julia to use the system libm automatically.
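For reference, the redefinition mentioned above can be a one-line ccall into the system library (a sketch; `syslog`/`sysexp` are just illustrative names, and the `"libm"` spelling may differ by platform):

```julia
# Route log/exp through the system libm rather than openlibm.
syslog(x::Float64) = ccall((:log, "libm"), Float64, (Float64,), x)
sysexp(x::Float64) = ccall((:exp, "libm"), Float64, (Float64,), x)
```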
For (2), compile julia with USE_SYSTEM_LIBM=1.
@nolta Ah, interesting! Thanks. I was using Homebrew, where such options cannot be passed directly. I will try that.
If we implement log in Base, will we eventually abandon openlibm in favor of a pure julia implementation? I'm ok with that.
That is right. We can replace openlibm with a pure julia implementation, which might be easier to maintain and to optimize for vectorization, etc. Otherwise we just have to pile on more libraries.
Prompted by a different thread, I began to port a SIMD math library to Julia. Currently, exp and log are available as a proof of concept. See https://github.com/eschnett/Vecmathlib.jl .
This looks really cool!
@ntessore if you're computing log on arrays, and you're on OS X, you might want to try out the Accelerate library that comes with the OS. It is blazingly fast compared with the standard log (from what I understand, it uses SIMD operations and a bunch of extra threads), though it does appear to sacrifice some accuracy. I have started putting together a package here:
https://github.com/simonbyrne/Accelerate.jl
It lacks documentation, but if you just call Accelerate.log you should get what you want (if not, please file an issue). Once I have wrapped it up I'll put it on METADATA, but I'd be grateful for any feedback.
@simonbyrne Thanks for creating this, I will try and let you know if anything comes up.
Maybe we could put together a general Fast.jl package that offers fast but potentially less accurate math functions on all systems, choosing whatever options are best for a given platform. That way, code is less platform dependent, even though the performance might vary.
I tried looking at the Apple log function with a debugger, but they seem to have some basic reverse engineering protection in place (interestingly, this post is by one of their floating point gurus). However I did take a look at a hex dump of the library, and they have a table of pairs of values:
[(log(F),1/F) for F = 1.0:0x1p-8:2.0-0x1p-8]
which suggests that they are using a table lookup approach, similar to (but different from) Tang's algorithm. I'll try playing around with it and see how it goes.
I think we should be careful about this, since we may not be able to use the resulting code.
The Yeppp talk discusses avoiding table lookups for SIMD performance. Unfortunately, there is no paper on the details of their implementation - only the talk. The source is available, and is all assembly.
Well, that's basically the extent of my reverse engineering skills, and I don't think numbers are copyrightable, so we should be okay. Their approach intuitively makes sense: they use a slightly finer grained table, but can avoid the division (which messes up pipelining), and look up both points as a pair, which can be done in one step by exploiting the SIMD registers.
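To make the idea concrete, here is a guess at the reduction such a table enables (illustrative only, not Apple's actual code): write x = 2^k · m with m in [1, 2), pick the nearest grid point F = 1 + j/256, form r = (m − F) · (1/F) with the stored reciprocal instead of a division, and sum k·log(2) + log(F) + log1p(r), where |r| ≤ 2^−9 keeps the polynomial short:

```julia
# Table of (log(F), 1/F) pairs on the 1/256 grid, as found in the dump.
const LOGF = [Float64(log(big(F))) for F in 1.0:0x1p-8:2.0-0x1p-8]
const INVF = [1/F for F in 1.0:0x1p-8:2.0-0x1p-8]

# Assumes finite x > 0; ignores subnormals and special values.
function log_table(x::Float64)
    k = exponent(x)
    m = significand(x)              # m in [1, 2)
    j = round(Int, (m - 1) * 256)   # nearest grid index
    if j == 256                     # m close to 2: fold into the exponent
        k += 1; m *= 0.5; j = 0
    end
    F = 1 + j / 256
    r = (m - F) * INVF[j + 1]       # |r| <= 2^-9, no division needed
    # low-order log1p terms; a real implementation needs more terms and
    # careful error compensation to reach sub-ulp accuracy
    p = r - r^2 / 2 + r^3 / 3
    return k * log(2.0) + LOGF[j + 1] + p
end
```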
I tried translating the Yeppp code to Julia, but didn't see any real performance advantages: granted, I was using Float64 (which is harder to vectorise), and don't have fma support on my machine, so it's tough to tell.
If you still have that code, I would love to try it out.
The Yeppp! approach with table-free implementations specifically targets vector elementary functions, where only throughput matters. For scalar and short-SIMD versions the most important performance characteristic is latency, because even Haswell doesn't have enough entries in its re-order buffer to extract enough parallelism from sequential calls to saturate the execution units. I think that scalar/short-SIMD versions should use a variant of Tang's method. If you want to port Yeppp! functions to Julia, I'd suggest looking at the code in this file, which should be easier to port.
This was my attempt at a translation: https://github.com/simonbyrne/libm.jl/blob/master/src/log_yeppp.jl
This is about 10x slower than the built-in log. Just had to change @horner to @evalpoly.
Proper loop unrolling and SIMD are a must to get good performance from Yeppp! algorithms. Without loop unrolling the execution is severely latency-bound, resulting in low ILP.
@Maratyszcza I admit I just translated the code to check out the accuracy, I wasn't really trying to match performance.
I have to concede that I'm stumped as to how Apple implemented their algorithm. They only look up one value of log(F) (as opposed to the pair that Tang's algorithm uses), and the error from this alone should exceed what they get in practice. They must have some other tricks going on there. Clearly they employ some very good numerical programmers.
One value + Newton iteration, or something like that, maybe?
That was my guess, but to do a Newton step on y = log(x), you need to compute an exp(y), so I don't really see how that helps...
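To spell out the step being discussed: Newton's method on f(y) = exp(y) − x gives the update y ← y − (exp(y) − x)/exp(y) = y − 1 + x·exp(−y), so each refinement indeed costs a full exp evaluation.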
We could disassemble their code...
I'd guess they use a variant of Gal's accurate tables: the pivot points log(x) and 1/x are not on a grid; they are chosen in such a way that log(x) is accurate to more than double precision (i.e. several bits after its double-precision value are zeros).
No, they're certainly evenly spaced, as they're exactly:
[(Float64(log(big(F))),1/F) for F = 1.0:0x1p-8:2.0-0x1p-8]
and the errors of some of those values are up to 0.48 ulps (e.g. at 1.765625).
It could also be that the tables are for another function, but they also have log2 and log10 versions.
I also did some timing analysis, and they are a bit quicker on the range [0.74805, 1.49605], which suggests that's the domain of their range reduction.
I was trying to check if the polynomial evaluation in Yeppp benefits from SIMD and can give faster log computations on vectors. I extracted this from @simonbyrne's implementation of Yeppp's log. So far, the simd version is no faster than the regular one.
# Evaluate the polynomial kernel element-wise, without @simd.
function loghorner!(b::Vector{Float64}, a::Vector{Float64})
    n = length(a)
    for i = 1:n
        @inbounds b[i] = loghorner(a[i])
    end
    return b
end

# Same loop, but annotated with @simd.
function loghorner_simd!(b::Vector{Float64}, a::Vector{Float64})
    n = length(a)
    @simd for i = 1:n
        @inbounds b[i] = loghorner(a[i])
    end
    return b
end

# The 19-term polynomial from Yeppp's log, with hex float coefficients.
@inline function loghorner(x::Float64)
    Base.Math.@horner(x,
        -0x1.FFFFFFFFFFFF2p-2,
        0x1.5555555555103p-2,
        -0x1.00000000013C7p-2,
        0x1.9999999A43E4Fp-3,
        -0x1.55555554A6A2Bp-3,
        0x1.249248DAE4B2Ap-3,
        -0x1.FFFFFFBD8606Dp-4,
        0x1.C71C90DB06248p-4,
        -0x1.9999C5BE751E3p-4,
        0x1.745980F3FB889p-4,
        -0x1.554D5ACD502ABp-4,
        0x1.3B4ED39194B87p-4,
        -0x1.25480A82633AFp-4,
        0x1.0F23916A44515p-4,
        -0x1.EED2E2BB64B2Ep-5,
        0x1.EA17E14773369p-5,
        -0x1.1654764F478ECp-4,
        0x1.0266CD08DB2F2p-4,
        -0x1.CC4EC078138E3p-6)
end
Cc: @ArchRobison @aviks
Unrolling is necessary to extract sufficient parallelism in polynomial evaluation. The optimal unroll factor depends on the microarchitecture, e.g. 8 SIMD registers/iteration for Nehalem/Sandy Bridge/Ivy Bridge, 10 SIMD registers for Haswell, and 5 AVX registers for Bulldozer/Piledriver/Steamroller. Maybe Julia needs a vector polynomial evaluation function? Note: LLVM does not detect/unroll latency-bound loops automatically.
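As an illustration of such a function (a sketch; the unroll factor of 4 and all names here are made up), running several independent Horner recurrences per iteration hides the latency of the multiply-add chain:

```julia
# Evaluate the polynomial with coefficients c (c[1] = constant term) at
# each element of x, four independent Horner chains per iteration.
function evalpoly_unrolled4!(y::Vector{Float64}, x::Vector{Float64},
                             c::Vector{Float64})
    n = length(x)
    @assert n % 4 == 0 && length(y) == n
    @inbounds for i in 1:4:n
        s1 = s2 = s3 = s4 = c[end]
        for k in length(c)-1:-1:1
            # muladd lets the compiler emit FMA instructions where available
            s1 = muladd(s1, x[i],     c[k])
            s2 = muladd(s2, x[i + 1], c[k])
            s3 = muladd(s3, x[i + 2], c[k])
            s4 = muladd(s4, x[i + 3], c[k])
        end
        y[i] = s1; y[i + 1] = s2; y[i + 2] = s3; y[i + 3] = s4
    end
    return y
end
```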
@ViralBShah From the IR, it looks like LLVM already vectorizes (but doesn't partially unroll) loghorner! without @simd. The only difference is that the version with @simd elides a check for aliasing before the loop executes.
@simonbyrne I did notice the vectorized operations in the IR, but was hoping that @simd would do a few more things.
@Maratyszcza It is a good idea to have a vector polynomial evaluation, especially if we get all the unrolling and stuff implemented right.
Also, it would be nice to leverage the FMAs here.
@simonbyrne Could you update the openlibm log implementation or provide a julia implementation that is faster? I am looking for the fastest log I can get right now, and even a 10% improvement is worthwhile.
I'll try to get a PR together this week. It would be nice to have the FMA stuff sorted before we merge it.
I just benchmarked Yeppp, and it is about 8x faster for log and exp on vectors of size 10,000. Between 1,000 and 100,000 elements, one gets anywhere between 2x and 8x. At smaller vector lengths, sometimes julia is faster and sometimes Yeppp - too much measurement noise in my command-line tests.
The log function can often be the dominant cost in certain algorithms, particularly those related to sampling and fitting statistical models, e.g. see this discussion.
I've implemented the Tang (1990) table-based algorithm here: https://github.com/simonbyrne/libm.jl/blob/master/src/log_tang.jl
It seems to be about 10% faster than the openlibm function, but there is probably still room for improvement. Does anyone have any suggestions?
For reference:
Once we have native fused multiply-add operations (#8112), we can probably speed it up further (on processors that support them).