JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Micro benchmarks for mandelbrot are misleading #24097

Closed KristofferC closed 7 years ago

KristofferC commented 7 years ago

In the micro benchmarks for mandelbrot:

https://github.com/JuliaLang/julia/blob/5295feb473e0e6ccf695c91bb45e05908ce381c8/test/perf/micro/perf.jl#L42-L52

changing the line:

if abs(z) > 2

to

if sqrt(abs2(z)) > 2

makes the benchmark go from 115.336 μs to 58.151 μs.

This is because Julia (as it should) uses the more accurate hypot function to compute abs, which involves some branches. However, some of the other benchmark implementations use the simpler computation, equivalent to sqrt(abs2):

https://github.com/JuliaLang/julia/blob/5295feb473e0e6ccf695c91bb45e05908ce381c8/test/perf/micro/perf.js#L62

https://github.com/JuliaLang/julia/blob/5295feb473e0e6ccf695c91bb45e05908ce381c8/test/perf/micro/java/src/main/java/PerfPure.java#L318

Perhaps all implementations should just use sqrt(abs2) to not penalize those that are correctly using hypot?
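
For concreteness, here is a minimal sketch of the benchmark's escape-time kernel (simplified from the linked perf.jl, not the exact benchmark code), with the faster variant of the test in place:

```julia
# Simplified Mandelbrot escape-time kernel (sketch, not the exact perf.jl code).
function mandel(z::Complex{Float64})
    c = z
    maxiter = 80
    for n = 1:maxiter
        # Current test: abs(z) > 2, where abs(z) goes through the careful hypot path.
        # Faster variant discussed above: sqrt(abs2(z)) > 2, i.e. sqrt(re^2 + im^2).
        if sqrt(abs2(z)) > 2
            return n - 1
        end
        z = z^2 + c
    end
    return maxiter
end

mandel(complex(-0.5, 0.5))  # example call
```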

KristofferC commented 7 years ago

Ref https://github.com/JuliaLang/julia/pull/23087

KristofferC commented 7 years ago

The alternative is of course to port our version of hypot to the other languages.

timholy commented 7 years ago

Given the discussion in #23087, it seems porting is the better choice. Either way, we should definitely make sure that all languages are performing the same calculation.

ViralBShah commented 7 years ago

I think every language should have a good hypot. :-)

jebej commented 7 years ago

Java has a hypot function; I can make a PR to use it if that sounds appropriate.

affans commented 7 years ago

Could someone explain why one is faster? I don't get why calling hypot is slower.

oscardssmith commented 7 years ago

hypot takes special care not to lose precision when the inputs are of very different sizes, which has a non-trivial performance cost.
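
To illustrate the general idea (this is only a sketch of the usual scaling trick, not Julia's actual Base implementation of hypot):

```julia
# Naive magnitude: one multiply-add and a sqrt, but the intermediate squares
# can overflow or underflow for extreme inputs.
naive_abs(x, y) = sqrt(x*x + y*y)

# Sketch of the scaling idea behind hypot: factor out the larger magnitude so
# the squares stay in range, at the cost of extra branches and a division.
function scaled_hypot(x, y)
    ax, ay = abs(x), abs(y)
    big, small = max(ax, ay), min(ax, ay)
    big == 0 && return zero(big)
    r = small / big
    return big * sqrt(1 + r*r)
end

naive_abs(3.0, 4.0)     # 5.0
scaled_hypot(3.0, 4.0)  # 5.0 -- same answer, but via the branchier, safer path
```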

ViralBShah commented 7 years ago

https://en.wikipedia.org/wiki/Hypot

lobingera commented 7 years ago

Maybe my two cents here: the issue title is already incorrect, because the benchmark shows the correct runtime for a basic mandelbrot implementation. For a language that supports z as a complex value, the abs(z) > 2 test is the normative implementation in code. Period.

Of course, other tests of whether the z = z^2 + c sequence remains bounded are available, and implementations of the algorithm in languages that don't have a complex data type exist and look very different (I remember an integer/fixed-point-only C implementation). Essentially this is a very, very nice parallelization target, with independent calculations per entry in the result matrix/bitmap. So comparing languages with complex and non-complex arithmetic in this benchmark is already questionable, and using a different test in Julia would be a language-specific optimization.

KristofferC commented 7 years ago

@lobingera it seems you have misunderstood the point of this issue quite severely. The whole point is that the different languages are not computing the same thing, which makes comparing them misleading. No one is saying that Julia should do any language-specific optimizations.

mschauer commented 7 years ago

If "numerical accuracy" is a claim Julia makes (see e.g. second line in the first paragraph on julialang.org), isn't it then also fair to say that this might take some extra time here and there?

lobingera commented 7 years ago

@KristofferC you are right, I got confused. But you also ... all the programs are computing the same thing: when does the sequence z = z*z + c exceed a bound (and mark that somewhere in memory; AFAICS all benchmarks assert the sum value, so they seem to do the same thing). Testing the bound is straightforward: abs(z) in all languages where it exists, and sqrt(re*re + im*im) where not.

You might argue that sqrt(abs2(z)) in Julia is closer to sqrt(re*re + im*im) because you know what's inside abs2 and abs. But the 'average new Julia user trying the language for the first time' will use abs(z). And by the way, this already seems to be faster than C (calling cabs(z)). What more do you want?

KristofferC commented 7 years ago

Again, I just want all the languages to test the same thing. If we decide that the definition is abs(z) = hypot(real(z), imag(z)), then we shouldn't implement it as sqrt(real(z)^2 + imag(z)^2) in other languages.

johnfgibson commented 7 years ago

How about using the test abs2(z) > 4 in all languages? That's a straightforward mathematical optimization that would be available or implementable in all languages, it would be a fair comparison, and it would probably eliminate the surprising variation in mandelbrot performance in the micro benchmarks.
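
Just to spell out the equivalence: since both sides are non-negative and squaring is monotonic, abs(z) > 2 holds exactly when abs2(z) > 4, and the latter needs no square root or hypot at all. A small Julia illustration (values chosen arbitrarily):

```julia
z = complex(1.9, 1.1)

abs(z) > 2    # true
abs2(z) > 4   # true -- same decision, no sqrt or hypot involved

# In the benchmark kernel the escape test would simply become:
#     if abs2(z) > 4
#         return n - 1
#     end
```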

On a side note, I've been thinking the benchmark page https://julialang.org/benchmarks/ could use some brief verbal descriptions of the benchmark algorithms, along with any known explanations for oddities like this. E.g. if we keep the mandelbrot benchmarks as is, we explain that the odd variation is due to some languages using the more accurate but more expensive hypot function inside abs.

lobingera commented 7 years ago

@KristofferC, it's actually not that complicated: the definition of abs(z) is the absolute value of a complex number (in polar form), and a lot of math textbooks tell me abs(z) = sqrt(re*re + im*im). You are free to implement this straight ahead - and have fun with the numerics - or decide on another implementation like https://en.wikipedia.org/wiki/Hypot, as Julia seems to have done.

KristofferC commented 7 years ago

Bringing your math book into the world of floating point numbers tends to be a bad idea.

Anyway, the only point here is that if different languages use different algorithms to compute the same thing, then the comparison is meaningless. I'm sure you agree with that.

mauro3 commented 7 years ago

I think the question to ask is: is the use of hypot needed to accurately do the calculation of the benchmark? If "no", then Julia should use abs2. Conversely, if the answer is "yes", then Julia should use hypot and for all languages which don't have hypot, the benchmark would need to implement it as a separate, helper function.

(EDIT: the below was wrong, see https://github.com/JuliaLang/julia/issues/24097#issuecomment-338500862.)

It turns out that the answer is "no", as is shown by running with abs vs abs2 and using the test in line 55. So let's use abs2(x) > 2. As a side note, the answer goes from "no" to "yes" when reducing the step in line 54 from 0.01 to 0.001.

StefanKarpinski commented 7 years ago

As a side note, the answer goes from "no" to "yes" when reducing the step in line 54 from 0.01 to 0.001.

That's an interesting observation. Of course, a priori, one wouldn't know that it doesn't matter for 0.01, so in a sense not using abs is optimizing for the known answer in a way that a real world problem couldn't afford to do. However, I suspect that the abs2 version is also more accurate.

KristofferC commented 7 years ago

@johnfgibson How do you feel about updating the benchmarks again with abs2(z) > 4? I could update the code, but I don't have the whole stack set up to run all the benchmarks.

johnfgibson commented 7 years ago

Seems like this thread is converging on using hypot everywhere, implementing that as a function where necessary. I'm still mopping up a few things on the benchmarks, so I'll fold this in. Ok?

It turns out implementing hypot everywhere is a bit more straightforward than implementing abs2, simply because I don't understand whether real(z)*real(z) or real(z)^2 or square(real(z)) is best in all these languages.

KristofferC commented 7 years ago

Seems like this thread is converging on using hypot everywhere, implementing

I think on the contrary:

So let's use abs2(x) > 2 [sic should be 4]

However, I suspect that the abs2 version is also more accurate.

So just use abs2 in those languages that have it, and implement abs2(z) = re(z)*re(z) + im(z)*im(z) in those that don't?
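
In Julia syntax, the helper to be ported is just a one-liner (a sketch; each language would spell it in its own way, and the name myabs2 is only for illustration):

```julia
# User-defined abs2: sum of squares of the real and imaginary parts, no sqrt.
# This mirrors what Base.abs2 does for complex arguments.
myabs2(z::Complex) = real(z)*real(z) + imag(z)*imag(z)

myabs2(3.0 + 4.0im)      # 25.0
myabs2(3.0 + 4.0im) > 4  # true -> the point has escaped
```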

StefanKarpinski commented 7 years ago

That does seem easiest, although it forgoes testing the performance of an accurate abs implementation; but we can always showcase numerical accuracy elsewhere. I think we already have a well-deserved reputation for being quite meticulous about that kind of thing.

johnfgibson commented 7 years ago

@KristofferC Sounds good, I'll do abs2. I should have a PR with this and a few other things today or early next week. I got Go running again yesterday.

johnfgibson commented 7 years ago

Performance of iteration_mandelbrot using abs2 versus the prior mix of hypot and non-hypot abs functions.

| language    | prior | abs2   |
|-------------|-------|--------|
| c           | 0.266 | 0.074  |
| fortran     | 0.237 | 0.056  |
| go          | 0.185 | 0.060  |
| java        | 0.096 | 0.084  |
| javascript  | 0.084 | 0.08   |
| julia       | 0.164 | 0.054  |
| lua         | 0.103 | 0.071  |
| mathematica | 1.442 | 1.445  |
| matlab      | 1.316 | 461.4  |
| octave      | 166.0 | 1003.0 |
| python      | 2.896 | 10.50  |
| r           | 11.0  | 25.0   |

The results for Python, Matlab, and Octave show a big hit for switching to the user-defined function over the built-in. (Julia was the only language with a built-in abs2 function.) If abs2 is meant to be a sqrt-free optimization over abs, I'm not sure this switch is fair to those languages.

I could test the built-in abs functions to see if they're using hypot underneath, and implement hypot where necessary. That seems more fair to me: we would be imposing the cost of a user-defined function only on languages that didn't implement an accurate abs.

EDIT: another alternative would be to inline the abs2 function wherever it helps.

StefanKarpinski commented 7 years ago

Python, Matlab, and Octave show a big hit for switching to the user-defined function over the built-in.

If abs2 is meant to be a sqrt-free optimization over abs, I'm not sure this switch is fair to those languages.

Not being able to run user-defined functions efficiently seems like an entirely fair thing to measure. That's part of the whole point. Julia's "built in" abs2 is literally just defined as:

https://github.com/JuliaLang/julia/blob/39f668cc67bfa95782b27c78b1a5ab4d1d3c1d54/base/complex.jl#L244

If a language implementation can't make something that simple run fast, that's a pretty severe limitation and one that's certainly worth measuring. Big picture, this issue is a microcosm of the complaints we get about these micro benchmarks:

The only way people won't complain is if we make the benchmarks no longer apples to apples and let each language use whatever its preferred "pet approach" is – at which point the benchmarks are totally meaningless. The real takeaway is that, instead of kvetching that people aren't being fair, people should note that Julia is fast and accurate whichever way you prefer to write this benchmark.

StefanKarpinski commented 7 years ago

At this point I think we should just do the most straightforward thing: implement the abs2 function in each language, use it, and just say that this is now the benchmark. We can do a user-defined abs2 function in Julia as well, but of course it will be identical to the built-in one.

nalimilan commented 7 years ago

At least, if a language implements the precise function and it's faster than the custom implementation, we had better use it. It would seem very unfair to use a slower custom approach just because the default is "too correct".

In general, I'd tend to be conservative, as being accused of cheating is much worse than a slightly slower Julia benchmark.

StefanKarpinski commented 7 years ago

The Julia benchmark is among the fastest in all of the apples-to-apples comparisons we've done. The abs2 approach is the simplest to implement across languages. It also happens to look bad for a bunch of languages with poor support for user-defined functions, but unless someone really feels like implementing hypot correctly in all of the languages, that seems unavoidable. Note that implementing hypot will be just as slow, if not slower. So basically, the only thing that doesn't make languages with poor user-function support look bad is a meaningless comparison where they do less work.

ViralBShah commented 7 years ago

I like using this benchmark as a proxy for the performance of user-defined functions, which is something that has been missing from our benchmarks. We could even announce this change in a blog post, with all the data.

johnfgibson commented 7 years ago

@StefanKarpinski & @ViralBShah: Makes total sense, especially if we switch the name to userfunc_mandelbrot or userfunc_abs2 and provide a link to short descriptions of what issue each benchmark is trying to highlight.

I'd also be happy to implement hypot in all the languages. It wouldn't be hard, it would touch the same issue, and since it's based on the intent of accuracy instead of a mistaken guess about optimizing out the sqrt in abs, it might trigger fewer off-base complaints about unfairness. And userfunc_hypot would get the point across. Let me know if you prefer that; otherwise I'll go with abs2.

KristofferC commented 7 years ago

@mauro3 Could you repeat your experiment (and make sure you test abs2(z) > 4)? I get no difference, even with a small step size.

PallHaraldsson commented 7 years ago

It seems like a no-brainer that we should use a calculation that does the same thing in all languages. EDIT: I don't mind too much whether we use abs or abs2, as long as the other languages do the same, but it is probably best to show both in the benchmark (or at least in a blog post), e.g. also an "iteration_mandel_accurate".

I think, however, that we should at a minimum a) explain why Julia "improved" relatively, i.e. the language didn't; only the Mandelbrot code was changed "to be fair". Note that despite the name "mandel", I think the goal isn't strictly to calculate the set or hypot, but to show loop speed; wasn't the name changed to "iteration_mandel"?

b) We could use this opportunity to explain that the original Mandelbrot code in Julia was more accurate before (if* it was), i.e. the obvious code may be slower, but that need not mean it is worse - quite possibly the opposite.

We could link to this issue or even:

https://github.com/JuliaLang/julia/issues/24097#issuecomment-336366616

@mschauer "If "numerical accuracy" is a claim Julia makes"

*You can zoom into the Mandelbrot set to infinity, and my understanding is that hypot will at some point be better. But for benchmark purposes you don't do that, and at some zoom level (or none) we get bit-identical pictures. If we wish, we could show (in a blog post) how the slower, straightforward Julia code gets more accurate pictures.

johnfgibson commented 7 years ago

Oops, the issue is not accuracy, it's overflow/underflow for very large/small x or y. hypot and the naive abs are equally accurate for the order-1 x and y in this benchmark. So sticking with abs2 makes sense.
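
A quick illustration of that distinction (the specific values are only for demonstration, they are not taken from the benchmark):

```julia
# For order-1 inputs, as in this benchmark, the two agree (to rounding):
x, y = 1.3, 0.7
sqrt(x^2 + y^2) ≈ hypot(x, y)   # true

# The naive formula only breaks down when the intermediate squares
# overflow or underflow:
big = 1e200
sqrt(big^2 + big^2)   # Inf  (big^2 overflows)
hypot(big, big)       # ≈ 1.4142135623730951e200

tiny = 1e-200
sqrt(tiny^2 + tiny^2) # 0.0  (tiny^2 underflows)
hypot(tiny, tiny)     # ≈ 1.4142135623730951e-200
```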

mbauman commented 7 years ago

This would make an incredible blog post. If you're actually willing to implement hypot in all the languages, @johnfgibson, then it'd be amazing to compare all these possible algorithms across the different languages. For example, there could be up to six benchmarks for each language: user functions that perform naive abs, abs2, and hypot, and then the equivalent builtins if they exist. That would mean that Julia would have five data points, since it doesn't export a naive abs "builtin"… and the user/builtin benchmark pairs for Julia should be identical.

This would be really compelling.

Edit: to be clear on this issue — I agree we should just pick one of the user-defined options for the julialang.org benchmark.

foobarlv commented 7 years ago

Using abs2(x) > 4.0 is the correct code; using high-precision math is just stupid there. Using Float32 for the convergence test would also be correct and admissible, if faster (but it would not be admissible for the iteration!). Any floating-point errors only affect the comparison and are not propagated; hypot is significantly better than the naive abs only for very small or very large numbers, and hence irrelevant when comparing against 4.

Hence, I'd say that all benchmark programs should use the fastest out of (abs2, abs, a user-defined inline sum that avoids function-call overhead).

In some sense, however, I think that we should not submit code for other languages at all; a fair apples-to-apples comparison would be against the best of e.g. http://benchmarksgame.alioth.debian.org/ (if we happen to have faster code for other languages, we should submit it there!), while taking great care to also use (and document!) the compiler flags and possibly the versions optimized by other people.

That is, compare code written by people intimately familiar with the language and compiler, without requiring readers to believe that Julia devs speak other languages as well.

The standard complaint against the Julia microbenchmarks is "sure, best possible Julia code vs stupid-naive C code, not fair, real C coders would of course do $(complicated trick that needs specific icc version)". This cannot be fixed by changing abs to abs2.

Even better, of course, would be to get included in the relatively standardized benchmarks game, so people can quickly compare $lang vs Julia without trusting presumably biased Julia devs.

mauro3 commented 7 years ago

@KristofferC yes, I messed up in https://github.com/JuliaLang/julia/issues/24097#issuecomment-336366616 (it wasn't abs2(z) > 2, but I don't have the code anymore). I don't see any difference between abs and abs2 either, using this code: https://gist.github.com/mauro3/a38f904ab8bda6dc5691040c84ae4e79

Well, this makes the case for abs2 stronger.

StefanKarpinski commented 7 years ago

@foobarlv, we already have implementations of the benchmarks game codes, and Julia's competitiveness with fast, compiled languages is borne out there. I encourage you to petition the benchmarks game maintainers to add Julia to the set of languages included. These benchmarks serve a different purpose than those: they compare the same algorithms across a set of language implementations. Some people don't care for this, but it has already been discussed ad nauseam.

Gnimuc commented 7 years ago

@johnfgibson just want to confirm: in this line https://github.com/JuliaLang/julia/blob/c4558e313e91031456206d947e2315268eceb952/test/perf/micro/perf.m#L110

the ; is missing, so Matlab will print every intermediate result, which causes a huge performance regression. Is this intended (to mimic sloppy coding) or just a typo?

johnfgibson commented 7 years ago

@Gnimuc Yes, that does look like an error, and it would explain why the hit for abs2 on Matlab was so bad. I'll rerun the tests today. Thanks for spotting this!

johnfgibson commented 7 years ago

Yes, that improves Matlab's mandel performance by a factor of 300. Thanks very much for spotting this. I knew something was wrong here but didn't think hard enough about it. I'll get the corrected results to julialang.org a.s.a.p., then a patch to julia-0.7.0-DEV.

Gnimuc commented 7 years ago

A little bit off-topic: is it still necessary to show Octave in the figure? It stretches the Y-axis a lot 😛. I guess we would get a better zoomed-in figure without showing it.

jebej commented 7 years ago

Also, the javascript print_to_file test has a time of 0 sec, which is suspicious.

johnfgibson commented 7 years ago

The 0.0 for javascript print_to_file is actually missing data, since JavaScript is sandboxed and can't write to disk (at least as far as I understand). Maybe someone with Perl chops could modify line 78 in test/perf/micro/bin/table.pl

printf qq[<td class="data">%.2f</td>], $_{$benchmark}{$system}/$_{$benchmark}{'c'};

so that it prints "--" in the table instead of 0.0 when the value of the expression is zero. Until then I'll try to remember to make this change by editing the html file.

@Gnimuc I like the idea of dropping Octave in order to zoom in better. I'll do it if Stefan and/or Viral approve.

StefanKarpinski commented 7 years ago

You could also port the Perl script to Julia if you want. I don't recall why I wrote it in Perl, aside from the fact that it's very old and I had more experience with writing this kind of code in Perl back then.
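
For whoever picks that up, the relevant bit is tiny; a rough sketch of what the formatting logic could look like in Julia (function and variable names are invented here, they are not the actual table.pl names):

```julia
using Printf  # for @sprintf

# Build one HTML table cell: print "--" when the benchmark time is missing
# (encoded as zero), otherwise the time normalized by the C time, as in table.pl.
function cell(time, c_time)
    ratio = time / c_time
    if ratio == 0
        return "<td class=\"data\">--</td>"
    else
        return @sprintf("<td class=\"data\">%.2f</td>", ratio)
    end
end

cell(0.0, 0.074)    # "<td class=\"data\">--</td>"
cell(0.164, 0.074)  # "<td class=\"data\">2.22</td>"
```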

ViralBShah commented 7 years ago

I feel that it's useful to have all the languages - these benchmarks are being widely improved and referenced by everyone. Perhaps with some javascript goodness, we can make the graph zoomable.

StefanKarpinski commented 7 years ago

I feel that it's useful to have all the languages

There's a Perl script that processes the benchmark output that could be rewritten in Julia; it is not useful to have it in Perl. We do not implement the benchmarks in Perl (at the moment).

mauro3 commented 7 years ago

I think @ViralBShah was referring to the comment about dropping Octave (https://github.com/JuliaLang/julia/issues/24097#issuecomment-339998571) but recommended keeping it.