Closed crisluengo closed 6 years ago
When posting this, I wasn't aware that this code migrated from a different repository. I looked through the issues in JuliaLang, where this topic has been raised before.
I sort of agree with the sentiment there, that it is the interpreter/JIT that needs to be tested here, not the underlying C code. The `printfd` suggestion is vectorization in a sense, feel free to ignore that if you wish. But the `parseintperf` implementation does need to be fixed. Especially the `assert` call within the loop is unfair! :)
@crisluengo: Thanks for this. I looked through the benchmark codes and found the `m==n` assertion outside the loop in several languages. I'll bring all of them inside the loop in my next PR.
On the `printfd` benchmark, the issue is really to test many repeated calls to `fprintf`. So we'll leave it as is.
BTW, @ScottPJones has commented elsewhere that the parseint benchmark measures random number generation as much as it does integer parsing. To keep this benchmark focused on integer parsing I think it would be a good idea to substitute a quick-and-dirty one-line linear congruential random number generation in place of the call to the language's random number generator.
I tried this in C and Julia. In C it makes almost no difference. In Julia it dropped the execution time by 1/3.
What do you all think? Shall I substitute the linear congruence one-liner in all the codes and submit a PR?
Octave parse_integers gets a factor-of-4 speedup from switching to `n = fix(rand*(2^32))`, as suggested by @crisluengo.
I was wrong about the simpler random number generation dropping the Julia parse_integers time by 1/3. The difference is actually insignificant. I'll push that Octave fix and then close this issue.
```julia
function parseint1(t)
    local m, n
    for i = 1:t
        n = rand(UInt32)
        s = string(n, base = 16)
        m = UInt32(parse(Int64, s, base = 16))
        @assert m == n
    end
    return n
end
```
```julia
function parseint2(t)
    a::UInt64 = 1664525     # minimum standard linear congruential rand
    p::UInt64 = 2147483647  # Park & Miller Comm ACM v31 n10 1988, p = 2^31-1
    n::UInt32 = 17
    local m::UInt32
    for i = 1:t
        n = UInt32(a*n % p)
        s = string(n, base = 16)
        m = UInt32(parse(Int64, s, base = 16))
        @assert m == n
    end
    return n
end
```
With julia-0.7.0-beta.152:

```
julia> @btime parseint1(10^5)
  23.118 ms (299999 allocations: 10.68 MiB)
0x248be6fe

julia> @btime parseint2(10^5)
  21.718 ms (200000 allocations: 9.16 MiB)
0x0ee7bf52
```
In that case, which should be the "de facto" choice of PRNG among benchmarks? Will any cheap, readily available algorithm be considered acceptable?
I think it's fair to make simple changes to keep the overhead of random number generation low enough that the benchmark primarily measures integer parsing. But working hard to shave off every possible CPU cycle, no.
For Octave, that was a one-line fix that removed a factor-of-four random-number overhead. Significant and fair.
For Julia, inlining a linear congruential pseudo-rand cut about 5% off the execution time. That's insignificant, given that C beats Julia by a factor of 2 on this benchmark (for Julia 0.7.0). So I left it as is: the Julia parse_integer calls the library function `rand(UInt32)`.
@johnfgibson, sorry for not replying earlier. Thanks for looking into this.
However, the more I think about this benchmark, the more I think it's impossible to do it right. Right now the idea is really to test the interpreter, rather than the computational speed. Paraphrasing an argument I read on another thread: "what is interesting about comparing the underlying C code?" That is fine, but:
The current "matrix_multiply" tests the underlying BLAS and PRNG implementations used, not the interpreter, and consequently doesn't match the benchmark philosophy.
The current benchmark gives the wrong impression, no matter how you introduce the results, because people only look at the graph. This leads to surprises like that expressed here: https://stackoverflow.com/q/51181392/7328782
These tests use some tools in a fairly awkward manner to be able to compare what they want to compare. To me these results sound more like "this hammer is not as good at screwing in screws as this screwdriver".
For example, the `fprintf` test: the name of the MATLAB function is the same as that of the C function, but they are actually different functions. The MATLAB `fprintf` is designed to dump a big load of data to a file, not just a single number. You never call `fprintf` many times in a row, you just call it once; there is no need to optimize its call overhead. And this test is presented as "print_to_file", but that is not how you print to a file in MATLAB...
Anyway, I don't know how to do this better, I just feel that this benchmark doesn't produce useful results. It's easy to be a critic, I know. :)
Honestly, we could use anything -- maybe a simple linear congruential generator (or PCG)? The whole point was just to defeat any sharing of work that a compiler could do to make parsing go faster across iterations. I got burned once upon a time by PyPy specializing `"%d %d\n" % (i, i)` by only decoding the integer digits to decimal once, which is most of the work.
> I got burned once upon a time by PyPy specializing `"%d %d\n" % (i, i)` by only decoding the integer digits to decimal once, which is most of the work.
I would guess that `fib` has a similar concern: if the program does not properly isolate the calls to this function, the compiler could either move `fib(20)` out of the loop or remove it entirely. Some languages can avoid this with compiler hints (e.g. Rust's unstable `black_box`) or volatile memory access (as in C), and the PRNG could probably even be avoided entirely in `parse_int`.
In the Rust program, I used the Mersenne Twister algorithm because the C program was using it as well. It would actually be simpler to use an XOR-shift implementation, which is readily available in the `rand` crate.
Nevertheless, I agree that the focus should be on mitigating cases of notable unfairness, such as the ones seen here in the Matlab code, while giving observers of these benchmarks a clear idea of how much it takes for each language to attain the presented measurements.
@crisluengo I agree with your points 1 and 2, but not so much 3. The microbenchmarks are intentionally not the optimal algorithms in any language for the given computation -- nobody in their right mind computes `fib` recursively. Rather, they're primarily meant to measure the efficiency of hot loops / deep recursion in native user-level code, one of Julia's great strengths. On Matlab `fprintf`, one perspective is that the accepted style of caching tons of data in a vector and then dumping it to disk is an awkward workaround necessitated by the fact that Matlab can't print small chunks of data efficiently. If you're accustomed to Matlab it might seem natural, but if you're coming from C you might think "what a pain."
On 1, I agree that the LAPACK- and BLAS-based benchmarks are a bit of an odd fit. I've been thinking it would be good to split those off into a "macrobenchmarks" set and add a bunch of higher-level numerical benchmarks that utilize libraries, like Ax=b problems, eigenvalue problems, integrating ODEs and PDEs, finding roots of nonlinear equations, etc. These would explicitly test "how fast are the numerical libraries?" and "how fast are the numerical libraries when they call user-level functions?", as compared to the pure native tight loops of the microbenchmarks.
On 2, what can you do? www.julialang.org/benchmarks is as clear as it can be. It might help to add brief descriptions of each algorithm below the table, in order to be perfectly clear about what is being tested in each benchmark, e.g. that `fib` is recursive, that `print_to_file` tests a long loop of small `fprintf` calls, etc.
@johnfgibson "what can you do?" This is a good question -- as I said, it's easy to be a critic and point out issues, solving those issues is a lot harder.
One possible solution could be to not use names such as `print_to_file` in the benchmark results table, but to write a short description such as "write two integers to file" in that space. You could add a column to the right that contains the name of the function that implements the test (BTW these currently don't always seem to match the name given in the table).
I wanted to add that timing differences such as in the `matrix_statistics` tests are the most interesting to me. Here the difference between Julia and C is (at least partially) due to the ability of C to easily re-use allocated memory; the Julia code needs to allocate new arrays within the loop.
Also interesting would be to include in these benchmark results the amount of code needed to write each of the tests -- another huge benefit of using Julia over C.
> Here the difference between Julia and C is (at least partially) due to the ability of C to easily re-use allocated memory; the Julia code needs to allocate new arrays within the loop.
Julia can more easily re-use the allocated memory; the code just isn't as nice if you do that. It could be re-written to use a bunch of `mul!` and in-place broadcasting (and any package code would do this), but there's no reason to over-optimize the Julia code.
> One possible solution could be to not use names such as `print_to_file` in the benchmark results table, but write a short description such as "write two integers to file" in that space. You could add a column to the right that contains the name of the function that implements the tests (BTW these currently don't always seem to match the name given in the table).
Back in November or so we made some name changes to be clearer about the nature of each test, e.g. `fibonacci` -> `recursion_fibonacci`, to say we're testing recursion, not the fastest way to compute `fib(20)`; `pi_sum` -> `iteration_pi_sum`; etc. I feel like those names are already as long as they can be, given that they also need to appear in the legend of the benchmark plot on www.julialang.org. My feeling is that a brief description of each benchmark, and perhaps the Julia code, following the benchmark table will provide a good way to clear up any confusion about the nature of each benchmark.
Another option would be to pregenerate the random values using a fixed PRNG outside of the timed part, so that the massive slowdown from writing a custom PRNG in the slow languages doesn't end up dominating the benchmark.
The MATLAB/Octave benchmark does not provide a fair comparison, as the code is not written in a way that an experienced user would write it.
I'm using Octave 4.0 here; I'll try this tonight at home on MATLAB as well. I looked at the two tests that have the largest times:

`parseintperf`

Running with `t = 10000` and the following changes, I see these times:

The cast to `uint32` is not something you'd normally do when working in MATLAB. In Octave, `randi` uses `rand`, and does a lot of useless checking of input arguments. It's possible that under MATLAB the difference is not as large. Nonetheless, using `randi` to generate an integer starting at 0 is not really fair: it generates an index, which starts at 1 in MATLAB. Forcing it to start at 0 is not a normal use case, and would normally be implemented simply by scaling the output of `rand` instead (which is what Octave's implementation of `randi` does, even if it's not a totally correct way to generate a uniformly-distributed integer).

I also noticed that the Julia version of this code does an `assert` only once, after the loop, not inside the loop!

`printfd`

Running `printfd(100000)` takes 1.19784 seconds. But this function is very much not how an experienced MATLAB user would write it. Instead we'd do:

Now the same function call takes 0.109623 seconds. Note that the output here is identical, as `fprintf` repeats the template for each element in the input arguments.

Maybe you're intending to time function calls here, not the actual I/O?