JuliaLang / Microbenchmarks

Microbenchmarks comparing the Julia Programming language with other languages
https://julialang.org/benchmarks/

MATLAB/Octave benchmark #4

Closed: crisluengo closed this issue 6 years ago

crisluengo commented 6 years ago

The MATLAB/Octave benchmark does not provide a fair comparison, as the code is not written in a way that an experienced user would write it.

I'm using Octave 4.0 here; I'll try this tonight at home on MATLAB as well. I looked at the two tests that have the largest times:

parseintperf

Running with t = 10000 and the following changes, I see these times:

function parseintperf(t)
for i = 1:t
   %n = randi([0,2^32-1],1,'uint32'); % 2.34652 seconds.
   %n = randi([0,2^32-1]);            % 1.64658 seconds.
   n = fix(rand*(2^32));              % 0.864399 seconds.
   s = sprintf('%08x',n);
   m = sscanf(s,'%x');
   assert(m == n);
end

The cast to uint32 is not something you'd normally do when working in MATLAB.

In Octave, randi uses rand and does a lot of useless checking of input arguments. It's possible that the difference is not as large under MATLAB. Nonetheless, using randi to generate an integer starting at 0 is not really fair: randi generates an index, and indices start at 1 in MATLAB. Forcing it to start at 0 is not a normal use case, and would normally be implemented simply by scaling the output of rand instead (which is what Octave's implementation of randi does anyway, even if it's not a totally correct way to generate a uniformly distributed integer).

I also noticed that the Julia version of this code does an assert only once, after the loop, not inside the loop!

printfd

Running printfd(100000) takes 1.19784 seconds. But this function is not at all how an experienced MATLAB user would write it. Instead we'd do:

function printfd(n)
f = fopen('/dev/null','w');
i = 1:n;
fprintf(f, '%d %d\n', i, i + 1);
fclose(f);

Now the same function call takes 0.109623 seconds. Note that the output here is identical, as fprintf repeats the template for each element in the input arguments.

Maybe you're intending to time function calls here, not the actual I/O?

crisluengo commented 6 years ago

When posting this, I wasn't aware that this code migrated from a different repository. I looked through the issues in JuliaLang, where this topic has been raised before.

I sort of agree with the sentiment there that it is the interpreter/JIT that needs to be tested here, not the underlying C code. The printfd suggestion is vectorization in a sense, so feel free to ignore that if you wish. But the parseintperf implementation does need to be fixed. In particular, having the assert inside the loop here while the Julia version does it only once, after the loop, is unfair! :)

johnfgibson commented 6 years ago

@crisluengo: Thanks for this. I looked through the benchmark codes and found the m==n assertion outside the loop in several languages. I'll bring all of them inside the loop in my next PR.

On the printfd benchmark, the issue is really to test many repeated calls to fprintf. So we'll leave it as is.

BTW, @ScottPJones has commented elsewhere that the parseint benchmark measures random number generation as much as it does integer parsing. To keep this benchmark focused on integer parsing, I think it would be a good idea to substitute a quick-and-dirty one-line linear congruential random number generator for the call to the language's random number generator.

I tried this in C and Julia. In C it makes almost no difference. In Julia it dropped the execution time by 1/3.

What do you all think? Shall I substitute the linear congruence one-liner in all the codes and submit a PR?

johnfgibson commented 6 years ago

Octave parse_integers gets a factor of 4 speedup from switching to n = fix(rand*(2^32)) as suggested by @crisluengo.

I was wrong about the simpler random number generation dropping the Julia parse_integers time by 1/3. The difference is actually insignificant. I'll push that Octave fix and then close this issue.

function parseint1(t)
    local m,n
    for i=1:t
        n = rand(UInt32)
        s = string(n, base = 16)
        m = UInt32(parse(Int64,s, base = 16))
        @assert m == n
    end
    return n
end

function parseint2(t)
    a::UInt64 = 1664525     # quick-and-dirty linear congruential multiplier
    p::UInt64 = 2147483647  # Park & Miller Comm ACM v31 n10 1988, p = 2^31-1
    n::UInt32 = 17          
    local m::UInt32 

    for i=1:t
        n = UInt32(a*n %  p)
        s = string(n, base = 16)
        m = UInt32(parse(Int64,s, base = 16))
        @assert m == n
    end
    return n
end

With julia-0.7.0-beta.152 (timings via BenchmarkTools.jl's @btime):

julia> @btime parseint1(10^5)
  23.118 ms (299999 allocations: 10.68 MiB)
0x248be6fe

julia> @btime parseint2(10^5)
  21.718 ms (200000 allocations: 9.16 MiB)
0x0ee7bf52

Enet4 commented 6 years ago

In that case, which should be the "de facto" choice of PRNG among benchmarks? Will any cheap, readily available algorithm be considered acceptable?

johnfgibson commented 6 years ago

I think it's fair to make simple changes that keep the overhead of random number generation low enough that the benchmark primarily measures integer parsing. But working hard to shave off every possible CPU cycle? No.

For Octave, that was a one-line fix that removed a factor-of-four random-number overhead. Significant and fair.

For Julia, inlining a linear congruential pseudo-rand cut about 5% off execution time. That's insignificant, given that C beats Julia by a factor of 2 on this benchmark (for julia 0.7.0). So I left it as is: the Julia parse_integer calls the library function rand(UInt32).

crisluengo commented 6 years ago

@johnfgibson, sorry for not replying earlier. Thanks for looking into this.

However, the more I think about this benchmark, the more I think it's impossible to do it right. Right now the idea is really to test the interpreter, rather than the computational speed. Paraphrasing an argument I read on another thread: "what is interesting about comparing the underlying C code?" That is fine, but:

  1. The current "matrix_multiply" tests the underlying BLAS and PRNG implementations used, not the interpreter, and consequently doesn't match the benchmark philosophy.

  2. The current benchmark gives the wrong impression, no matter how you introduce the results, because people only look at the graph. This leads to surprises like that expressed here: https://stackoverflow.com/q/51181392/7328782

  3. These tests use some tools in a fairly awkward manner to be able to compare what they want to compare. To me these results sound more like "this hammer is not as good at screwing in screws as this screwdriver".

    For example, the fprintf test: the name of the MATLAB function is the same as that of the C function, but they are actually different functions. The MATLAB fprintf is designed to dump a big load of data to file, not just a single number. You never call fprintf many times in a row; you call it once. There is no need to optimize its call overhead. And this test is presented as "print_to_file", but that is not how you print to file in MATLAB...

Anyway, I don't know how to do this better, I just feel that this benchmark doesn't produce useful results. It's easy to be a critic, I know. :)

StefanKarpinski commented 6 years ago

Honestly, we could use anything—maybe a simple linear congruential generator (or PCG)? The whole point was just to defeat any sharing of work that a compiler could do to make parsing go faster across iterations. I got burned once upon a time by PyPy specializing "%d %d\n" % (i, i) by only decoding the integer digits to decimal once, which is most of the work.

Enet4 commented 6 years ago

I got burned once upon a time by PyPy specializing "%d %d\n" % (i, i) by only decoding the integer digits to decimal once, which is most of the work.

I would guess that fib has a similar concern: if the program does not properly isolate the calls to this function, the compiler could either move fib(20) out of the loop or remove it entirely. Some languages can avoid this with compiler hints (e.g. Rust's unstable black_box) or volatile memory access (as in C), and the PRNG could probably even be entirely avoided in parse_int.

In the Rust program, I used the Mersenne Twister algorithm because the C program was using it as well. It would actually be simpler to use an xorshift implementation, which is readily available in the rand crate.
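
For reference, a comparably cheap generator written directly in Julia would be Marsaglia's xorshift32. This is only an illustrative sketch (the function name and the idea of dropping it into the benchmark are my own, not code from the suite):

function xorshift32(x::UInt32)
    # Marsaglia's xorshift32: three shift-xor steps, full period 2^32 - 1 for any nonzero seed
    x ⊻= x << 13
    x ⊻= x >> 17
    x ⊻= x << 5
    return x
end

In parseint2 above, the line n = UInt32(a*n % p) could be swapped for n = xorshift32(n) (starting from any nonzero seed) with the same goal: cheap, deterministic inputs that the compiler cannot share across loop iterations.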

Nevertheless, I agree that the focus should be on mitigating cases of notable unfairness, such as the ones seen here in the Matlab code, while giving observers of these benchmarks a clear idea of how much effort it takes in each language to attain the presented measurements.

johnfgibson commented 6 years ago

@crisluengo I agree with your points 1 and 2 but not so much 3. The microbenchmarks are intentionally not the optimal algorithms in any language for the given computation -- nobody in their right mind computes fib recursively. Rather, they're primarily meant to measure the efficiency of hot loops / deep recursion in native user-level code, one of Julia's great strengths.

On Matlab fprintf, one perspective is that the accepted style of caching tons of data in a vector and then dumping it to disk is an awkward workaround necessitated by the fact that Matlab can't print small chunks of data efficiently. If you're accustomed to Matlab it might seem natural, but if you're coming from C you might think "what a pain."

On 1, I agree that the LAPACK and BLAS based benchmarks are a bit of an odd fit. I've been thinking it would be good to split those off into a "macrobenchmarks" set and add a bunch of higher-level numerical benchmarks that utilize libraries, like Ax=b problems, eigenvalue problems, integrating ODEs and PDEs, finding roots of nonlinear equations, etc. These would explicitly test "how fast are the numerical libraries?" and "how fast are the numerical libraries when they call user-level functions?", as compared to the pure native tight loops of the microbenchmarks.

On 2, what can you do? www.julialang.org/benchmarks is as clear as it can be. It might help to add brief descriptions of each algorithm below the table, in order to be perfectly clear about what is being tested in each benchmark. E.g. that fib is recursive, that print_to_file tests a long loop of small fprintf calls, etc.

crisluengo commented 6 years ago

@johnfgibson "what can you do?" This is a good question -- as I said, it's easy to be a critic and point out issues, solving those issues is a lot harder.

One possible solution could be to not use names such as print_to_file in the benchmark results table, but write a short description such as "write two integers to file" in that space. You could add a column to the right that contains the name of the function that implements the tests (BTW these currently don't always seem to match the name given in the table).

I wanted to add that timing differences such as the one in the matrix_statistics test are the most interesting to me. Here the difference between Julia and C comes (at least partially) from C's ability to easily re-use allocated memory; the Julia code needs to allocate new arrays within the loop.

Also interesting would be to include in these benchmark results the amount of code needed to write each of the tests -- another huge benefit of using Julia over C.

ChrisRackauckas commented 6 years ago

Here the difference between Julia and C comes (at least partially) from C's ability to easily re-use allocated memory; the Julia code needs to allocate new arrays within the loop.

Julia can easily re-use the allocated memory too; the code just isn't as nice if you do that. It could be re-written to use a bunch of mul! calls and in-place broadcasting (and any package code would do this), but there's no reason to over-optimize the Julia code.
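
For concreteness, a minimal sketch of that pattern (purely illustrative: the function, buffer names, and loop body are made up and are not the actual matrix_statistics code). The buffers are allocated once and then overwritten in place on every iteration:

using LinearAlgebra, Random

function preallocated_loop(t, n)
    A  = Matrix{Float64}(undef, n, n)   # reused input buffer
    B  = Matrix{Float64}(undef, n, n)   # reused input buffer
    AB = Matrix{Float64}(undef, n, n)   # reused output buffer
    s = 0.0
    for i = 1:t
        rand!(A)                        # refill A in place instead of allocating rand(n, n)
        rand!(B)
        mul!(AB, A, B)                  # AB = A*B without allocating a new matrix
        AB .= AB .+ 1.0                 # in-place broadcast, no temporary array
        s += tr(AB)
    end
    return s
end

Written this way, the loop allocates three matrices in total instead of several per iteration, which is essentially what the equivalent C code does with its reused buffers.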

johnfgibson commented 6 years ago

One possible solution could be to not use names such as print_to_file in the benchmark results table, but write a short description such as "write two integers to file" in that space. You could add a column to the right that contains the name of the function that implements the tests (BTW these currently don't always seem to match the name given in the table).

Back in November or so we made some name changes to be clearer about the nature of each test, e.g. fibonacci -> recursion_fibonacci, to say we're testing recursion, not the fastest way to compute fib(20), pi_sum -> iteration_pi_sum, etc. I feel like those names are already as long as they can be, given that they also need to appear in the legend of the benchmark plot on www.julialang.org. My feeling is that a brief description of each benchmark and perhaps the Julia code following the benchmark table will provide a good way to clear up any confusion about the nature of each benchmark.

StefanKarpinski commented 6 years ago

Another option would be to pregenerate the random values using a fixed PRNG outside of the timing part so that the massive slowdown from writing a custom PRNG in the slow languages doesn't end up dominating the benchmark.
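
A sketch of that idea in Julia (illustrative only; the function names and the fixed seed are made up, and this is not a proposed patch). The inputs are generated once, outside the timed region, and only the parse round trip is timed:

using Random

function make_inputs(t)
    Random.seed!(1234)        # fixed seed for reproducible inputs
    return rand(UInt32, t)
end

function parseint_pregen(ns::Vector{UInt32})
    local m
    for n in ns
        s = string(n, base = 16)
        m = UInt32(parse(Int64, s, base = 16))
        @assert m == n
    end
    return m
end

Only parseint_pregen would sit inside the timed region; make_inputs would run beforehand, so a language with a slow built-in (or hand-written) random number generator isn't penalized for it.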