johnnychen94 opened this issue 2 years ago
We just need someone willing and able to get all the programs running on a single benchmark system.
Could this instead be set up using GitHub actions for continuous benchmarking?
See https://labs.quansight.org/blog/2021/08/github-actions-benchmarks/ for a discussion. It might be a good option, as performance on a GitHub Actions VM is pretty consistent.
If you run all benchmarks with a single action, then you would guarantee the same VM each time, and could measure the performance ratios.
Plus, you could also see the performance comparison across a variety of versions of each language.
(This would exclude proprietary languages such as mathematica)
A bunch of the benchmarked programs are proprietary, such as MATLAB and Mathematica.
It's only those two right? I think it's very reasonable to exclude proprietary software in a reproducible benchmark. e.g., https://benchmarksgame-team.pages.debian.net/benchmarksgame/index.html excludes any proprietary languages. Yes you would lose a couple of data points, but you would have an always up-to-date benchmark, which I see as significantly more important.
In my opinion, the most important comparisons are against C, Rust, and Fortran (+ maybe numpy), since users of those packages are the ones who would look up speed comparisons - not so much Mathematica users. As long as those are included, we are good.
As an alternative option, it seems there are some free versions of MATLAB and Mathematica which are available for GitHub Actions:
the most important comparisons are against C, Rust, and Fortran
I disagree. First, it's going to be boring: any language that doesn't get in your way will be pretty fast. Second, the point of Julia is to be good at two things: easier to use than C and faster to run than Python. That's a point the benchmarks have to make, which between their source code and raw numbers, they do.
easier to use than C and faster to run than Python.
I think that's what benchmarks against fast languages show, no?
Anyways, this is a second order effect. The more important point is updating out-of-date benchmarks. Basically I am saying that I don't think mathematica/matlab should be roadblocks to getting updated results against C/Rust/Fortran/Python.
See #51 which drafts a GitHub workflow for running the suite
My feeling is that @MilesCranmer is right: having up-to-date benchmarks against just the open languages is better than having them held up, going back multiple Julia versions, in order to include a couple of proprietary languages.
I really enjoyed doing the benchmarks back for Julia 1.0, but I haven't been able to keep it up, due to the investment of time to update each language environment (many with their own peculiar set-up and build system), and also COVID, which has kept me working at home without access to campus-locked proprietary licenses. I was hoping to return this past fall, but Delta and Omicron have kept me from that.
So I'm supportive of your effort to do this via GitHub workflow.
Thanks @johnfgibson.
So, with the loss of my sanity, I finally got the workflow running in #51 - it correctly generates the various CSV files. This was in spite of most languages being easy to set up, since there are already GitHub Actions available which stack up on the same VM. I therefore greatly empathize with @johnfgibson, knowing that you had to set these up manually each time...
The workflow runs for the following languages:
The following benchmarks are not part of the current workflow, for the reasons given below:
I think these excluded languages are lower priority, so I would vote for simply displaying the up-to-date benchmarks with the other languages. Then if/when the broken ones are fixed we can turn them back on. Thoughts?
Here are the actual updated benchmarks, copied from the workflow's output. Could these automatically update the website after #51 is merged?
Seems like parse_integers got a massive improvement compared to the currently displayed results, which is awesome. matrix_multiply also seems to have put Julia clearly in the lead now:
c,iteration_pi_sum,8.028984
c,matrix_multiply,43.012142
c,matrix_statistics,5.007982
c,parse_integers,0.19634
c,print_to_file,20.508051
c,recursion_fibonacci,0.025188
c,recursion_quicksort,0.422955
c,userfunc_mandelbrot,0.08167
fortran,iteration_pi_sum,8.028663
fortran,matrix_multiply,57.163952
fortran,matrix_statistics,8.046266
fortran,parse_integers,0.753935
fortran,print_to_file,113.916496
fortran,recursion_fibonacci,4.4e-5
fortran,recursion_quicksort,0.483927
fortran,userfunc_mandelbrot,7.8e-5
java,iteration_pi_sum,16.370829
java,iteration_sinc_sum,0.049201
java,matrix_multiply,788.083768
java,matrix_statistics,30.276736
java,parse_integers,0.274402
java,print_to_file,99.797282
java,recursion_fibonacci,0.0424
java,recursion_quicksort,1.006608
java,userfunc_mandelbrot,0.136501
javascript,iteration_pi_sum,10.5
javascript,matrix_multiply,2900.0
javascript,matrix_statistics,46.9
javascript,parse_integers,0.64
javascript,print_to_file,118.0
javascript,recursion_fibonacci,0.109
javascript,recursion_quicksort,1.61
javascript,userfunc_mandelbrot,0.149
julia,iteration_pi_sum,8.028063
julia,matrix_multiply,33.387676
julia,matrix_statistics,8.219065
julia,parse_integers,0.137201
julia,print_to_file,18.368588
julia,recursion_fibonacci,0.0482
julia,recursion_quicksort,0.469904
julia,userfunc_mandelbrot,0.0796
python,iteration_pi_sum,630.6591033935547
python,matrix_multiply,49.559593200683594
python,matrix_statistics,51.499128341674805
python,parse_integers,1.6732215881347656
python,print_to_file,54.22806739807129
python,recursion_fibonacci,2.522706985473633
python,recursion_quicksort,11.09170913696289
python,userfunc_mandelbrot,6.908893585205078
r,iteration_pi_sum,236.0
r,matrix_multiply,116.0
r,matrix_statistics,78.0
r,parse_integers,4.0
r,print_to_file,1325.0
r,recursion_fibonacci,10.0
r,recursion_quicksort,22.0
r,userfunc_mandelbrot,20.0
rust,iteration_pi_sum,8.029562
rust,matrix_multiply,46.196557
rust,matrix_statistics,6.52925
rust,parse_integers,0.186271
rust,print_to_file,11.194186
rust,recursion_fibonacci,0.046293
rust,recursion_quicksort,0.428904
rust,userfunc_mandelbrot,0.080522
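For anyone who wants to check claims like the matrix_multiply one against these raw numbers, here is a minimal Julia sketch (assuming the data above is saved as benchmarks.csv with no header row) that normalizes every timing by the corresponding C time:

using CSV, DataFrames

# Read the raw timings pasted above (assumed saved as "benchmarks.csv", no header row).
df = CSV.read("benchmarks.csv", DataFrame; header = ["language", "benchmark", "time"])

# Join each row against the C baseline for the same benchmark and compute the ratio.
c = select(df[df.language .== "c", :], :benchmark, :time => :ctime)
df = innerjoin(df, c, on = :benchmark)
df.ratio = df.time ./ df.ctime

# Example: every language's matrix_multiply time relative to C, fastest first.
show(sort(df[df.benchmark .== "matrix_multiply", [:language, :ratio]], :ratio))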
How do we update the benchmarks webpage?
@MilesCranmer, thanks for the amazing work on the CI!
Lua, Go (install fine, but the benchmarks are out-of-date with current syntax)
I have taken @sbinet's #27 and added a commit to enable the go benchmark in #55
I'll try to work on getting Lua working if I get some time later. Edit: got Lua working as well.
Also, do you know of a way to get the system hardware specifications from the CI machine? While it doesn't necessarily matter for the comparison itself, knowing the actual hardware might help interpret the numbers between CI runs.
Nice work!
I don't know how to get the hardware specs for a particular workflow. According to the docs, Linux runners always use a 2-core CPU, 7 GB of RAM, and 14 GB of SSD disk space (in a virtual machine), but they don't specify whether the CPU model changes.
According to the article here - https://labs.quansight.org/blog/2021/08/github-actions-benchmarks/ - the times are noisy, so timings should only be interpreted relative to C rather than as absolute times.
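One low-effort option (not something the current workflow does, just a sketch) would be to have the job print the runner's hardware from Julia itself, so each CSV can be matched to the machine that produced it:

using InteractiveUtils  # provides versioninfo

versioninfo(verbose = true)  # Julia/OS/CPU summary for the runner
println("CPU model:   ", Sys.cpu_info()[1].model)
println("CPU threads: ", length(Sys.cpu_info()))
println("Total RAM:   ", round(Sys.total_memory() / 2^30; digits = 1), " GiB")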
I see, no worries. Thanks for digging that info up.
How do we update the benchmarks webpage?
The code for the benchmark webpage is located here: https://github.com/JuliaLang/www.julialang.org/blob/main/benchmarks.md
The code used to create the graph is located here, and other assets are located here: https://github.com/JuliaLang/www.julialang.org/tree/main/_assets/benchmarks
I went ahead and updated the plotting code to work with newer package versions and Julia v1.7.2; see https://github.com/JuliaLang/www.julialang.org/pull/1648
And using the benchmark data output from the CI plotted the following graph:
A couple of notes: the data used was from the following CI run (#57): https://github.com/JuliaLang/Microbenchmarks/runs/5531819551?check_suite_focus=true
For some Fortran benchmarks (see #58), the values were interpolated: the old Fortran/C ratio was multiplied by the newer C time in the CI benchmarks.csv file. The old ratio was computed from the data (located here) used to create the current plot on the benchmarks webpage.
Similarly, the Matlab/Mathematica values were interpolated based on their ratios in the same older benchmark data. I decided to exclude Octave, since including it would just expand the chart and make it a bit harder to read.
Since Go and Lua are not included in the CI at the moment, the Go values were interpolated based on the CI run in #55.
Edit: Here is the actual interpolated CSV file I used to plot with: interp_benchmarks.csv
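To make the interpolation described above concrete, here is a rough Julia sketch of the ratio-based fill-in (the file names old_benchmarks.csv and benchmarks.csv are placeholders for the old website data and the new CI output, and the language list is just illustrative):

using CSV, DataFrames

cols = ["language", "benchmark", "time"]
old = CSV.read("old_benchmarks.csv", DataFrame; header = cols)  # data behind the current website plot
ci  = CSV.read("benchmarks.csv", DataFrame; header = cols)      # fresh CI output

# Ratio of each old timing to the old C timing for the same benchmark.
old_c = select(old[old.language .== "c", :], :benchmark, :time => :old_ctime)
old = innerjoin(old, old_c, on = :benchmark)
old.ratio = old.time ./ old.old_ctime

# Interpolated timing = old language/C ratio * new C timing from the CI run.
ci_c = select(ci[ci.language .== "c", :], :benchmark, :time => :ci_ctime)
interp = innerjoin(select(old, :language, :benchmark, :ratio), ci_c, on = :benchmark)
interp.time = interp.ratio .* interp.ci_ctime

# Keep only the languages missing from the CI run and append them before plotting.
absent = ["matlab", "mathematica"]
interp = interp[in.(interp.language, Ref(absent)), [:language, :benchmark, :time]]
CSV.write("interp_benchmarks.csv", vcat(ci, interp); writeheader = false)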
Here is some code to plot the benchmarks with PlotlyJS instead of Gadfly. It allows for interactivity, such as automatic sorting of languages based on the selected benchmarks.
# Producing the Julia Microbenchmarks plot
using CSV
using DataFrames
using PlotlyJS
using StatsBase

benchmarks =
    CSV.read("interp_benchmarks.csv", DataFrame; header = ["language", "benchmark", "time"])

# Capitalize and decorate language names from datafile
dict = Dict(
    "c" => "C",
    "fortran" => "Fortran",
    "go" => "Go",
    "java" => "Java",
    "javascript" => "JavaScript",
    "julia" => "Julia",
    "lua" => "LuaJIT",
    "mathematica" => "Mathematica",
    "matlab" => "Matlab",
    "octave" => "Octave",
    "python" => "Python",
    "r" => "R",
    "rust" => "Rust",
);
benchmarks[!, :language] = [dict[lang] for lang in benchmarks[!, :language]]

# Normalize benchmark times by C times
ctime = benchmarks[benchmarks[!, :language] .== "C", :]
benchmarks = innerjoin(benchmarks, ctime, on = :benchmark, makeunique = true)
select!(benchmarks, Not(:language_1))
rename!(benchmarks, :time_1 => :ctime)
benchmarks[!, :normtime] = benchmarks[!, :time] ./ benchmarks[!, :ctime];

plot(
    benchmarks,
    x = :language,
    y = :normtime,
    color = :benchmark,
    mode = "markers",
    Layout(
        xaxis_type = "category",
        xaxis_categoryorder = "mean ascending",
        yaxis_type = "log",
        xaxis_title = "",
        yaxis_title = "",
    ),
)
Plotly doesn't support sorting by geometric mean; see https://plotly.com/julia/reference/layout/xaxis/#layout-xaxis-categoryarray and the feature request. This makes the ordering a bit rough for log scales, since the sorting is based on the arithmetic mean.
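One possible workaround, sketched below on top of the script above (StatsBase is already imported there), would be to compute the geometric-mean ordering ourselves and hand Plotly an explicit category array:

# Order languages by the geometric mean of their C-normalized times,
# then pass the explicit ordering instead of "mean ascending".
order = sort(combine(groupby(benchmarks, :language), :normtime => geomean => :gm), :gm)

plot(
    benchmarks,
    x = :language,
    y = :normtime,
    color = :benchmark,
    mode = "markers",
    Layout(
        xaxis_type = "category",
        xaxis_categoryorder = "array",
        xaxis_categoryarray = order.language,
        yaxis_type = "log",
        xaxis_title = "",
        yaxis_title = "",
    ),
)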
I've been thinking about it, and I believe the plotting code should probably reside in this repo instead of the Julia website codebase. Only the final benchmark SVG file should be pushed to the website repo.
As for the website benchmark page, though, it might be pretty cool to have an embedded Plotly instance for the benchmark graph, similar to what the Plotly docs do. This would allow users to see/sort languages based on whichever benchmark they are most interested in. Some extra, nonessential interactivity.
Just throwing some ideas.
From https://github.com/JuliaLang/Microbenchmarks/pull/62#issuecomment-1098006247
Would it be too crazy to just pull the performance timings right out of the Github Actions?
So basically, on every commit, take the benchmarks.csv output from the CI and commit it to the repo? That should be doable; however, I'm not completely sold on the idea of having an "update timings" commit for every other commit, to be honest. I think manually downloading the benchmarks.csv file from the latest commit, whenever we need to update the timings/graph/table, is probably the best method for now.
Maybe that is the easiest way to actually run the benchmarks.
For sure, Github Actions has been a boon.
The big issue would be that we can't get numbers for commercial software.
Yeah... How I'm currently handling this (to get the graph shown here) is to interpolate the timings for the languages we don't have data for, based on their ratios in the last known timing data. I'm not sure whether publishing that kind of interpolated data on the JuliaLang website is honest (even with appropriate disclaimers), but I do think our graph should contain data for those languages, since no other benchmarks include them. (I'm personally okay with this myself, though; interpolated data is better than no data.) There are options for CI as discussed here, and if it comes down to it, I am still a student and have licenses for these commercial languages, so I can try to run the tests myself on local hardware once I fix up tooling PRs such as this one.
We can just update the plot on the Julia website as well. It is really old.
While it is old, the information in the new graph is very similar to the previous graph. Rust and Julia both overtake Lua, but that's the only significant trend change besides overall improvements in individual benchmarks. Let's try to 1) use interpolated data or 2) get the commercial software working (in CI or locally).
I'm totally fine with making PRs to the JuliaLang website using option 1) as a stopgap until we get updated data via 2).
I think it's fine if we do a single manual update to the csv/svg on the website, before automating benchmark updates (which might take a while longer to set up).
Interpolate the timings for the languages we don't have data for, based on their ratios in the last known timing data.
For now, it's probably best to leave those languages out by simply not plotting their points. My subjective view is that showing updated but narrower benchmarks is (probably) more useful to users than showing out-of-date but broader benchmarks. Thoughts?
We could state: "Languages X, Y, and Z are not included in the latest benchmarks due to licensing issues, but you may view historical benchmarks comparing these languages to an older version of Julia by going to https://web.archive.org/web/20200807141540/https://julialang.org/benchmarks/"
What do you think?
The other approach would be to provide the out-of-date benchmarks for them. I think either would be acceptable.
My subjective view is that showing updated but narrower benchmarks is (probably) more useful to users than showing out-of-date but broader benchmarks.
I disagree with that. After doing the interpolation, the only change (trend-wise) is Rust/Julia vs. Lua; I don't think that is enough to justify dropping many languages. Remember, the interpolation also includes Fortran, not just the closed-source languages. As a user, I don't want to click another link to find the data I want.
The other approach would be to provide the out-of-date benchmarks for them. I think either would be acceptable.
Agreed, which is what we are currently doing.
Basically, what I'm trying to get at is that we should not update if we're only going to do it partially. If we update, we should do it properly.
Right now effort should be prioritized on #29, #58, & #64.
Wait, by interpolation, you just mean copying the data point from the old graph, right? I think if it's just that (keeping the performance ratios in the plot), it's perfectly fine so long as this is described in the text.
I was more thinking about excluding Mathematica/MATLAB from the new plot if their entire benchmark is out of date (though even then, I don't think it's a big deal to copy the old benchmarks). But for specific benchmarks where there is an issue (like the Fortran compilation issue), not updating and instead interpolating sounds pretty reasonable to me.
I was more thinking about excluding Mathematica/MATLAB from the new plot
I would still prefer not to do this, since comparisons with these languages are rarely seen elsewhere. It helps new Julia users coming from those closed ecosystems see the light ;) I'd be okay with interpolating these, as getting CI for them seems a bit out of our hands.
But for specific benchmarks where there is an issue (like the Fortran compilation issue), not updating and instead interpolating sounds pretty reasonable to me.
The reason I don't like this is that we are essentially taking a shortcut in displaying the data, especially since getting this to work is in our hands (unlike the Mathematica/Matlab issue). The onus is on us to fix our benchmarks, instead of sweeping the actual issue under the rug and using older results.
In any case for a decision like this I would like for @StefanKarpinski and @ViralBShah to provide the final say.
SGTM!
I suppose I agree: having the Mathematica/MATLAB results is really useful for users from those domains of scientific computing. I think as long as the exclusions/interpolations are all described in the text, we are fine.
I guess the question is: what is the purpose of these benchmarks? Is it a quantitative comparison table for attracting new users, or is it a scientific dataset of performance across languages? If it is the former, inclusion of these proprietary languages (even if the numbers are old) is really important to help demonstrate Julia's advantage against all other languages. If it is the latter, then having up-to-date and accurate numbers is most important, even if it means excluding some languages. In reality it's probably a combination, in which case this question is difficult to answer...
Pinging this as I just noticed the benchmarks page is still showing Julia 1.0.0. Can we put this up soon? I'm linking to the benchmarks in a paper (coming out in two days) as evidence of Julia being as fast as C++/Rust; it would be great if the measurements were up-to-date :rocket:
Someone needs to set up all these environments and run the benchmarks. I don't have any of the proprietary software licenses, for example.
Maybe we could just have a second panel of benchmarks on https://julialang.org/benchmarks/?
It has been over 3 years since the last full-scale benchmark, so I don't have high hopes that anybody will get around to doing it soon. But it would be great to display Julia 1.8 benchmarks for all to see, at least somewhere we can link to.
We do have large GitHub Actions runners available in this org, which will help whenever we set that sort of thing up.
@MilesCranmer Would it help if you had commit access to the MicroBenchmarks and this repo so that you can directly edit to your liking?
EDIT: Sent you invite.
Is there a way to see these results: https://github.com/JuliaLang/Microbenchmarks/actions/runs/5567263800/workflow#L103? Did I understand correctly that a CSV file would have been generated but has since been deleted (because the logs have expired)?
Yes, I believe the logs get deleted, but perhaps we can run it again.
Since we have made Julia 1.6 the new LTS version, it might make sense to update the benchmark results in https://julialang.org/benchmarks/.