Short answer: yup, let's do that.
I'll respond in a bit more detail tomorrow.
I'll get started on this in the vmil_prep branch in this repo.
My initial list of benchmarks from most- to least-real would start like this (feel free to suggest amendments):
I'm not 100% sure about binarytrees or fannkuchredux, but my gut feeling is to exclude them.
I think the data collection can be a simple bash script using basic_benchmark.rb. That report will want to be custom, though. I'll get started on it.
Thank you for getting started on this quickly. Much appreciated.
If we include nbody, then we should include binarytrees and fannkuchredux because they are in the same "category" (language shootout toy benchmarks). So I would maybe vote to just exclude nbody. That leaves us with 7 benchmarks + the results in production, which is good enough I think.
For the graphs we could simply sort them based on the number of iseqs compiled, or move lee and optcarrot further to the left since they are synthetic benchmarks (written for the purpose of being benchmarks, more or less, although optcarrot more so than lee).
> I think the data collection can be a simple bash script using basic_benchmark.rb. That report will want to be custom, though. I'll get started on it.
That sounds like a good plan although you may need to integrate stddev calculation into your scripts. I'm not sure that we can compute the stddev on a speedup, so I think that we will want the average time after warmup for each implementation (CRuby interp, MJIT, YJIT) rather than a speedup measurement.
Based on some quick Googling about error propagation (http://ipl.physics.harvard.edu/wp-uploads/2013/03/PS3_Error_Propagation_sp13.pdf - section on multiplication and division), it looks a lot like we can calculate the stddev of the speedup if we have the stddev of both components. Since we'll have stddev on interpreter time, MJIT time and YJIT time, I think that means we can do it.
Based on the paper, I believe the relative stddev of the speedup should be the square root of the sum of squares of the two components' relative stddevs.
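Written out, the relationship I mean is the standard propagation-of-uncertainty result for a ratio of independent measurements (using the interpreter-vs-YJIT case as the example):

```latex
\frac{\sigma_{\mathrm{speedup}}}{\mathrm{speedup}}
  = \sqrt{\left(\frac{\sigma_{\mathrm{interp}}}{t_{\mathrm{interp}}}\right)^{2}
        + \left(\frac{\sigma_{\mathrm{yjit}}}{t_{\mathrm{yjit}}}\right)^{2}},
\qquad \mathrm{speedup} = \frac{t_{\mathrm{interp}}}{t_{\mathrm{yjit}}}
```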
Showing the standard deviation is going to be harder when we showcase warmup. Specifically, if we show individual iterations, there's no clear, obvious way to calculate the standard deviation at a single sample. Though we could just show the overall standard deviation for that platform, to indicate a level of uncertainty.
We could also just show a whole bunch of samples on a simple coloured line or dot plot, which would both show the uncertainty (how close the dots are to each other) and the warmup (when/where the drops in time happen) in a fairly intuitive way. It'd look slightly messy, but it would be really obvious what we wanted to show them, including the error.
> Based on some quick Googling about error propagation (http://ipl.physics.harvard.edu/wp-uploads/2013/03/PS3_Error_Propagation_sp13.pdf - section on multiplication and division), it looks a lot like we can calculate the stddev of the speedup if we have the stddev of both components.
I'm not sure uncertainties and stddev work out the same. Can you double-check that? We need to be extremely rigorous with things of this sort.
> Showing the standard deviation is going to be harder when we showcase warmup.
I think we have 3 options there:
There is definitely such a thing as line graphs with stddev error bars. I'm just not sure where the stddev would come from. Calculating it for the whole run and then using the same (constant) interval around the whole line feels like cheating, but not in a good way. We could take little sections of the samples, average in that area, and take the stddev just in that area -- but then it would be the stddev of a very small number of samples, so there would be a lot of error.
Regarding standard deviation, I've found several texts Googling "standard deviation propagation of error division" and so far they all agree (e.g. https://chem.libretexts.org/Bookshelves/Analytical_Chemistry/Supplemental_Modules_(Analytical_Chemistry)/Quantifying_Nature/Significant_Digits/Propagation_of_Error).
I'll see if I can find a more formal source, though.
Oh hey - according to Wikipedia this is a simplification of another formula (see https://en.wikipedia.org/wiki/Propagation_of_uncertainty/Example_formulae where f = A/B, right column). So it definitely reduces to that if we assume that the sources of error (YJIT time, MJIT time, raw CRuby time) vary independently rather than being correlated, which I think a lot of our math already requires. And the Wikipedia reference is about standard deviation specifically, so I think the short answer is "yes, this formula is the one used for stddev."
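In report code that would look roughly like this (a sketch only; these method names aren't in yjit-metrics, and it assumes the interpreter and JIT timing series vary independently):

```ruby
def mean(samples)
  samples.sum / samples.size.to_f
end

def rel_stddev(samples)
  m = mean(samples)
  Math.sqrt(samples.sum { |s| (s - m)**2 } / (samples.size - 1).to_f) / m
end

# Speedup of the JIT'd config vs. the baseline, with the propagated relative
# stddev (sqrt of the sum of squares of the two relative stddevs).
def speedup_with_rel_stddev(base_times, jit_times)
  speedup = mean(base_times) / mean(jit_times)
  rel_sd  = Math.sqrt(rel_stddev(base_times)**2 + rel_stddev(jit_times)**2)
  [speedup, rel_sd]
end
```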
> There is definitely such a thing as line graphs with stddev error bars. I'm just not sure where the stddev would come from.
If you have 20 runs of 100 iterations each, and you have the time values for each iteration, then you can compute a stddev for iteration 0, 1, 2, ..., N. You might need to run some curve smoothing with a sliding window to make the curves look less noisy, though. This is standard practice AFAIK.
Ah, okay. That makes sense -- so with 20 runs, the 16th sample would literally be 20 different 16th samples. Yeah, that's not bad.
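To check I've got the shape of it right, roughly this (a sketch; `runs` here is just an array of per-run iteration-time arrays, not the real data structure from the JSON files):

```ruby
# For iteration i, gather the i-th sample from every run (optionally a small
# sliding window around it), then take mean and stddev across those samples.
def per_iteration_stats(runs, window: 1)
  num_iters = runs.map(&:size).min
  (0...num_iters).map do |i|
    lo = [i - (window / 2), 0].max
    hi = [i + (window / 2), num_iters - 1].min
    samples = runs.flat_map { |run| run[lo..hi] }
    m  = samples.sum / samples.size.to_f
    sd = Math.sqrt(samples.sum { |s| (s - m)**2 } / (samples.size - 1).to_f)
    { iteration: i + 1, mean: m, stddev: sd }
  end
end
```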
Okay. We have the basic report and basic data gathering for speedups, including the list of benchmarks in number-of-compiled-iseqs order.
For getting the inlined and outlined code sizes in a release YJIT, this is one way we could do it: https://github.com/Shopify/yjit/commit/0e63dbe0b5173543296c995dfbc987d9a8973436
I could have the yjit-metrics harness check for that endpoint and record the result if it's present. Then when I compile YJIT with that patch it'll be included. I'm not opposed to including something like that in main since it should have no performance impact. But it's not clear to me that we'd need to.
The other approach that comes to mind is that we could return a stats hash with just the sizes and other no-runtime-impact stats, rather than nil, in non-RUBY_DEBUG YJIT. That would technically be an interface change since somebody might be checking for that nil. But it seems like a reasonable one at first blush.
Let me know if you'd prefer one or the other. The least invasive answer is definitely "this is a hack, it shouldn't change the main branch of YJIT in any way, just run a patched version to collect code sizes." (Edit: not "once", I guess, but I could certainly patch the YJIT prod Ruby version with which I collect stats.)
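For reference, the harness-side check I have in mind is roughly this (the hash keys are placeholders for whatever the stats endpoint actually exposes, so don't take them literally):

```ruby
result = {} # stands in for the harness's per-run output hash

# Record code sizes only if this Ruby exposes the stats endpoint and it
# returns a hash (i.e. a stats-enabled or patched YJIT build).
stats = defined?(YJIT) && YJIT.respond_to?(:runtime_stats) ? YJIT.runtime_stats : nil
if stats.is_a?(Hash)
  result["inline_code_size"]   = stats[:inline_code_size]   # placeholder key name
  result["outlined_code_size"] = stats[:outlined_code_size] # placeholder key name
end
```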
(Oh hey, I should put this here...)
Google Sheets actually has really bad support for error bars: there's no easy way to drive the error bar sizes from your data, only a way to set a constant-for-the-series amount of error on each point. So to get error bars that vary per bar, you have to make every bar its own series (and people do :-( ).
Other graphing choices that come to mind:
Here's the current CSV file generated with the report, so we can do early exploration of graphing solutions:
https://gist.github.com/noahgibbs/1eab9c5ace46465bdccdd1c2934cbb4b
> The other approach that comes to mind is that we could return a stats hash with just the sizes and other no-runtime-impact stats, rather than nil, in non-RUBY_DEBUG YJIT. That would technically be an interface change since somebody might be checking for that nil. But it seems like a reasonable one at first blush.
I guess I would prefer to return a different hash with most of the stats not available in release, but maybe it's fine to just patch your Ruby build for the benchmarking and not change our YJIT main? We can just "hack" this for the paper.
The table you generated looks good.
One difficulty, though, is that MJIT needs a lot more warmup. Technically, if you want the peak speedup, you would want to let it warm up until it's done compiling everything, so we know what its peak performance is. I'm actually remembering that MJIT has an equivalent to our call threshold: --jit-min-calls, I think. So maybe set --jit-min-calls=10 for MJIT, the same as our default --yjit-call-threshold=10, and make sure to give all benchmarks 20+ warmup iterations.
Again, good work, and I'm very happy you're so on top of this.
I'll plan to return the hash in prod with fewer elements, then. Didn't want to change the API without at least talking to you first. I have a hack that works, but it won't be hard to do it the other way.
I'm collecting some MJIT data now to start getting a feel for its warmup. I'll mess around with its min-calls param as well. I'll probably have the early rough warmup report with very non-final data on Monday, at a guess.
I realised I should be collecting all this on a c5.metal with dedicated tenancy, so I've dumped an AMI for making one of those. So aside from all the other reasons these are pre-public numbers, we should use a dedicated instance for all the final benchmarking. That adds over $3.00/hour for a c5, so I'm not going to do all the early testing on a dedicated instance. I'll also need to get GNU screen working and/or figure out the problem with nohup, because we don't want to be dependent on a terminal staying connected for the bigger benchmarks.
> I'll mess around with its min-calls param as well.
I think having min-calls be the same value as the YJIT call-threshold, which is currently 10, is the safest bet. It seems more "fair".
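Concretely, something like this is what I'd aim for (only the flags are the ones we've discussed; the benchmark path and the bare system call are just a sketch of the harness):

```ruby
# Keep the compile thresholds matched across configurations for fairness.
RUBY_CONFIGS = {
  "yjit"   => %w[--yjit --yjit-call-threshold=10],
  "mjit"   => %w[--jit --jit-min-calls=10],
  "no-jit" => [],
}

RUBY_CONFIGS.each do |name, flags|
  puts "Running #{name}..."
  system("ruby", *flags, "benchmarks/activerecord/benchmark.rb") # path is hypothetical
end
```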
> I'm collecting some MJIT data now to start getting a feel for its warmup. I'll mess around with its min-calls param as well. I'll probably have the early rough warmup report with very non-final data on Monday, at a guess.
Sounds good :)
> That adds over $3.00/hour for a c5
I think this is ok. Obviously don't leave it running overnight doing nothing, but don't make your own life unnecessarily hard either.
> I'll also need to get GNU screen working and/or figure out the problem with nohup
I've been using tmux with no issues if that helps.
By "messing around" with min-calls I mean I'll start generally characterising MJIT's behaviour. min-calls does something slightly different for MJIT than YJIT -- and even with --jit-wait there's still some timing oddness, since that's not really meant as a production configuration. I should probably test warmup with and without --jit-wait, so we at least check what MJIT is actually designed to do.
Also, I should run it manually for a while because --jit-wait doesn't necessarily work -- it waits (by default) 60 seconds, then if it can't get the compiled method it prints a warning and continues on its way. So it seems worth some playing around with, to make sure we're getting what we think we're getting.
Also: I think I've figured out the problem with nohup. I think it works fine other than not flushing output from subprocesses. I didn't see any progress from long-running benchmarks like railsbench or jekyll and thought they crashed, but instead now I think they just didn't flush any output until the end of the run. That's not bad, if so, and I might be able to fix it by manually setting STDOUT.sync and/or STDERR.sync.
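If that's the cause, the fix is probably just a couple of lines near the top of the harness script (a sketch):

```ruby
# Flush output as it's written instead of block-buffering when stdout isn't a
# TTY, so nohup's redirected log updates as each iteration completes.
STDOUT.sync = true
STDERR.sync = true
```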
I've put in a PR for including the sizes in YJIT.runtime_stats in all configurations: https://github.com/Shopify/yjit/pull/132
This issue is broad enough (data collection, MJIT warmup characterisation, graphing) that I'm going to split into sub-issues where appropriate. I've opened a "data collection" sub-issue: https://github.com/Shopify/yjit-metrics/issues/7
Could you include the benchmarks 30k_methods and 30k_ifelse? They are synthetic benchmarks, but they tell an interesting story about code size.
Here's where we are, in minimal text-formatted form, on the warmup report right now:
YJIT Warmup Report:
bench samples iter #1 iter #5 iter #10 RSD #1 RSD #5 RSD #10
----------- ------- -------- -------- -------- ------ ------ -------
psych-load 10 2580.1ms 2572.5ms 2577.2ms 0.05% 0.08% 0.07%
30k_methods 6 702.2ms 631.9ms 0.06% 0.03%
----------- ------- -------- -------- -------- ------ ------ -------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.
MJIT Warmup Report:
bench samples iter #1 iter #5 iter #10 RSD #1 RSD #5 RSD #10
----------- ------- -------- -------- -------- ------ ------ -------
psych-load 10 2894.8ms 2645.5ms 2467.7ms 0.05% 0.04% 0.09%
30k_methods 3 6864.9ms 6986.2ms 0.06% 0.11%
----------- ------- -------- -------- -------- ------ ------ -------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.
No JIT Warmup Report:
bench samples iter #1 iter #5 iter #10 RSD #1 RSD #5 RSD #10
----------- ------- -------- -------- -------- ------ ------ -------
psych-load 10 2777.1ms 2765.6ms 2796.4ms 0.07% 0.07% 0.07%
30k_methods 6 5709.5ms 5665.2ms 0.06% 0.04%
----------- ------- -------- -------- -------- ------ ------ -------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.
Truffle Warmup Report:
bench samples iter #1 iter #5 iter #10 RSD #1 RSD #5 RSD #10
----------- ------- --------- -------- -------- ------ ------ -------
psych-load 10 12903.3ms 7678.4ms 7375.4ms 0.08% 0.07% 0.05%
30k_methods 8 15957.9ms 4319.5ms 0.02% 0.04%
----------- ------- --------- -------- -------- ------ ------ -------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.
These are trivial warmup CSV files, with Mac-generated data. But this is the intended format. Notice that in the same place the text tables have blanks (e.g. TruffleRuby 30k_methods iteration #10), the CSV files have empty strings.
https://gist.github.com/noahgibbs/07f78f91d65c5664b54698d3da35caae
Looks good.
I'm going to provide some critical feedback, probably on the pedantic side, but I really want to make sure we get this 100% right.
> RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Being pedantic here, the Google states "[RSD] is expressed in percent and is obtained by multiplying the standard deviation by 100 and dividing this product by the average". Did you forget to multiply by 100? The RSD numbers look very small. I would have expected RSD closer to 1 to 3%, which is what yjit-metrics seems to yield on AWS, so 4-8% would make sense on your Mac.
I assume you already know this and you've only reported iterations 1, 5 and 10 for compactness in text form, but for the paper, we need to be able to produce this output for each iteration, 1, 2, 3, ..., N so we can graph it. For TruffleRuby, I expect the total warmup may take up to 50-100 iterations. Though going past 100 iterations is maybe not necessary. At that point, we've given them more than a fair chance to warm up. This is also why I would like the X axis in the graph to be in seconds, to convey to the reviewer how much real clock time this takes.
Definitely storing all the iterations in the JSON files, and then running the report repeatedly against them. The report extracts specific iteration numbers, but they're all available.
D'oh! You're right on the relative stddev - I forgot to multiply by 100.0. Past time for me to put that in a method...
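For the record, here's roughly the method I'm adding (the name is mine, not anything already in yjit-metrics):

```ruby
# Relative standard deviation, as a percentage of the mean.
def rel_stddev_pct(samples)
  mean = samples.sum / samples.size.to_f
  variance = samples.sum { |s| (s - mean)**2 } / (samples.size - 1).to_f
  100.0 * Math.sqrt(variance) / mean
end
```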
Okay, corrected and with some additional data collected:
YJIT Warmup Report:
bench samples iter #1 iter #5 iter #10 iter #50 iter #100 RSD #1 RSD #5 RSD #10 RSD #50 RSD #100
------------ ------- -------- -------- -------- -------- --------- ------ ------ ------- ------- --------
psych-load 10 2580.1ms 2572.5ms 2577.2ms 5.13% 7.77% 7.28%
30k_methods 10 699.6ms 631.9ms 4.87% 2.80%
30k_ifelse 10 396.1ms 271.1ms 275.1ms 2.28% 2.80% 4.28%
activerecord 10 143.2ms 134.8ms 138.9ms 150.5ms 136.7ms 6.94% 5.07% 8.04% 17.01% 5.02%
------------ ------- -------- -------- -------- -------- --------- ------ ------ ------- ------- --------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.
MJIT Warmup Report:
bench samples iter #1 iter #5 iter #10 iter #50 iter #100 RSD #1 RSD #5 RSD #10 RSD #50 RSD #100
------------ ------- -------- -------- -------- -------- --------- ------ ------ ------- ------- --------
psych-load 10 2894.8ms 2645.5ms 2467.7ms 4.55% 3.95% 8.74%
30k_methods 10 6660.1ms 6845.4ms 5.45% 7.53%
30k_ifelse 10 2829.6ms 2940.8ms 2915.4ms 9.42% 8.55% 7.81%
activerecord 10 184.2ms 192.1ms 185.4ms 183.3ms 197.0ms 6.64% 9.85% 4.68% 6.56% 13.91%
------------ ------- -------- -------- -------- -------- --------- ------ ------ ------- ------- --------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.
No-JIT Warmup Report:
bench samples iter #1 iter #5 iter #10 iter #50 iter #100 RSD #1 RSD #5 RSD #10 RSD #50 RSD #100
------------ ------- -------- -------- -------- -------- --------- ------ ------ ------- ------- --------
psych-load 10 2777.1ms 2765.6ms 2796.4ms 6.78% 6.87% 6.96%
30k_methods 10 5682.2ms 5702.9ms 5.26% 4.80%
30k_ifelse 10 2337.4ms 2287.5ms 2178.1ms 10.47% 15.81% 11.75%
activerecord 10 167.7ms 168.3ms 171.3ms 164.5ms 164.2ms 6.19% 5.68% 7.01% 3.37% 3.73%
------------ ------- -------- -------- -------- -------- --------- ------ ------ ------- ------- --------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.
Truffle Warmup Report:
bench samples iter #1 iter #5 iter #10 iter #50 iter #100 RSD #1 RSD #5 RSD #10 RSD #50 RSD #100
------------ ------- --------- -------- -------- -------- --------- ------ ------ ------- ------- --------
psych-load 10 12903.3ms 7678.4ms 7375.4ms 8.01% 7.04% 4.64%
30k_methods 10 15976.6ms 4296.8ms 1.56% 3.99%
30k_ifelse 10 12852.8ms 4083.5ms 3311.6ms 2.36% 5.69% 3.61%
activerecord 10 1556.5ms 275.4ms 274.9ms 158.0ms 117.6ms 8.04% 8.86% 9.32% 12.56% 9.24%
------------ ------- --------- -------- -------- -------- --------- ------ ------ ------- ------- --------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.
Looks sensible.
I was thinking about this a little bit and, in terms of warmup time, plotting this as a graph is going to be nontrivial. I would like it if the X axis could be seconds, but the natural unit that we have for the X axis is the number of iterations. It would also be nice if we could have a standard deviation in the plot, but that only makes sense if we can align all the measurements based on the iteration number.
Some ideas:
Alternatively, we can stick with the X axis being the number of iterations and the Y axis being time (normalized or not). Then we'd just have to give the reader some idea of how much time this took in the text.
If you would like some help with the plotting, I've never used the Ruby libraries for generating SVG, but maybe I could jump in. Could be an opportunity for me to improve my Ruby. I would just need access to some raw data as JSON.
I'll give you a link right now to the (very early, not amazing) Mac data I collected yesterday. I'm about to start a much better collection run in line with your current request (ActiveRecord + RailsBench, 20-minute runs, 20ish runs) and I'll give you access to that data too, once it exists.
https://drive.google.com/file/d/1hFjY5fNr5sFaq1pX44RZ2y-ugU7ASb9g/view?usp=sharing
The data I've just linked should work fine if you unpack it in a directory and point basic_report at it ("./basic_report.rb -d partial_mac_data --all --report=vmil_warmup"), which could get you started with writing reporting code.
Here's a link to the 'victor' gem used by the lee benchmark, which is what I was planning to use for Ruby SVG output: https://github.com/DannyBen/victor
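Basic usage is along these lines, going by the gem's README (untested against our data; the polyline example is just to show the shape of the API):

```ruby
require "victor"

# Draw one data series as a polyline; axes and labels omitted for brevity.
points = [[0, 90], [10, 60], [20, 45], [30, 40], [40, 38]]

svg = Victor::SVG.new width: 200, height: 100, style: { background: "#fff" }
svg.build do
  polyline points: points.map { |x, y| "#{x},#{y}" }.join(" "),
           fill: "none", stroke: "#c00", stroke_width: 2
end
svg.save "warmup_sketch" # writes warmup_sketch.svg
```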
I feel like 20 runs isn't enough to justify turning off ASLR, so I won't do that for today's run. We may want to do that and have a longer run next week, but it ought to be easy to re-run the reporting with more runs and/or iterations-per-run.
Okay going to play with that a bit today :)
Now that I'm running with a newer YJIT I'm seeing code sizes recorded for non-debug YJIT Rubies. Yay! That messes up reporting, which currently thinks only YJIT-stats Rubies have stats. Boo! I'll work on that.
I have some 20-minute ActiveRecord runs recorded, but not a ton of them. I'm turning off interpreter runs (we don't care as much about characterising the interpreter's warmup) and YJIT prod runs (it segfaults not infrequently) and I'll let it run overnight, which should get us good TruffleRuby and MJIT warmup data, assuming no more crashes.
Trying to think if there's an easy way to share data incrementally. In Dropbox I'd put it in a shared folder. But Google Drive has a much less stable local client, so I don't want to just do it the same way. I could keep uploading .tar.bz2 files through my browser and sharing links, but that gets old fast.
Here's the existing warmup report for the data I've collected so far today:
YJIT Warmup Report:
bench samples iter #1 iter #5 iter #10 iter #50 iter #100 iter #500 iter #1000 iter #5000 RSD #1 RSD #5 RSD #10 RSD #50 RSD #100 RSD #500 RSD #1000 RSD #5000
------------ ------- ------- ------- -------- -------- --------- --------- ---------- ---------- ------ ------ ------- ------- -------- -------- --------- ---------
activerecord 3 139.5ms 134.2ms 133.4ms 133.2ms 133.0ms 133.1ms 132.9ms 133.1ms 0.26% 0.14% 0.50% 0.42% 0.24% 0.23% 0.09% 0.26%
------------ ------- ------- ------- -------- -------- --------- --------- ---------- ---------- ------ ------ ------- ------- -------- -------- --------- ---------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.
MJIT Warmup Report:
bench samples iter #1 iter #5 iter #10 iter #50 iter #100 iter #500 iter #1000 iter #5000 RSD #1 RSD #5 RSD #10 RSD #50 RSD #100 RSD #500 RSD #1000 RSD #5000
------------ ------- ------- ------- -------- -------- --------- --------- ---------- ---------- ------ ------ ------- ------- -------- -------- --------- ---------
activerecord 3 167.9ms 166.5ms 166.4ms 166.5ms 166.6ms 166.3ms 166.5ms 166.5ms 1.05% 0.94% 0.98% 1.14% 1.06% 1.06% 1.23% 1.00%
------------ ------- ------- ------- -------- -------- --------- --------- ---------- ---------- ------ ------ ------- ------- -------- -------- --------- ---------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.
Truffle Warmup Report:
bench samples iter #1 iter #5 iter #10 iter #50 iter #100 iter #500 iter #1000 iter #5000 iter #10000 RSD #1 RSD #5 RSD #10 RSD #50 RSD #100 RSD #500 RSD #1000 RSD #5000 RSD #10000
------------ ------- -------- -------- -------- -------- --------- --------- ---------- ---------- ----------- ------ ------ ------- ------- -------- -------- --------- --------- ----------
activerecord 5 9869.4ms 2851.1ms 1506.9ms 1490.3ms 924.8ms 103.4ms 99.8ms 103.5ms 604.7ms 6.37% 7.80% 29.25% 52.59% 42.09% 9.47% 6.85% 8.23% 14.94%
------------ ------- -------- -------- -------- -------- --------- --------- ---------- ---------- ----------- ------ ------ ------- ------- -------- -------- --------- --------- ----------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.
And here are the data files I used to get it: https://drive.google.com/file/d/1rzgqOiOYpOKMy96oPHtmphADfMTx7AQ5/view?usp=sharing
After the most recent runs, here's where we are for MJIT and TruffleRuby data:
MJIT Warmup Report:
bench samples iter #1 iter #5 iter #10 iter #50 iter #100 iter #500 iter #1000 iter #5000 RSD #1 RSD #5 RSD #10 RSD #50 RSD #100 RSD #500 RSD #1000 RSD #5000
------------ ------- ------- ------- -------- -------- --------- --------- ---------- ---------- ------ ------ ------- ------- -------- -------- --------- ---------
activerecord 25 170.9ms 166.8ms 166.9ms 167.8ms 166.7ms 167.2ms 166.9ms 167.1ms 4.43% 0.67% 0.66% 2.87% 0.65% 1.06% 0.69% 0.86%
------------ ------- ------- ------- -------- -------- --------- --------- ---------- ---------- ------ ------ ------- ------- -------- -------- --------- ---------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.
Truffle Warmup Report:
bench samples iter #1 iter #5 iter #10 iter #50 iter #100 iter #500 iter #1000 iter #5000 iter #10000 RSD #1 RSD #5 RSD #10 RSD #50 RSD #100 RSD #500 RSD #1000 RSD #5000 RSD #10000
------------ ------- -------- -------- -------- -------- --------- --------- ---------- ---------- ----------- ------ ------ ------- ------- -------- -------- --------- --------- ----------
activerecord 27 9327.6ms 3017.6ms 1618.8ms 1167.5ms 765.0ms 102.1ms 104.8ms 99.8ms 12.72% 6.88% 24.31% 47.65% 42.39% 8.02% 8.90% 7.01%
------------ ------- -------- -------- -------- -------- --------- --------- ---------- ---------- ----------- ------ ------ ------- ------- -------- -------- --------- --------- ----------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.
So those are fairly high stddevs, even for later iterations, for TruffleRuby. The blank columns for iteration 10k are because some runs had them and some didn't and I didn't want to report with fewer samples while claiming full quality.
But as you pointed out, the iterations are definitely more of a probability distribution than a sample, especially for Truffle and especially for earlier iterations.
I'll post the data for the other runs later -- today's technically a day off, I just wanted to be sure to not run the AWS instance all weekend.
Also, the massive RSDs around iteration 50-100 may be a sign of that alternation between fast and slow that we talked about Truffle doing before. I'll need to examine the runs in detail to know for sure, and I haven't yet.
Here's the data from the weekend: https://drive.google.com/file/d/1bw8C-4HMZ62miSeUcVil1MM0v2WszJXG/view?usp=sharing
I've taken down the older link, but all the AWS data files from there are included in the new archive.
Thanks again for some great work. I think it's ok to cut everything at 1000 iterations for future runs. Should save a little bit of time.
Next we'll need to think about how to format this into a latex table. IMO this should be done by a script. Might be nice to include the mean amount of total time it takes YJIT/MJIT/TR to get to iteration 1, 5, 10, 100, ... as well, since that will give people some idea how long the warmup actually takes. Potentially I could do that myself since you might be busy gathering warmup results for railsbench, and we still need to do a complete run to calculate the time on every benchmark.
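Something like this could turn the warmup CSV into LaTeX table rows (the column names are guesses at the CSV layout, so treat it as a sketch):

```ruby
require "csv"

# Emit one LaTeX tabular row per benchmark from the warmup CSV.
# Column names here ("bench", "iter_1", ...) are assumptions about the layout.
COLUMNS = %w[bench iter_1 iter_10 iter_100 rsd_100]

CSV.foreach("warmup_report.csv", headers: true) do |row|
  cells = COLUMNS.map { |col| row[col].to_s }
  puts cells.join(" & ") + " \\\\"
end
```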
For speedup data collection, I suspect we're going to want different numbers of warmup iterations for different benchmarks. 1000 iterations is about where TruffleRuby has reliably warmed up on the ActiveRecord benchmark, but 1000 iterations of RailsBench or Jekyll takes a very long time.
ActiveRecord is the fastest at something like 150-180ms depending on the Ruby config, while Jekyll can be in the range of 8-10 seconds per iteration.
So: I'll collect warmup data for the other benchmarks we plan to do a speed comparison on, then update the collector script to grab a proportional "chunk" of results with correct warmup iterations, etc. Then I can basically run it in a loop and the results will slowly improve over time as we collect more data.
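The "proportional chunk" could be as simple as giving every benchmark roughly the same warmup wall-clock budget, something like this heuristic sketch (the budget and clamp values are made up; the per-iteration times are the rough numbers above):

```ruby
# Aim each benchmark at roughly the same warmup wall-clock budget, clamped so
# fast benchmarks still get many iterations and slow ones don't run forever.
WARMUP_BUDGET_SEC = 300.0

mean_iter_sec = { "activerecord" => 0.17, "jekyll" => 9.0 } # measured separately

warmup_iters = mean_iter_sec.transform_values do |sec|
  (WARMUP_BUDGET_SEC / sec).round.clamp(10, 1000)
end
# => {"activerecord"=>1000, "jekyll"=>33}
```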
I'd really like to submit a short paper to this VMIL workshop, but the deadline is obviously quite close (Friday August 6th, less than 4 weeks away). It would be neat if we could use your benchmarking framework. If you're ok with that, I would like to entrust you with gathering benchmarking results and producing graphs for this project. If you feel like this would be too stressful, you can also say no, but I could definitely use your help, and I think it would be a nice exercise to put the tool you have built to work.
I'm going to describe the final results that I'd like to show in the paper and we can work backwards from there. Since the deadline is tight, we could use your framework to gather results that we dump into a CSV file, but the final graph generation can be done in a plain old spreadsheet program (preferably Google sheets so we can collaborate easily). Ideally, all the results gathering and compiling the data into CSV files should be scripted, because we may want to do small tweaks and rerun this. We may also update/fix the YJIT code and need to do a rerun.
The first thing is that we should exclude the really, really micro benchmarks: cfunc_itself, fib, getivar, setivar, and respond_to. These won't be taken seriously by the reviewers and will just detract from the other data.
We need a graph that shows the time taken by YJIT vs the time taken by MJIT and also by the interpreter. This needs to have error bars (standard deviation) for each. Ideally, the benchmarks should be sorted from left to right, with more "toy-like" benchmarks to the left and more real-world-like benchmarks to the right, like I did in my talk. activerecord would go just left of liquid-render.
I'd like to have something to showcase warmup time. For that, it could be cool if we could benchmark against both the Ruby interpreter and MJIT. We could have a plot of the time taken by each after 1 iteration, 2 iterations, 3 iterations and so on, or just after 1, 10, 100 and 1000 or something like that. This needs to include error bars (standard deviation) for each platform. I think we could do this for only railsbench, which is our biggest, most realistic benchmark.
Could also be nice to have a table with the JIT coverage percentages for each benchmark, number of iseqs compiled, and also the inlined and outlined code size generated. Caveat here that if we want to be really rigorous we'd need to gather code size generated in a release build, because in debug builds we generate counters. That's a bit tricky to do, so we can just skip the code size if we are short on time.
All benchmarking needs to happen on AWS and we need to be fairly rigorous in double-checking that everything is accurate.
What do you think, are you in? :)