Shopify / yjit-metrics

"Tasks for benchmarking, building and collecting stats for YJIT"
MIT License

Use benchmarking framework to gather results for VMIL paper #2

Closed: maximecb closed this issue 3 years ago

maximecb commented 3 years ago

I'd really like to submit a short paper to this VMIL workshop, but the deadline is obviously quite close (Friday August 6th, less than 4 weeks away). It would be neat if we could use your benchmarking framework. If you're ok with that, I would like to entrust you with gathering benchmarking results and producing graphs for this project. If you feel like this would be too stressful, you can also say no, but I could definitely use your help, and I think it would be a nice exercise to put the tool you have built to work.

I'm going to describe the final results that I'd like to show in the paper and we can work backwards from there. Since the deadline is tight, we could use your framework to gather results that we dump into a CSV file, but the final graph generation can be done in a plain old spreadsheet program (preferably Google sheets so we can collaborate easily). Ideally, all the results gathering and compiling the data into CSV files should be scripted, because we may want to do small tweaks and rerun this. We may also update/fix the YJIT code and need to do a rerun.

The first thing is that we should exclude the really, really micro benchmarks: cfunc_itself, fib, getivar, setivar, and respond_to. These won't be taken seriously by the reviewers and will just detract from the other data.

We need a graph that shows the time taken by YJIT vs the time taken by MJIT and also by the interpreter. This needs to have error bars (standard deviation) for each. Ideally, the benchmarks should be sorted from left to right, with more "toy-like" benchmarks on the left and more real-world-like benchmarks on the right, like I did in my talk. activerecord would go just left of liquid-render.

I'd like to have something to showcase warmup time. For that, it could be cool to benchmark against both the Ruby interpreter and MJIT. We could have a plot of the time taken by each after 1 iteration, 2 iterations, 3 iterations and so on, or just after 1, 10, 100 and 1000, or something like that. This needs to include error bars (standard deviation) for each platform. I think we could do this for only railsbench, which is our biggest, most realistic benchmark.

Could also be nice to have a table with the JIT coverage percentages for each benchmark, number of iseqs compiled, and also the inlined and outlined code size generated. Caveat here that if we want to be really rigorous we'd need to gather code size generated in a release build, because in debug builds we generate counters. That's a bit tricky to do, so we can just skip the code size if we are short on time.

All benchmarking needs to happen on AWS and we need to be fairly rigorous in double-checking that everything is accurate.

What do you think, are you in? :)

noahgibbs commented 3 years ago

Short answer: yup, let's do that.

I'll respond in a bit more detail tomorrow.

noahgibbs commented 3 years ago

I'll get started on this in the vmil_prep branch in this repo.

My initial list of benchmarks from most- to least-real would start like this (feel free to suggest amendments):

I'm not 100% sure for binarytrees or fannkuchredux, but my gut feeling is to exclude them.

I think the data collection can be a simple bash script using basic_benchmark.rb. That report will want to be custom, though. I'll get started on it.

maximecb commented 3 years ago

Thank you for getting started on this quickly. Much appreciated πŸ™

If we include nbody, then we should include binarytrees and fannkuchredux because they are in the same "category" (language shootout toy benchmarks). So I would maybe vote to just exclude nbody. That leaves us with 7 benchmarks + the results in production, which is good enough I think.

For the graphs we could simply sort them based on the number of iseqs compiled, or have lee and optcarrot more to the left since they are synthetic benchmarks (written for the purpose of being benchmarks, more or less, although optcarrot more so than lee).

I think the data collection can be a simple bash script using basic_benchmark.rb. That report will want to be custom, though. I'll get started on it.

That sounds like a good plan although you may need to integrate stddev calculation into your scripts. I'm not sure that we can compute the stddev on a speedup, so I think that we will want the average time after warmup for each implementation (CRuby interp, MJIT, YJIT) rather than a speedup measurement.

noahgibbs commented 3 years ago

Based on some quick Googling about error propagation (http://ipl.physics.harvard.edu/wp-uploads/2013/03/PS3_Error_Propagation_sp13.pdf - section on multiplication and division), it looks a lot like we can calculate the stddev of the speedup if we have the stddev of both components. Since we'll have stddev on interpreter time, MJIT time and YJIT time, I think that means we can do it.

Based on the paper, I believe the relative stddev of the speedup should be the square root of the sum of squares of the two component relative stddevs.
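In Ruby terms, that works out to something like this (a quick sketch assuming the two measurements vary independently; the names are just for illustration):

```ruby
# Relative stddev of a ratio f = A / B (e.g. speedup = interp_time / yjit_time),
# assuming A and B vary independently:
#   rsd(f)^2 = rsd(A)^2 + rsd(B)^2
def rel_stddev_of_ratio(rsd_a, rsd_b)
  Math.sqrt(rsd_a**2 + rsd_b**2)
end

# e.g. 2% RSD on interpreter time and 3% RSD on YJIT time
rel_stddev_of_ratio(0.02, 0.03) # => ~0.036, i.e. about 3.6% RSD on the speedup
```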

noahgibbs commented 3 years ago

Showing the standard deviation is going to be harder when we showcase warmup. Specifically, if we show individual iterations there's no clear, obvious way to calculate the standard deviation at a single sample. We could, though, just show the overall standard deviation for that platform, to indicate a level of uncertainty.

We could also just show a whole bunch of samples on a simple coloured line or dot plot, which would both show the uncertainty (how close the dots are to each other) and the warmup (when/where the drops in time happen) in a fairly intuitive way. It'd look slightly messy, but it would be really obvious what we wanted to show them, including the error.

maximecb commented 3 years ago

Based on some quick Googling about error propagation (http://ipl.physics.harvard.edu/wp-uploads/2013/03/PS3_Error_Propagation_sp13.pdf - section on multiplication and division), it looks a lot like we can calculate the stddev of the speedup if we have the stddev of both components.

I'm not sure uncertainties and stddev work out the same. Can you double-check that? We need to be extremely rigorous with things of this sort.

Showing the standard deviation is going to be harder when we showcase warmup.

I think we have 3 options there:

noahgibbs commented 3 years ago

There is definitely such a thing as line graphs with stddev error bars. I'm just not sure where the stddev would come from. Calculating it for the whole run and then using the same (constant) interval around the whole line feels like cheating, and not in a good way. We could take little sections of the samples, average in that area and take the stddev just in that area -- but then it's the stddev of a very small number of samples, so there will be a lot of error.

Regarding standard deviation, I've found several texts Googling "standard deviation propagation of error division" and so far they all agree (e.g. https://chem.libretexts.org/Bookshelves/Analytical_Chemistry/Supplemental_Modules_(Analytical_Chemistry)/Quantifying_Nature/Significant_Digits/Propagation_of_Error).

I'll see if I can find a more formal source, though.

Oh hey - according to Wikipedia this is a simplification of another formula (see https://en.wikipedia.org/wiki/Propagation_of_uncertainty/Example_formulae where f = A/B, right column). So it definitely reduces to that if we assume the sources of error (YJIT time, MJIT time, raw CRuby time) vary independently rather than being correlated, but I think a lot of our math already requires that. And the Wikipedia reference is about standard deviation specifically, so I think the short answer is "yes, this formula is the one used for stddev."

maximecb commented 3 years ago

There is definitely such a thing as line graphs with stddev error bars. I'm just not sure where the stddev would come from.

If you have 20 runs of 100 iterations each, and you have the time values for each iteration, then you can compute a stddev for iteration 0, 1, 2, ..., N. You might need to run some curve smoothing with a sliding window to make the curves look less noisy, though. This is standard practice AFAIK.
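Concretely, that per-iteration calculation would look roughly like this (a sketch; `runs` here is a hypothetical array of per-run iteration-time arrays, not something that exists in the repo yet):

```ruby
# runs: e.g. 20 runs, each an array of ~100 per-iteration times in ms
def per_iteration_stats(runs)
  num_iters = runs.map(&:length).min
  (0...num_iters).map do |i|
    samples = runs.map { |run| run[i] }   # the i-th iteration from every run
    mean    = samples.sum / samples.length.to_f
    stddev  = Math.sqrt(samples.sum { |s| (s - mean)**2 } / (samples.length - 1))
    { iter: i, mean: mean, stddev: stddev }
  end
end

# Optional sliding-window smoothing of the means, to make the curve less noisy
def smooth(values, window = 5)
  values.each_cons(window).map { |w| w.sum / w.length.to_f }
end
```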

noahgibbs commented 3 years ago

Ah, okay. That makes sense -- so with 20 runs, the 16th sample would literally be 20 different 16th samples. Yeah, that's not bad.

noahgibbs commented 3 years ago

Okay. We have the basic report and basic data gathering for speedups, including the list of benchmarks in number-of-compiled-iseqs order.

For getting the inlined and outlined code sizes in a release YJIT, this is one way we could do it: https://github.com/Shopify/yjit/commit/0e63dbe0b5173543296c995dfbc987d9a8973436

I could have the yjit-metrics harness check for that endpoint and record the result if it's present. Then when I compile YJIT with that patch it'll be included. I'm not opposed to including something like that in main since it should have no performance impact. But it's not clear to me that we'd need to.

The other approach that comes to mind is that we could return a stats hash with just the sizes and other no-runtime-impact stats, rather than nil, in non-RUBY_DEBUG YJIT. That would technically be an interface change since somebody might be checking for that nil. But it seems like a reasonable one at first blush.

Let me know if you'd prefer one or the other. The least invasive answer is definitely "this is a hack, it shouldn't change the main branch of YJIT in any way, just run a patched version once to collect code sizes." (Edit: not "once", I guess, but I could certainly patch the YJIT prod Ruby version with which I collect stats.)
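For reference, the harness-side check could be as simple as something like this (a sketch only; the stat key names and the `results` hash are assumptions for illustration, not the real yjit-metrics or YJIT names):

```ruby
# Record code sizes only if this Ruby build exposes them via runtime_stats.
# Key names here are illustrative, not the actual YJIT stat names.
if defined?(YJIT) && YJIT.respond_to?(:runtime_stats)
  stats = YJIT.runtime_stats
  if stats && stats[:inline_code_size]
    results[:inline_code_size]   = stats[:inline_code_size]
    results[:outlined_code_size] = stats[:outlined_code_size]
  end
end
```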

noahgibbs commented 3 years ago

(Oh hey, I should put this here...)

Google Sheets actually has really bad support for error bars - there's no easy way to drive each point's error bar from your data, only to set a constant-for-the-series amount of error bar on every point. So to get error bars that vary per bar, you have to make every bar its own series (and people do :-( ).

Other graphing choices that come to mind:

noahgibbs commented 3 years ago

Here's the current CSV file generated with the report, so we can do early exploration of graphing solutions:

https://gist.github.com/noahgibbs/1eab9c5ace46465bdccdd1c2934cbb4b

maximecb commented 3 years ago

The other approach that comes to mind is that we could return a stats hash with just the sizes and other no-runtime-impact stats, rather than nil, in non-RUBY_DEBUG YJIT. That would technically be an interface change since somebody might be checking for that nil. But it seems like a reasonable one at first blush.

I guess I would prefer to return a different hash, with most of the stats not available in release, but maybe it's fine to patch your Ruby build just for the benchmarking and not change our YJIT main? We can just "hack" this for the paper.

The table you generated looks good.

One difficulty, though, is that MJIT needs a lot more warmup. Technically, if you want the peak speedup, you would want to let it warm up until it's done compiling everything, so we know what its peak performance is. I'm remembering that MJIT has an equivalent to our call threshold: --jit-min-calls, I think. So maybe set --jit-min-calls=10 for MJIT, same as our default --yjit-call-threshold=10, and make sure to give all benchmarks 20+ warmup iterations.

Again good work and very happy you are so on top of this πŸ‘

noahgibbs commented 3 years ago

I'll plan to return the hash in prod with fewer elements, then. Didn't want to change the API without at least talking to you first. I have a hack that works, but it won't be hard to do it the other way.

I'm collecting some MJIT data now to start getting a feel for its warmup. I'll mess around with its min-calls param as well. I'll probably have the early rough warmup report with very non-final data on Monday, at a guess.

I realised I should be collecting all this on a c5.metal with dedicated tenancy, so I've dumped an AMI for making one of those. So aside from all the other reasons these are pre-public numbers, we should use a dedicated instance for all the final benchmarking. That adds over $3.00/hour for a c5, so I'm not going to do all the early testing on a dedicated instance. I'll also need to get GNU screen working and/or figure out the problem with nohup, because we don't want to be dependent on a terminal staying connected for the bigger benchmarks.

maximecb commented 3 years ago

I'll mess around with its min-calls param as well.

I think having min-calls be the same value as the YJIT call-threshold, which is currently 10, is the safest bet. It seems more "fair".

I'm collecting some MJIT data now to start getting a feel for its warmup. I'll mess around with its min-calls param as well. I'll probably have the early rough warmup report with very non-final data on Monday, at a guess.

Sounds good :)

That adds over $3.00/hour for a c5

I think this is ok. Obviously don't leave it running overnight doing nothing, but don't make your own life unnecessarily hard either.

I'll also need to get GNU screen working and/or figure out the problem with nohup

I've been using tmux with no issues if that helps.

noahgibbs commented 3 years ago

By "messing around" with min-calls I mean I'll start generally characterising MJIT's behaviour. min-calls does something slightly different for MJIT than YJIT -- and even with --jit-wait there's still some timing oddness, since that's not really meant as a production configuration. I should probably test warmup with and without --jit-wait, so we at least check what MJIT is actually designed to do.

Also I should run it manually for a while because --jit-wait doesn't necessarily work -- it waits (by default) 60 seconds, then if it can't get the compiled method it prints a warning and continues on its way. So: seems worth some playing around, to make sure we're getting what we think we're getting.

Also: I think I've figured out the problem with nohup. I think it works fine other than not flushing output from subprocesses. I didn't see any progress from long-running benchmarks like railsbench or jekyll and thought they crashed, but instead now I think they just didn't flush any output until the end of the run. That's not bad, if so, and I might be able to fix it by manually setting STDOUT.sync and/or STDERR.sync.
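If that's the issue, the fix should just be something along these lines near the top of the benchmark script (untested guess):

```ruby
# Flush output immediately instead of buffering until process exit,
# so progress stays visible under nohup / without an attached terminal.
STDOUT.sync = true
STDERR.sync = true
```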

noahgibbs commented 3 years ago

I've put in a PR for including the sizes in YJIT.runtime_stats in all configurations: https://github.com/Shopify/yjit/pull/132

noahgibbs commented 3 years ago

This issue is broad enough (data collection, MJIT warmup characterisation, graphing) that I'm going to split into sub-issues where appropriate. I've opened a "data collection" sub-issue: https://github.com/Shopify/yjit-metrics/issues/7

maximecb commented 3 years ago

Could you include the benchmarks 30k_methods and 30k_ifelse? They are synthetic benchmarks, but they tell an interesting story about code size.

noahgibbs commented 3 years ago

Here's where we are, in minimal text-formatted form, on the warmup report right now:

YJIT Warmup Report:

      bench  samples   iter #1   iter #5  iter #10  RSD #1  RSD #5  RSD #10
-----------  -------  --------  --------  --------  ------  ------  -------
 psych-load       10  2580.1ms  2572.5ms  2577.2ms   0.05%   0.08%    0.07%
30k_methods        6   702.2ms   631.9ms             0.06%   0.03%

-----------  -------  --------  --------  --------  ------  ------  -------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.

MJIT Warmup Report:

      bench  samples   iter #1   iter #5  iter #10  RSD #1  RSD #5  RSD #10
-----------  -------  --------  --------  --------  ------  ------  -------
 psych-load       10  2894.8ms  2645.5ms  2467.7ms   0.05%   0.04%    0.09%
30k_methods        3  6864.9ms  6986.2ms             0.06%   0.11%

-----------  -------  --------  --------  --------  ------  ------  -------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.

No JIT Warmup Report:

      bench  samples   iter #1   iter #5  iter #10  RSD #1  RSD #5  RSD #10
-----------  -------  --------  --------  --------  ------  ------  -------
 psych-load       10  2777.1ms  2765.6ms  2796.4ms   0.07%   0.07%    0.07%
30k_methods        6  5709.5ms  5665.2ms             0.06%   0.04%

-----------  -------  --------  --------  --------  ------  ------  -------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.

Truffle Warmup Report:

      bench  samples    iter #1   iter #5  iter #10  RSD #1  RSD #5  RSD #10
-----------  -------  ---------  --------  --------  ------  ------  -------
 psych-load       10  12903.3ms  7678.4ms  7375.4ms   0.08%   0.07%    0.05%
30k_methods        8  15957.9ms  4319.5ms             0.02%   0.04%

-----------  -------  ---------  --------  --------  ------  ------  -------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.

noahgibbs commented 3 years ago

These are trivial warmup CSV files, with Mac-generated data, but this is the intended format. Notice that where the text tables have blanks (e.g. TruffleRuby 30k_methods at iteration #10), the CSV files have empty strings.

https://gist.github.com/noahgibbs/07f78f91d65c5664b54698d3da35caae

maximecb commented 3 years ago

Looks good.

I'm going to provide some critical feedback, probably on the pedantic side, but I really want to make sure we get this 100% right.

RSD is relative standard deviation - the standard deviation divided by the mean of the series.

Being pedantic here, the Google states "[RSD] is expressed in percent and is obtained by multiplying the standard deviation by 100 and dividing this product by the average". Did you forget to multiply by 100? The RSD numbers look very small. I would have expected RSD closer to 1 to 3%, which is what yjit-metrics seems to yield on AWS, so 4-8% would make sense on your Mac.

I assume you already know this and you've only reported iterations 1, 5 and 10 for compactness in text form, but for the paper, we need to be able to produce this output for each iteration, 1, 2, 3, ..., N so we can graph it. For TruffleRuby, I expect the total warmup may take up to 50-100 iterations. Though going past 100 iterations is maybe not necessary. At that point, we've given them more than a fair chance to warm up. This is also why I would like the X axis in the graph to be in seconds, to convey to the reviewer how much real clock time this takes.

noahgibbs commented 3 years ago

Definitely storing all the iterations in the JSON files, and then running the report repeatedly against them. The report extracts specific iteration numbers, but they're all available.

D'oh! You're right on the relative stddev - I forgot to multiply by 100.0. Past time for me to put that in a method...
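Something like this is what I have in mind for the helper (a sketch, not the exact code that will land in the repo):

```ruby
# Relative standard deviation, expressed as a percentage of the mean.
def rel_stddev_pct(samples)
  mean   = samples.sum / samples.length.to_f
  stddev = Math.sqrt(samples.sum { |s| (s - mean)**2 } / (samples.length - 1))
  100.0 * stddev / mean
end
```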

noahgibbs commented 3 years ago

Okay, corrected and with some additional data collected:

YJIT Warmup Report:

       bench  samples   iter #1   iter #5  iter #10  iter #50  iter #100  RSD #1  RSD #5  RSD #10  RSD #50  RSD #100
------------  -------  --------  --------  --------  --------  ---------  ------  ------  -------  -------  --------
  psych-load       10  2580.1ms  2572.5ms  2577.2ms                        5.13%   7.77%    7.28%
 30k_methods       10   699.6ms   631.9ms                                  4.87%   2.80%
  30k_ifelse       10   396.1ms   271.1ms   275.1ms                        2.28%   2.80%    4.28%
activerecord       10   143.2ms   134.8ms   138.9ms   150.5ms    136.7ms   6.94%   5.07%    8.04%   17.01%     5.02%

------------  -------  --------  --------  --------  --------  ---------  ------  ------  -------  -------  --------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.

MJIT Warmup Report:

       bench  samples   iter #1   iter #5  iter #10  iter #50  iter #100  RSD #1  RSD #5  RSD #10  RSD #50  RSD #100
------------  -------  --------  --------  --------  --------  ---------  ------  ------  -------  -------  --------
  psych-load       10  2894.8ms  2645.5ms  2467.7ms                        4.55%   3.95%    8.74%
 30k_methods       10  6660.1ms  6845.4ms                                  5.45%   7.53%
  30k_ifelse       10  2829.6ms  2940.8ms  2915.4ms                        9.42%   8.55%    7.81%
activerecord       10   184.2ms   192.1ms   185.4ms   183.3ms    197.0ms   6.64%   9.85%    4.68%    6.56%    13.91%

------------  -------  --------  --------  --------  --------  ---------  ------  ------  -------  -------  --------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.

No-JIT Warmup Report:

       bench  samples   iter #1   iter #5  iter #10  iter #50  iter #100  RSD #1  RSD #5  RSD #10  RSD #50  RSD #100
------------  -------  --------  --------  --------  --------  ---------  ------  ------  -------  -------  --------
  psych-load       10  2777.1ms  2765.6ms  2796.4ms                        6.78%   6.87%    6.96%
 30k_methods       10  5682.2ms  5702.9ms                                  5.26%   4.80%
  30k_ifelse       10  2337.4ms  2287.5ms  2178.1ms                       10.47%  15.81%   11.75%
activerecord       10   167.7ms   168.3ms   171.3ms   164.5ms    164.2ms   6.19%   5.68%    7.01%    3.37%     3.73%

------------  -------  --------  --------  --------  --------  ---------  ------  ------  -------  -------  --------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.

Truffle Warmup Report:

       bench  samples    iter #1   iter #5  iter #10  iter #50  iter #100  RSD #1  RSD #5  RSD #10  RSD #50  RSD #100
------------  -------  ---------  --------  --------  --------  ---------  ------  ------  -------  -------  --------
  psych-load       10  12903.3ms  7678.4ms  7375.4ms                        8.01%   7.04%    4.64%
 30k_methods       10  15976.6ms  4296.8ms                                  1.56%   3.99%
  30k_ifelse       10  12852.8ms  4083.5ms  3311.6ms                        2.36%   5.69%    3.61%
activerecord       10   1556.5ms   275.4ms   274.9ms   158.0ms    117.6ms   8.04%   8.86%    9.32%   12.56%     9.24%

------------  -------  ---------  --------  --------  --------  ---------  ------  ------  -------  -------  --------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.

maximecb commented 3 years ago

Looks sensible πŸ‘

maximecb commented 3 years ago

I was thinking about this a little bit and, in terms of warmup time, plotting this into a graph is going to be nontrivial. I would like it if the X axis could be seconds, but the natural unit that we have for the X axis is the number of iterations. It would also be nice if we could have a standard deviation in the plot, but that only makes sense if we can align all the measurements based on the iteration number.

Some ideas:

Alternatively, we can stick with the X axis being the number of iterations and the Y axis being time (normalized or not). Then we'd just have to give the reader some idea of how much time this took in the text.

If you would like some help with the plotting, I've never used the Ruby libraries for generating SVG, but maybe I could jump in. Could be an opportunity for me to improve my Ruby. I would just need access to some raw data as JSON.

noahgibbs commented 3 years ago

I'll give you a link right now to the (very early, not amazing) Mac data I collected yesterday. I'm about to start a much better collection run in line with your current request (ActiveRecord + RailsBench, 20-minute runs, 20ish runs) and I'll give you access to that data too, once it exists.

https://drive.google.com/file/d/1hFjY5fNr5sFaq1pX44RZ2y-ugU7ASb9g/view?usp=sharing

The data I've just linked should work fine if you unpack it in a directory and point basic_report at it ("./basic_report.rb -d partial_mac_data --all --report=vmil_warmup"), which could get you started with writing reporting code.

Here's a link to the 'victor' gem used by the lee benchmark, which is what I was planning to use for Ruby SVG output: https://github.com/DannyBen/victor
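For what it's worth, a minimal victor sketch looks roughly like this (just to show the flavor of the API; the data and scaling are made up for illustration):

```ruby
require 'victor'

# Fabricated per-iteration mean times in ms, just for illustration
means = [2894.8, 2700.2, 2645.5, 2580.0, 2467.7]

svg = Victor::SVG.new width: 400, height: 200
svg.build do
  # Simple polyline of the warmup curve; the scaling is hard-coded for the sketch
  points = means.each_with_index.map { |m, i| "#{i * 80 + 20},#{(200 - m / 20).round}" }.join(' ')
  polyline points: points, fill: 'none', stroke: 'steelblue', stroke_width: 2
end
svg.save 'warmup_sketch' # writes warmup_sketch.svg
```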

noahgibbs commented 3 years ago

I feel like 20 runs isn't enough for me to turn off ASLR, so I won't do that for today's run. We may want to do that and have a longer run next week, though. Either way, it ought to be easy to re-run the reporting with more runs and/or iterations per run.

maximecb commented 3 years ago

Okay going to play with that a bit today :)

noahgibbs commented 3 years ago

Now that I'm running with a newer YJIT I'm seeing code sizes recorded for non-debug YJIT Rubies. Yay! That messes up reporting, which currently thinks only YJIT-stats Rubies have stats. Boo! I'll work on that.

I have some 20-minute ActiveRecord runs recorded but not a ton of them. I'm turning off interpreter runs (don't care as much about characterising warmup) and YJIT prod runs (segfaults not-infrequently) and I'll let it run overnight, which should get us good TruffleRuby and MJIT warmup data, assuming no more crashes.

Trying to think if there's an easy way to share data incrementally. In Dropbox I'd put it in a shared folder. But Google Drive has a much less stable local client, so I don't want to just do it the same way. I could keep uploading .tar.bz2 files through my browser and sharing links, but that gets old fast.

noahgibbs commented 3 years ago

Here's the existing warmup report for the data I've collected so far today:

YJIT Warmup Report:

       bench  samples  iter #1  iter #5  iter #10  iter #50  iter #100  iter #500  iter #1000  iter #5000  RSD #1  RSD #5  RSD #10  RSD #50  RSD #100  RSD #500  RSD #1000  RSD #5000
------------  -------  -------  -------  --------  --------  ---------  ---------  ----------  ----------  ------  ------  -------  -------  --------  --------  ---------  ---------
activerecord        3  139.5ms  134.2ms   133.4ms   133.2ms    133.0ms    133.1ms     132.9ms     133.1ms   0.26%   0.14%    0.50%    0.42%     0.24%     0.23%      0.09%      0.26%

------------  -------  -------  -------  --------  --------  ---------  ---------  ----------  ----------  ------  ------  -------  -------  --------  --------  ---------  ---------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.

MJIT Warmup Report:

       bench  samples  iter #1  iter #5  iter #10  iter #50  iter #100  iter #500  iter #1000  iter #5000  RSD #1  RSD #5  RSD #10  RSD #50  RSD #100  RSD #500  RSD #1000  RSD #5000
------------  -------  -------  -------  --------  --------  ---------  ---------  ----------  ----------  ------  ------  -------  -------  --------  --------  ---------  ---------
activerecord        3  167.9ms  166.5ms   166.4ms   166.5ms    166.6ms    166.3ms     166.5ms     166.5ms   1.05%   0.94%    0.98%    1.14%     1.06%     1.06%      1.23%      1.00%

------------  -------  -------  -------  --------  --------  ---------  ---------  ----------  ----------  ------  ------  -------  -------  --------  --------  ---------  ---------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.

Truffle Warmup Report:

       bench  samples   iter #1   iter #5  iter #10  iter #50  iter #100  iter #500  iter #1000  iter #5000  iter #10000  RSD #1  RSD #5  RSD #10  RSD #50  RSD #100  RSD #500  RSD #1000  RSD #5000  RSD #10000
------------  -------  --------  --------  --------  --------  ---------  ---------  ----------  ----------  -----------  ------  ------  -------  -------  --------  --------  ---------  ---------  ----------
activerecord        5  9869.4ms  2851.1ms  1506.9ms  1490.3ms    924.8ms    103.4ms      99.8ms     103.5ms      604.7ms   6.37%   7.80%   29.25%   52.59%    42.09%     9.47%      6.85%      8.23%      14.94%

------------  -------  --------  --------  --------  --------  ---------  ---------  ----------  ----------  -----------  ------  ------  -------  -------  --------  --------  ---------  ---------  ----------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.

noahgibbs commented 3 years ago

And here are the data files I used to get it: https://drive.google.com/file/d/1rzgqOiOYpOKMy96oPHtmphADfMTx7AQ5/view?usp=sharing

noahgibbs commented 3 years ago

After the most recent runs, here's where we are for MJIT and TruffleRuby data:

MJIT Warmup Report:

       bench  samples  iter #1  iter #5  iter #10  iter #50  iter #100  iter #500  iter #1000  iter #5000  RSD #1  RSD #5  RSD #10  RSD #50  RSD #100  RSD #500  RSD #1000  RSD #5000
------------  -------  -------  -------  --------  --------  ---------  ---------  ----------  ----------  ------  ------  -------  -------  --------  --------  ---------  ---------
activerecord       25  170.9ms  166.8ms   166.9ms   167.8ms    166.7ms    167.2ms     166.9ms     167.1ms   4.43%   0.67%    0.66%    2.87%     0.65%     1.06%      0.69%      0.86%

------------  -------  -------  -------  --------  --------  ---------  ---------  ----------  ----------  ------  ------  -------  -------  --------  --------  ---------  ---------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.

Truffle Warmup Report:

       bench  samples   iter #1   iter #5  iter #10  iter #50  iter #100  iter #500  iter #1000  iter #5000  iter #10000  RSD #1  RSD #5  RSD #10  RSD #50  RSD #100  RSD #500  RSD #1000  RSD #5000  RSD #10000
------------  -------  --------  --------  --------  --------  ---------  ---------  ----------  ----------  -----------  ------  ------  -------  -------  --------  --------  ---------  ---------  ----------
activerecord       27  9327.6ms  3017.6ms  1618.8ms  1167.5ms    765.0ms    102.1ms     104.8ms      99.8ms               12.72%   6.88%   24.31%   47.65%    42.39%     8.02%      8.90%      7.01%

------------  -------  --------  --------  --------  --------  ---------  ---------  ----------  ----------  -----------  ------  ------  -------  -------  --------  --------  ---------  ---------  ----------
Each iteration is a set of samples of that iteration in a series.
RSD is relative standard deviation - the standard deviation divided by the mean of the series.
Samples is the number of runs (samples taken) for each specific iteration number.

So those are fairly high stddevs for TruffleRuby, even for later iterations. The blank columns for iteration 10k are because some runs reached that iteration and some didn't, and I didn't want to report with fewer samples while claiming full quality.

But as you pointed out, the iterations are definitely more of a probability distribution than a sample, especially for Truffle and especially for earlier iterations.

I'll post the data for the other runs later -- today's technically a day off, I just wanted to be sure to not run the AWS instance all weekend.

noahgibbs commented 3 years ago

Also, the massive RSDs around iteration 50-100 may be a sign of that alternation between fast and slow that we talked about Truffle doing before. I'll need to examine the runs in detail to know for sure, and I haven't yet.

noahgibbs commented 3 years ago

Here's the data from the weekend: https://drive.google.com/file/d/1bw8C-4HMZ62miSeUcVil1MM0v2WszJXG/view?usp=sharing

I've taken down the older link, but all the AWS data files from there are included in the new archive.

maximecb commented 3 years ago

Thanks again for some great work. I think it's ok to cut everything at 1000 iterations for future runs. Should save a little bit of time.

Next we'll need to think about how to format this into a LaTeX table. IMO this should be done by a script. It might be nice to include the mean total time it takes YJIT/MJIT/TR to get to iteration 1, 5, 10, 100, ... as well, since that will give people some idea how long the warmup actually takes. Potentially I could do that myself, since you might be busy gathering warmup results for railsbench, and we still need to do a complete run to calculate the time on every benchmark.
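As a starting point, the script could emit table rows along these lines (a sketch; the column choices and the `rows` structure are placeholders, not the final format):

```ruby
# rows: e.g. [{ bench: "activerecord", iter1: 139.5, iter5: 134.2, iter10: 133.4 }, ...]
def to_latex_rows(rows)
  rows.map do |r|
    cells = [r[:bench], r[:iter1], r[:iter5], r[:iter10]].map do |c|
      c.is_a?(Numeric) ? format("%.1f ms", c) : c.to_s
    end
    cells.join(" & ") + " \\\\"
  end.join("\n")
end
```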

noahgibbs commented 3 years ago

For speedup data collection, I suspect we're going to want different numbers of warmup iterations for different benchmarks. 1000 iterations is about where TruffleRuby has reliably warmed up on the ActiveRecord benchmark, but 1000 iterations of RailsBench or Jekyll takes a very long time.

ActiveRecord is the fastest at something like 150-180ms depending on the Ruby config, while Jekyll can be in the range of 8-10 seconds per iteration.

So: I'll collect warmup data for the other benchmarks we plan to do a speed comparison on, then update the collector script to grab a proportional "chunk" of results with correct warmup iterations, etc. Then I can basically run it in a loop and the results will slowly improve over time as we collect more data.