ccristian / caliper

Automatically exported from code.google.com/p/caliper
Apache License 2.0

Provide a fast feedback loop for developing benchmarks #237

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Run example
2. Read console results

What is the expected output? What do you see instead?
The old version of Caliper showed microbenchmark results in the console.
The new version seems to only show the number of tests run, etc.

What version of the product are you using? On what operating system?
 a3703ae4544b

Please provide any additional information below.

Original issue reported on code.google.com by agrothberg on 8 Apr 2013 at 10:04

GoogleCodeExporter commented 9 years ago
I used to get results like this:

length     us linear runtime
   100   5.48 =
  1000  56.58 ===
 10000 553.78 ==============================

Original comment by agrothberg on 9 Apr 2013 at 8:33

GoogleCodeExporter commented 9 years ago
The old output was done by ConsoleReport.

As an example of this missing code, look at references to LinearTranslation. 
The old ConsoleReport used LinearTranslation. At present, the only code that 
references LinearTranslation is test code.

Original comment by agrothberg on 10 Apr 2013 at 12:50

GoogleCodeExporter commented 9 years ago
We've made a decision to remove results from the console output because there is 
no distillation of the information collected by Caliper that is compact enough to 
display on the console yet contains all of the information required to make 
_informed_ decisions based on the results.  The latest version of Caliper now 
posts to the updated webapp at microbenchmarks.appspot.com and displays the URL 
at the end of the run.  We're excited to get feedback on the new process, so 
please give it a try and let us know how it works for you.

Original comment by gak@google.com on 11 Apr 2013 at 11:06

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
I used the old version of caliper (which had ConsoleReport) quite a bit for 
microbenchmarking and I found the results quite informative.

I would get results like this:

 benchmark    ns linear runtime
  NanoTime 0.453 ==============================
SystemTime 0.221 ===============

which were pretty informative in making decisions.

I ran the new version of caliper and had the results from that benchmark run 
uploaded to the app.

By and large, I find the content on the web page to be the same as the console. 
Those text-based bar charts have been replaced with more colorful bars, but the 
results are more or less the same. I do understand that there is now additional 
metadata, but for the vast majority of my work, this new API seems to add 
overhead to my workflow without adding significant value.

While the web reporting might be the primary way of aggregating and reporting 
test results, I would argue that keeping the old console output for faster 
iterations of benchmarks would be very nice to have. I would even be willing to 
take a stab at updating the old ConsoleReport to implement ResultProcessor.

That being said, I do have a few comments on the new web system. One of them is 
that it isn't clear how to delete old "runs". Without this functionality the UI 
will quickly become cluttered.

Nor is it clear how to share my results with others (which I would see as a 
major advantage of the web app vs. the console).

Original comment by agrothberg on 12 Apr 2013 at 3:18

GoogleCodeExporter commented 9 years ago
"I would even be willing to take a stab at updating the old ConsoleReport to 
implement ResultProcessor."

Perfect!  Do that and add a line for it to your config.properties and you're 
done!  I would also like us to have a contrib area where you could share that 
class with others, but Greg and I haven't talked about it much.
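
For reference, a rough sketch of what that might look like -- the Trial and 
Measurement accessor names here are assumptions about the new API, not checked 
against the current source:

    // Hypothetical sketch; Trial/Measurement accessor names are assumed, not verified.
    import com.google.caliper.api.ResultProcessor;
    import com.google.caliper.model.Measurement;
    import com.google.caliper.model.Trial;

    public final class ConsoleSummaryProcessor implements ResultProcessor {
      @Override public void processTrial(Trial trial) {
        double totalNanos = 0, totalReps = 0;
        for (Measurement m : trial.measurements()) {
          totalNanos += m.value().magnitude();  // total measured time for this measurement
          totalReps  += m.weight();             // repetitions it covers
        }
        System.out.printf("trial mean: %.3f ns/rep%n", totalNanos / totalReps);
      }

      @Override public void close() {}
    }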

That the "results are more or less the same" will only be true in the most dead 
simple cases.

Original comment by kevinb@google.com on 13 Apr 2013 at 6:53

GoogleCodeExporter commented 9 years ago
I wish I could reply inline…

Re replaced with more colorful bars:
It's true that they're more colorful, but they're also more precise!  Your 
monitor has better resolution than a string of ='s, and it was distressing how 
often quite different numbers ended up with the same-length bars for that 
reason.  I'll also mention that this is by no means the end goal for these 
graphs.  I'm in the process of adding error bars and the like back into the 
results so that you're truly getting _more_ information.  Plus, hidden in the 
DOM are some "summary" rows which will soon display graphs and data about the 
individual measurements that compose a single table cell.

Re more overhead:
A shortcut is coming soon that should help.  microbenchmarks.appspot.com/lucky 
will take you to your most recent run (provided you're using an API key) so 
that you don't have to copy-and-paste the URL.

Re updating the old ConsoleReport:
Honestly, it's probably in the revision history somewhere.  The decision to axe 
it wasn't because of the new API (they're quite similar), but because people seem 
dead-set on using the console output as "results".  I've seen a few SO answers 
with Caliper output or similar that claim that X is faster than Y because these 
3 numbers and these three ASCII bars say so.  No regard for the machine.  No 
regard for the OS.  No regard for the VM.  No regard for the spread of the data.  
If I could believe that everyone would share your sensibility and treat console 
output as a development tool and the web UI tables as the result, I'd be happy 
to put it back in immediately.

Re deleting old runs:
Two outstanding features are to delete a run entirely and to disassociate it 
from an API key.  They're on their way.

Re sharing results:
Just send the link.  :-)

I hope all of this helps.

Original comment by gak@google.com on 15 Apr 2013 at 7:41

GoogleCodeExporter commented 9 years ago
When I send the link, does my API key appear in the URL? 

Original comment by agrothberg on 17 Apr 2013 at 12:04

GoogleCodeExporter commented 9 years ago

I was being somewhat facetious about the colorful bars; I do recognize that the 
equals signs are less precise than the HTML-rendered bars. However, the equals 
signs were captioned with numbers, which helped when results were close.

As for "errors bars" these can in themselves be misleading. There is a 
difference between the "standard error of the mean" trial time (which should 
->0 as number of tials -> infinity) and the estimate for the standard deviation 
in trial time (which should NOT tend toward zero).
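
To make the distinction concrete (illustrative arithmetic only, not Caliper code):

    import java.util.Arrays;

    final class SpreadVsError {
      // Sample standard deviation: spread of individual trial times;
      // does NOT shrink as more trials are added.
      static double stdDev(double[] trialTimes) {
        double mean = Arrays.stream(trialTimes).average().orElse(0);
        double ss = Arrays.stream(trialTimes).map(x -> (x - mean) * (x - mean)).sum();
        return Math.sqrt(ss / (trialTimes.length - 1));
      }

      // Standard error of the mean: uncertainty in the mean itself;
      // tends to zero as the number of trials grows.
      static double stdErrOfMean(double[] trialTimes) {
        return stdDev(trialTimes) / Math.sqrt(trialTimes.length);
      }
    }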

Simply plotting an error bar on the graph isn't very helpful if it is not clear 
which of these measures is in use. I actually think both of these measures are 
interesting; however, the standard error should be driven sufficiently close to 
zero given enough trial runs.

These two notions are also related to statistical and economic significance.

There are a couple of points I would like to address regarding the usefulness 
of console results:

There is the case where I am certain that doing XYZ in Java will be super cool 
and make my code faster. After I code it up and run Caliper, lo and behold, the 
run times are (statistically) the same. Sure, this COULD just be on my machine, 
but do I really need to upload the results and share the link? Many times it 
turns out the Java bytecode is the same and I was just being silly. Or....

I spend all this time coding up a (non-trivial) test and Caliper reports a run 
time of 0 ns (or 1 ns). Do I really need to upload this to the internet to know 
that my test is poorly written and the JIT has optimized away my code? I would 
argue that I can see this issue on the command line and then attempt to fix the 
test code without having to upload the results.

Original comment by agrothberg on 17 Apr 2013 at 2:02

GoogleCodeExporter commented 9 years ago
For a given Trial, how is the array of Measurements (made up of weights and 
values) converted to a final "runtime"? For example, I got the following:

description weight      value
runtime 19692924.000000 4.89578374E8ns
runtime 14017116.000000 3.51961134E8ns
runtime 23573572.000000 5.87860795E8ns
runtime 15044739.000000 3.77008712E8ns
runtime 22290364.000000 5.56851245E8ns
runtime 17069313.000000 4.25707493E8ns
runtime 15377889.000000 3.85582141E8ns
runtime 14652288.000000 3.67722974E8ns
runtime 24524040.000000 6.25610075E8ns

which resulted in:

https://microbenchmarks.appspot.com/runs/0992862c-2410-4ff0-8ac4-42f252cca080 
(results above are for nanoTime), a runtime of "25.059"
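
My best guess from the numbers: each value divided by its weight gives a 
per-repetition time, and the median of those nine per-rep times comes out to 
about 25.059 ns. Is that what Caliper actually does? A quick illustrative check, 
with the data from above hard-coded:

    // Illustrative check only -- not a statement of how Caliper computes it.
    import java.util.Arrays;

    public class RuntimeCheck {
      public static void main(String[] args) {
        double[] weights = {19692924, 14017116, 23573572, 15044739, 22290364,
                            17069313, 15377889, 14652288, 24524040};
        double[] valuesNs = {4.89578374E8, 3.51961134E8, 5.87860795E8, 3.77008712E8,
                             5.56851245E8, 4.25707493E8, 3.85582141E8, 3.67722974E8,
                             6.25610075E8};
        double[] perRep = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
          perRep[i] = valuesNs[i] / weights[i];   // ns per repetition
        }
        Arrays.sort(perRep);
        System.out.printf("median ns/rep = %.3f%n", perRep[perRep.length / 2]);  // ~25.059
      }
    }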

Original comment by agrothberg on 18 Apr 2013 at 12:31

GoogleCodeExporter commented 9 years ago
So, I've been considering this a bit.  I think there's a middle ground that we 
might be able to reach.  Right now we have output like this:

Starting experiment 1 of 32: {instrument=allocation, method=Sort, vm=default, 
parameters={distribution=SAWTOOTH, length=10}}
Complete!
Starting experiment 2 of 32: {instrument=allocation, method=Sort, vm=default, 
parameters={distribution=SAWTOOTH, length=100}}
Complete!
…

What if each successful experiment output summary statistics for its 
measurements?  (I'm using 0's rather than real data)

Starting experiment 1 of 32: {instrument=allocation, method=Sort, vm=default, 
parameters={distribution=SAWTOOTH, length=10}}
  objects: min=0.0, 1st qu.=0.0, median=0.0, mean=0.0, 3rd qu.=0.0, max=0.0
  bytes: min=0.0, 1st qu.=0.0, median=0.0, mean=0.0, 3rd qu.=0.0, max=0.0
Starting experiment 2 of 32: {instrument=micro, method=Sort, vm=default, 
parameters={distribution=SAWTOOTH, length=10}}
  runtime: min=0.0, 1st qu.=0.0, median=0.0, mean=0.0, 3rd qu.=0.0, max=0.0
…

FWIW, this is similar to the output from R's summary function.  The reason I 
like it is that it's quite specifically not a table.  It's a good summary that 
you can use to get a reasonable idea of the data, but nobody should try to use 
it as a "result".
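
Computing it is cheap; something along these lines (a sketch with a crude 
nearest-rank quartile, purely for illustration):

    import java.util.Arrays;

    final class FiveNumberSummary {
      // min / 1st quartile / median / mean / 3rd quartile / max, R-summary style.
      // Quartile indices use a crude nearest-rank rule, just to illustrate.
      static String summarize(double[] values) {
        double[] s = values.clone();
        Arrays.sort(s);
        double mean = Arrays.stream(s).average().orElse(0);
        return String.format(
            "min=%.1f, 1st qu.=%.1f, median=%.1f, mean=%.1f, 3rd qu.=%.1f, max=%.1f",
            s[0], s[s.length / 4], s[s.length / 2], mean,
            s[(3 * s.length) / 4], s[s.length - 1]);
      }
    }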

Thoughts?

Original comment by gak@google.com on 22 Apr 2013 at 5:09

GoogleCodeExporter commented 9 years ago
That seems reasonable.

Original comment by agrothberg on 23 Apr 2013 at 3:09

GoogleCodeExporter commented 9 years ago
I think you may find that this is going to be a non-starter for a lot of 
developers. 

I'm not really excited about adding a third-party dependency to all my 
benchmarks that can just break.

And specifically, it's broken for me now :-(

So caliper is completely useless to me at the moment because the app is down.

Original comment by burtona...@gmail.com on 26 Jul 2013 at 10:44

GoogleCodeExporter commented 9 years ago
We're working with the AE team to figure out why the webapp has had so many 
problems.  The poor reliability of the app has been a big headache.

That said, every time I feel like there may be an argument for a distilled 
benchmark report, I see a bug report like the one at 
https://code.google.com/p/guava-libraries/issues/detail?id=1486#c4 .  There is 
so little data being reported, so little information about the VM, and so little 
about the variability in the experiments that the data is virtually worthless.  
Not that I disagree with some of the conclusions, or think that the numbers 
might not reflect some version of some valid use case, but the report gives so 
little information that it just sort of masquerades as data.  It's a tough call.

Also, I do think that "completely useless" is a bit of an overstatement.  Given 
the HTTP 500 rates being recorded, running any run twice should yield a complete 
data set the vast majority of the time.

Original comment by gak@google.com on 26 Jul 2013 at 11:06

GoogleCodeExporter commented 9 years ago
So here's my perspective. I'm dead in the water now.

I'm trying to work on benchmarking my code and 95% of what I want is just a 
GENERAL understanding of the performance.  The console output WOULD give me 
that number.

You guys can just have a very big disclaimer saying "THE CONSOLE OUTPUT IS ONLY 
AN APPROXIMATION" ... or something along these lines.  

It's VERY similar to the situation of having microbenchmarks for all your 
hardware platforms.  Only having ONE platform is kind of useless.  You really 
need to test all the JVMs you're running on, as well as all the OSes and 
hardware.

But you wouldn't suggest actually HIDING this data from the user.

I think this is the state I'm in now.. I'm flying blind, the webapp is down, 
and I have no way to resolve this.

Just do both... show a brief console report and a link to the webapp.

Another option is to ship a native app too.. 

Original comment by burtona...@gmail.com on 26 Jul 2013 at 11:13

GoogleCodeExporter commented 9 years ago
Another idea here... how about just writing a static .html report, maybe using 
Velocity?

I'd still like the HTML reports, but I'd like them to be more like the JUnit 
HTML output so I can integrate them into our continuous integration system.

Original comment by burtona...@gmail.com on 27 Jul 2013 at 3:51

GoogleCodeExporter commented 9 years ago
First, I'm marking this as "Accepted" for now and updating the description. I 
realized that as of comment #11 we've been back to considering it.  Making 
development easier is a priority even if we don't have the best answer for that 
yet.  I'm still not sure that the console is the right way to go, but this 
remains an obvious gap.

Original comment by gak@google.com on 29 Jul 2013 at 6:37

GoogleCodeExporter commented 9 years ago
I agree the output can be misused, but you should understand that many Caliper 
users are experts, and they might use the limited console output for valid 
purposes.  What if you add it back, but just have it off by default? Then the 
"unwashed masses" or whatever won't abuse it much, but it's still available for 
those who know enough to dig for it.

Original comment by travis.d...@gmail.com on 22 Aug 2013 at 11:29

GoogleCodeExporter commented 9 years ago
As a first step, I've added some output in the format described in #11.  Give 
it a try.  Feedback welcome.

Original comment by gak@google.com on 27 Aug 2013 at 8:02

GoogleCodeExporter commented 9 years ago
My servers run on a LAN and cannot connect to the Internet. How can I get the 
results?

Original comment by trueman1...@gmail.com on 5 Sep 2013 at 6:50

GoogleCodeExporter commented 9 years ago
We have plans to distribute the webapp for you to run locally (Issue 255) and 
to enable manual uploading (Issue 259).  Hopefully, one of those solutions will 
work for you.

Original comment by gak@google.com on 5 Sep 2013 at 5:28

GoogleCodeExporter commented 9 years ago
If you could also just generate static HTML, that would be ideal as well.

Ideally I wouldn't need to set up a webapp.

Original comment by burtonat...@gmail.com on 5 Sep 2013 at 5:42

GoogleCodeExporter commented 9 years ago
Would you accept a contribution that restored the console output? How about one 
that included an error indication?

Apart from the technical discussion about whether console output is valid or 
will be misused, in many settings it may be against policy to upload results 
from internal benchmarks to a third party website (I don't even want to ask, 
because I'm sure the answer is "no, you may not").

Original comment by travis.d...@gmail.com on 5 Sep 2013 at 9:32

GoogleCodeExporter commented 9 years ago
There are no plans to add any more output than what was added in 
https://code.google.com/p/caliper/source/detail?r=f6ebf866113c51c207645c5001c706
c8c637acbf .

I'm not sure what you mean by "included an error indication".  Errors are 
always displayed on the console.

Also, keep in mind that all data that would otherwise be uploaded is always 
stored in your ~/.caliper/results folder.

Original comment by gak@google.com on 5 Sep 2013 at 9:41

GoogleCodeExporter commented 9 years ago
By "error" indication, I meant standard error and/or result std deviation, the 
lack of which seemed to be one of the points against console output.

Original comment by travis.d...@gmail.com on 5 Sep 2013 at 9:43

GoogleCodeExporter commented 9 years ago
Ah, got it.  That probably won't make it to the console, but there is an 
obvious gap between a json file and the public webapp that needs to be filled.  
The current plan is to release the webapp as a locally runnable binary to 
address it.

Original comment by gak@google.com on 5 Sep 2013 at 9:50

GoogleCodeExporter commented 9 years ago
It's a sad day when I need to launch a web app locally to interpret a few dozen 
bytes of json :(

Original comment by travis.d...@gmail.com on 5 Sep 2013 at 9:54

GoogleCodeExporter commented 9 years ago
With all due respect… and I think you guys have done a great job on caliper 
so far… I think you have your head in the clouds over the web application.  
Pun half intended ;)

I mean at Google I'm sure everything is cloud cloud cloud but the world just 
isn't in that state yet… nor does it always NEED to be.

There's really nothing wrong with static, pre-generated HTML output.  If the 
console doesn't do the job, just put a BIG warning on it telling people to view 
the web interface.

A static HTML file would solve 99% of everyone's problems I think.

Your concern about simple console output not showing enough detail would be 
addressed, and those of us behind firewalls, or not wanting to rely on a third 
party, could still get our reports.

Honestly this seems win/win for everyone :)

Original comment by burtonat...@gmail.com on 5 Sep 2013 at 9:59

GoogleCodeExporter commented 9 years ago
If you are interested in manually analyzing the results, I've made my code for 
this public at https://code.google.com/p/caliper-analyze/

The statistics in my tool are carefully chosen (if you find a better statistic, 
please contribute!), in particular with respect to numerical precision, which is 
just too easy to get wrong when computing variance. Note that the statistics in 
previous Caliper versions were flawed in exactly this way: 
https://code.google.com/p/caliper/issues/detail?id=200 . Since the webapp has 
not been released, I cannot see whether the code in there is more robust.

Right now, my tool only does some basic console output. I've been working a 
little bit on predicting the scalability factors (i.e. linear, n log n, 
quadratic) by regression, but that is not working reliably yet (cf. 
https://code.google.com/p/caliper/issues/detail?id=119 ).

Anyway, if you want a simple console-based way of analyzing your runs, this tool 
is probably useful for you. In particular, it will give you the properly 
weighted mean and the properly weighted standard deviation, both absolute and 
relative.
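
For the curious, a weighted mean and standard deviation can be computed in a 
numerically stable, incremental way roughly like this (a sketch of the general 
technique, not the actual caliper-analyze code):

    // Numerically stable weighted mean and standard deviation ("West" style update).
    // Sketch only -- not the caliper-analyze implementation.
    final class WeightedStats {
      private double weightSum = 0;
      private double mu = 0;   // running weighted mean
      private double m2 = 0;   // accumulated weighted squared deviations

      void add(double value, double weight) {
        weightSum += weight;
        double delta = value - mu;
        mu += (weight / weightSum) * delta;
        m2 += weight * delta * (value - mu);
      }

      double mean() { return mu; }

      double stdDev() { return Math.sqrt(m2 / weightSum); }  // population-style weighted std dev
    }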

The usual disclaimers about benchmarking of course still apply... in 
particular, if you see a high standard deviation, consider re-running the 
experiments with more iterations...

Original comment by erich.sc...@gmail.com on 1 Nov 2013 at 6:19