bencheeorg / benchee

Easy and extensible benchmarking in Elixir providing you with lots of statistics!

Property based benchmarking #249

Open PragTob opened 6 years ago

PragTob commented 6 years ago

This has been on my mind for a long time, but I figure it's worth getting out there:

We already have inputs. Property based testing is great, and support in the Elixir ecosystem is growing a lot. Why not just pass a generator in as an input and let it generate the inputs? The statistics will become more interesting!

As generators usually generate more complex data as they go on, it might be a bit weird: the first recorded run times will be fast while the later recorded run times (or memory measurements) will be higher. We might even need a different type of generator - only time will tell.
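If the growing size turns out to be a problem, StreamData can already pin the generation size with resize/2 (at least if I read the docs right) - a rough sketch:

# resize/2 fixes the generation size, so inputs stay comparably
# complex over the whole run instead of growing as we measure
fixed_size = StreamData.resize(StreamData.list_of(StreamData.integer()), 50)
Enum.take(fixed_size, 3)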

We need to save and show the seed to ensure reproducibility.

Once we have that there are a lot of optional things we could do:

This is definitely a >= 1.x or 2.x feature even ;)

devonestes commented 6 years ago

I’m curious - what do you think this would tell us about our code? Will we be able to clearly see from the results if our function is O(n) vs O(n^2) or something like that?

The only use I could see for this is asserting that a function is O(1) in time or O(1) in memory, since that's the only property that should be measurable for any random input of any size. But you could also assert that with two inputs that differ by an order of magnitude, without the need for stream_data.
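For example, something like this (the lookup function and key counts are made up, just to illustrate):

# Two inputs an order of magnitude apart - if the lookup is O(1),
# both should show roughly the same run time
Benchee.run(%{
  "map lookup" => fn map -> Map.get(map, 1) end
}, inputs: %{
  "1k keys"  => Map.new(1..1_000, fn i -> {i, i} end),
  "10k keys" => Map.new(1..10_000, fn i -> {i, i} end)
})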

PragTob commented 6 years ago

I haven't looked at it from the perspective of big-O notation. That might be doable, though.

I've looked at it more from the perspective of:

PragTob commented 6 years ago

Also, in case anything hits a cache on any layer (even CPU level...), more varied inputs ensure more accurate results :)

devonestes commented 6 years ago

Ok, so now I see how this might work - or at least I have some idea of how I think this could work. I was just totally missing it at first.

We could run each randomly generated input a certain number of times (configurable by the user), and after that number of runs we'd move on to the next input. Then, after the total benchmark time is up (also configurable by the user as it is now), we'd show the n fastest and slowest inputs and the stats for those inputs, and maybe the median as well? Is that along the lines of what you're thinking?

I just totally wasn't thinking about it in those terms at first, so the whole concept escaped me! I can certainly see value in that, although it does sound like a somewhat significant amount of work.
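To make the first scheme concrete, something like this - the options are completely made up, nothing like them exists in benchee today:

# Hypothetical API sketch - neither :generator nor :runs_per_input exist,
# this is just the shape of the idea
Benchee.run(%{
  "some important business code" => fn input -> Business.make_money(input) end
}, time: 10,
   generator: StreamData.integer(),
   runs_per_input: 100)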

The second thing though - getting a performance profile across a wide array of inputs - can be done now. Take the following example:

input_stream = StreamData.integer()
# StreamData generators are enumerable, so we can pull values one at a time
get_num = fn -> input_stream |> Enum.take(1) |> hd() end

Benchee.run(%{
  "some important business code" => fn -> Business.make_money(get_num.()) end
}, time: 10)

One of the awesome composability benefits of having benchee's API be "just a function" :tada:

PragTob commented 6 years ago

Yeah, I was thinking of both ways :) And I'm not quite sure which one I like better.

E.g. cases:

  1. run each input a certain number of times (gives you detailed insight into the behaviour of different inputs)
  2. wide array of inputs (some nice average numbers)

Your example could be improved a bit I think - right now get_num has an impact on the measured run time. We could stick it into a before_each hook and feed its result to the function for maximum benefit. That's sort of what I imagine the implementation would look like (probably as a separate step, though).
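Roughly like this, reusing your Business.make_money example (the dummy input is just a placeholder so the benchmarking function takes an argument):

input_stream = StreamData.integer()

Benchee.run(%{
  "some important business code" => fn num -> Business.make_money(num) end
}, time: 10,
   inputs: %{"random integers" => :ignored},
   # before_each runs outside the measured time and its return value
   # is what the benchmarking function receives as its argument
   before_each: fn _ -> input_stream |> Enum.take(1) |> hd() end)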

PragTob commented 6 years ago

(Note to self: "property based benchmarking" is probably a horrible name, as we're not checking properties of the benchmarks - it's more like StreamData-based benchmarking, random-data benchmarking, or fuzzing benchmarking or something.)