Inaccurate results when benchmarking `Phoenix.Token`

bencheeorg / benchee

Easy and extensible benchmarking in Elixir providing you with lots of statistics!

MIT License

1.41k stars 66 forks source link

Inaccurate results when benchmarking `Phoenix.Token` #296

Closed nathanl closed 5 years ago

nathanl commented 5 years ago

I ran a benchmark on Phoenix.Token to get info about how different :key_iterations values would impact performance.

To my surprise, benchee reports no difference in how long it takes sign + verify with 1 key iteration vs 20,000,000. However, there clearly is a difference, because the wall clock run time is very different depending on which value I use.

Here's the mix task where I run the comparison:

defmodule Mix.Tasks.BenchMarkToken do
  use Mix.Task
  alias MyApp.Endpoint

  @shortdoc "Compares token runs"
  def run(_) do
    Mix.Task.run("app.start")

    Benchee.run(
      %{
        "sign_and_verify" => fn key_iterations ->
          signed_token =
            Phoenix.Token.sign(
              Endpoint,
              "some salt",
              "some value",
              key_iterations: key_iterations
            )

          {:ok, "some value"} =
            Phoenix.Token.verify(
              Endpoint,
              "some salt",
              signed_token,
              key_iterations: key_iterations,
              max_age: :infinity
            )
        end
      },
      inputs: %{
        1 => 1,
        20_000_000 => 20_000_000
      }
    )
  end
end

If I comment out one or the other of the :inputs, I get large variations in measurement from the mix task itself.

For 1:

mix bench_mark_token  12.41s user 1.12s system 149% cpu 9.021 total

For 20_000_000

mix bench_mark_token  33.60s user 0.97s system 106% cpu 32.496 total

One of my colleagues guessed that this may be explained as follows:

Execution time may not include the BIF/NIF time used for the signing

PragTob commented 5 years ago

:wave:

benchee measures using a monotonic clock, so all time elapsed is considered no matter where it was spent.

The more likely source of this always being the same (imo) is that the signed value is cached somewhere (for a combination of secret, salt and input) and then just returned more or less immediately upon further invocation. As warmup isn't counted there should be no measurable difference.

I can see a couple of ways around this:

use randomly generated inputs, you could generate them in a before_each
bust the cache in an after_each
turn off the cache somehow

Of course all of these rely on my assumption that cache is involved which right now is the only way that I could explain that behaviour but I might be wrong :)

PragTob commented 5 years ago

Hi,

the behaviour heavily hints at what I was suspecting.

I mildy modified it (not as many iterations as it takes too long) and added a small custom formatter to look at the raw run times:

      inputs: %{
        1 => 1,
        20_000 => 20_000
      },
      warmup: 0,
      time: 1,
      formatters: [
        {Benchee.Formatters.Console, extended_statistics: true},
        fn suite ->
          Enum.each(suite.scenarios, fn scenario ->
            IO.puts("#{scenario.name}:")
            IO.inspect(scenario.run_time_data.samples)
          end)
        end
      ]
    )

as you can see, I also took warmup out of the equation (because that one big time where it is first calculated happens during warmup and hence doesn't show up in the results at all).

Here's what I get:

sign_and_verify:
[3464088.0, 55764.0, 30283.0, 29346.0, 26398.0, 27421.0, 26489.0, 26442.0,
 25288.0, 27206.0, 25236.0, 25531.0, 2.34e4, 22253.0, 24424.0, 25257.0, 25676.0,
 29571.0, 23954.0, 26265.0, 22351.0, 27024.0, 22443.0, 21854.0, 27460.0,
 41186.0, 26549.0, 29703.0, 23529.0, 22679.0, 29550.0, 22873.0, 2.23e4, 27988.0,
 23040.0, 23002.0, 27774.0, 22435.0, 22208.0, 26973.0, 22513.0, 22569.0,
 27488.0, 24514.0, 22431.0, 29543.0, 32822.0, 38169.0, 25355.0, 25009.0, ...]
sign_and_verify:
[40175774.0, 39955.0, 29489.0, 26335.0, 23168.0, 24290.0, 24974.0, 25533.0,
 25953.0, 26969.0, 23825.0, 27584.0, 22642.0, 26907.0, 22638.0, 22058.0,
 27950.0, 22145.0, 21529.0, 27354.0, 22368.0, 22736.0, 28298.0, 21946.0,
 21485.0, 27308.0, 22926.0, 22013.0, 28218.0, 22931.0, 23343.0, 27576.0,
 22926.0, 24748.0, 39374.0, 23984.0, 28529.0, 45464.0, 29909.0, 25449.0,
 26165.0, 22488.0, 22248.0, 23947.0, 24769.0, 25590.0, 27690.0, 22967.0,
 27042.0, 23329.0, ...]

you can see that after a first big initial time the times normalize. Caching at this point also makes sense so I'd say that's what it is and recommend one of the approaches suggested above :)

Cheers, Tobi

nathanl commented 5 years ago

Interesting, and thanks for looking into this. I agree that caching looks like a likely explanation based on your results. However, I've updated the test to use a random UUID as the value to sign, and I still see the pattern you show of long initial execution times tailing off to a shorter time.

Updated code:

defmodule Mix.Tasks.BenchMarkToken do
  use Mix.Task
  alias MyAppWeb.Endpoint

  @shortdoc "Compares token runs"
  def run(_) do
    Mix.Task.run("app.start")

    Benchee.run(
      %{
        "sign_and_verify" => fn key_iterations ->
          uuid = Ecto.UUID.generate()
          signed_token =
            Phoenix.Token.sign(
              Endpoint,
              "some salt",
              uuid,
              key_iterations: key_iterations
            )

          {:ok, ^uuid} =
            Phoenix.Token.verify(
              Endpoint,
              "some salt",
              signed_token,
              key_iterations: key_iterations,
              max_age: :infinity
            )
        end
      },
      inputs: %{
        1 => 1,
        200_000 => 200_000
      },
      warmup: 0,
      time: 1,
      formatters: [
        {Benchee.Formatters.Console, extended_statistics: true},
        fn suite ->
          Enum.each(suite.scenarios, fn scenario ->
            IO.puts("#{scenario.name} (#{scenario.input}):")
            IO.inspect(scenario.run_time_data.samples)
          end)
        end
      ]
    )
  end
end

produces

sign_and_verify (1):
[7768990.0, 126990.0, 83990.0, 146990.0, 80990.0, 93990.0, 75990.0, 72990.0,
 74990.0, 77990.0, 67990.0, 64990.0, 68990.0, 66990.0, 64990.0, 65990.0,
 74990.0, 65990.0, 64990.0, 72990.0, 67990.0, 61990.0, 65990.0, 66990.0,
 69990.0, 67990.0, 64990.0, 67990.0, 65990.0, 73990.0, 74990.0, 73990.0,
 68990.0, 66990.0, 65990.0, 65990.0, 66990.0, 83990.0, 69990.0, 66990.0,
 64990.0, 69990.0, 63990.0, 62990.0, 62990.0, 64990.0, 61990.0, 61990.0,
 74990.0, 57990.0, ...]
sign_and_verify (200000):
[260805990.0, 91990.0, 78990.0, 60990.0, 59990.0, 59990.0, 56990.0, 60990.0,
 55990.0, 58990.0, 53990.0, 53990.0, 53990.0, 55990.0, 52990.0, 52990.0,
 55990.0, 53990.0, 51990.0, 53990.0, 53990.0, 53990.0, 54990.0, 49990.0,
 53990.0, 53990.0, 58990.0, 60990.0, 56990.0, 62990.0, 49990.0, 48990.0,
 47990.0, 47990.0, 53990.0, 48990.0, 47990.0, 46990.0, 50990.0, 45990.0,
 44990.0, 45990.0, 47990.0, 44990.0, 64990.0, 57990.0, 55990.0, 55990.0,
 56990.0, 54990.0, ...]

It still could be something specific to this function, though. Will tinker further when I have time.

PragTob commented 5 years ago

Hey,

I looked at it a bit and both signing and verifying call out to KeyGenerator which takes the number of iterations, the algorithm, the salt and the secret - notably not the data to be signed - to generate a key to do the signing. call is here

That explains the behaviour. First call the key is generated which is slow, subsequent invocations are equally fast as intended.

As the salt needs to vary, you should vary the salt in your benchmarks not the user data. :)

Hope that helps, if anything comes up feel free to reopen.