elixir-lang / elixir

Elixir is a dynamic, functional language for building scalable and maintainable applications
https://elixir-lang.org/
Apache License 2.0

String.downcase/2 with `:ascii` performs worse than `:default` #7673

Closed tckb closed 6 years ago

tckb commented 6 years ago

While routinely checking the performance of String.downcase, I noticed that String.downcase/2 with :ascii performs much worse than the :default option.

As per the documentation, it should perform better, if not several orders of magnitude better, than the default case. My test input is always pure ASCII, without any trace of Unicode characters in it.

I have tried it with a simple benchmark script.

Benchmark script

Benchee.run(
      %{
        "string downcase" => fn -> String.downcase("ABCDEFGHIJKLMNO") end,
        "string downcase ascii" => fn -> String.downcase("ABCDEFGHIJKLMNO", :ascii) end
      },
      warmup: 5,
      time: 10,
      formatter_options: %{
        console: %{
          extended_statistics: true
        }
      }
    )

Results

Name                            ips        average  deviation         median         99th %
string downcase              1.25 M        0.80 μs  ±1023.30%        0.60 μs        2.40 μs
string downcase ascii        0.61 M        1.65 μs   ±297.88%        1.30 μs        4.40 μs

Comparison:
string downcase              1.25 M
string downcase ascii        0.61 M - 2.06x slower

Extended statistics:

Name                          minimum        maximum    sample size                     mode
string downcase               0.60 μs     4876.20 μs         1.13 M                  0.60 μs
string downcase ascii         1.20 μs     1537.40 μs       572.37 K                  1.30 μs

This seems quite surprising to me, even counter-intuitive. Am I missing something here?

Environment

Operating System: macOS
CPU Information: Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
Number of Available Cores: 4
Available memory: 8 GB
Elixir 1.6.4
Erlang 20.3.4

michalmuskala commented 6 years ago

I checked myself and this is indeed the case. I don't yet understand how that's possible, though.

tckb commented 6 years ago

Ah @michalmuskala, I was wondering if I'd have to test it further on another platform. Thanks for the confirmation.

bbense commented 6 years ago

I get similar results on OS X.

Name                            ips        average  deviation         median         99th %
string downcase            843.38 K        1.19 μs  ±1433.66%        0.70 μs        3.80 μs
string downcase ascii      260.78 K        3.83 μs  ±3878.60%           2 μs           7 μs

But the deviations are so high I wonder what else is going on.

aerosol commented 6 years ago

Does not seem to be platform related (I'm on Linux), but rather compiler related (I'm not sure at which level).

I'm sorry, I'm not qualified enough to tell what exactly the culprit is here.

Plain old recursive accumulator seems to be the fastest (and as fast as :default in practice).

defmodule Impl do

  def downcase_ascii_stdlib(string) when is_binary(string) do
    for <<x <- string>>,
      do: if(x >= ?A and x <= ?Z, do: <<x + 32>>, else: <<x>>),
      into: ""
  end

  for code <- ?A..?Z do
    def down_ascii(unquote(code)), do: <<unquote(code+32)>>
  end
  def down_ascii(other), do: <<other>>

  def down_ascii2(x) when x >= ?A and x <= ?Z, do: <<x + 32>>
  def down_ascii2(x), do: <<x>>

  def dn1(str) do
    for <<x <- str>>, do: down_ascii(x), into: ""
  end

  def dn2(str) do
    for <<x <- str>>, do: down_ascii2(x), into: ""
  end

  def dn3(x), do: dn3(x, [])
  def dn3(<<>>, acc), do: IO.iodata_to_binary(:lists.reverse(acc))
  def dn3(<<x, rest::bits>>, acc) do
    dn3(rest, [down_ascii(x) | acc])
  end

  # The Winner
  def dn4(x), do: dn3(x, [])
  def dn4(<<>>, acc), do: IO.iodata_to_binary(:lists.reverse(acc))
  def dn4(<<x, rest::bits>>, acc) do
    dn4(rest, [down_ascii2(x) | acc])
  end

  def dn5(x), do: dn3(x, [])
  def dn5(<<>>, acc), do: IO.iodata_to_binary(:lists.reverse(acc))
  for x <- ?A..?Z do
    def dn5(<<unquote(x), rest::bits>>, acc) do
      dn5(rest, [unquote(x+32) | acc])
    end
  end
  def dn5(<<x, rest::bits>>, acc) do
    dn5(rest, [x | acc])
  end

  def dn6(str) do
    IO.iodata_to_binary(
      (for <<x <- str>>, do: down_ascii2(x))
    )
  end
end
    Benchee.run(
      %{
        "string downcase ascii" => fn -> String.downcase("ABCDEFGHIJKLMNO", :ascii) end,
        "string downcase default" => fn -> String.downcase("ABCDEFGHIJKLMNO") end,
        "dn1 (for with per char pattern matching, precomputed)" => fn -> dn1("ABCDEFGHIJKLMNO") end,
        "dn2 (for with guarded pattern matching)" => fn -> dn2("ABCDEFGHIJKLMNO") end,
        "dn3 (recursion with accumulator and per char pattern matching)" => fn -> dn3("ABCDEFGHIJKLMNO") end,
        "dn4 (recursion with accumulator and guarded pattern matching)" => fn -> dn4("ABCDEFGHIJKLMNO") end,
        "dn5 (recursion with accumulator and per char pattern matching II)" => fn -> dn5("ABCDEFGHIJKLMNO") end,
        "dn6 (for into list and iodata to binary)" => fn -> dn5("ABCDEFGHIJKLMNO") end
      },
      warmup: 5,
      time: 10,
      formatter_options: %{
        console: %{
          extended_statistics: true
        }
      }
    )
Name                                                                        ips        average  deviation         median         99th %
dn4 (recursion with accumulator and guarded pattern matching)            1.84 M        0.54 μs   ±505.61%        0.50 μs        0.90 μs
dn3 (recursion with accumulator and per char pattern matching)           1.83 M        0.55 μs   ±531.24%        0.50 μs        0.90 μs
dn5 (recursion with accumulator and per char pattern matching II)        1.81 M        0.55 μs   ±469.04%        0.50 μs        1.10 μs
dn6 (for into list and iodata to binary)                                 1.80 M        0.56 μs   ±515.69%        0.50 μs           1 μs
string downcase default                                                  1.79 M        0.56 μs   ±480.32%        0.50 μs           1 μs
dn1 (for with per char pattern matching, precomputed)                    1.02 M        0.98 μs   ±445.10%        0.80 μs           2 μs
string downcase ascii                                                    0.77 M        1.31 μs   ±261.11%        1.10 μs        2.70 μs
dn2 (for with guarded pattern matching)                                  0.73 M        1.37 μs   ±245.79%        1.20 μs        2.90 μs

Comparison:
dn4 (recursion with accumulator and guarded pattern matching)            1.84 M
dn3 (recursion with accumulator and per char pattern matching)           1.83 M - 1.00x slower
dn5 (recursion with accumulator and per char pattern matching II)        1.81 M - 1.02x slower
dn6 (for into list and iodata to binary)                                 1.80 M - 1.02x slower
string downcase default                                                  1.79 M - 1.03x slower
dn1 (for with per char pattern matching, precomputed)                    1.02 M - 1.81x slower
string downcase ascii                                                    0.77 M - 2.41x slower
dn2 (for with guarded pattern matching)                                  0.73 M - 2.51x slower

Extended statistics:

Name                                                                      minimum        maximum    sample size                     mode
dn4 (recursion with accumulator and guarded pattern matching)             0.30 μs     1337.10 μs       921.96 K                  0.50 μs
dn3 (recursion with accumulator and per char pattern matching)            0.30 μs     1322.90 μs       919.63 K                  0.50 μs
dn5 (recursion with accumulator and per char pattern matching II)         0.30 μs     1149.50 μs       911.51 K                  0.50 μs
dn6 (for into list and iodata to binary)                                  0.30 μs     1602.90 μs       909.02 K                  0.50 μs
string downcase default                                                   0.40 μs        1366 μs       905.28 K                  0.50 μs
dn1 (for with per char pattern matching, precomputed)                     0.60 μs     1076.30 μs       653.05 K                  0.80 μs
string downcase ascii                                                     0.90 μs     1013.40 μs       535.08 K                  1.10 μs
dn2 (for with guarded pattern matching)                                   0.90 μs      798.30 μs       518.04 K                  1.10 μs

Here is what the stdlib implementation compiles to:

{function, downcase_ascii_stdlib, 1, 93}.
  {label,92}.
    {line,[{location,"lib/impl.ex",3}]}.
    {func_info,{atom,'Elixir.Impl'},{atom,downcase_ascii_stdlib},1}.
  {label,93}.
    {test,is_binary,{f,92},[{x,0}]}.
    {allocate,1,1}.
    {move,{x,0},{y,0}}.
    {move,{integer,256},{x,0}}.
    bs_init_writable.
    {move,{x,0},{x,1}}.
    {move,{y,0},{x,0}}.
    {call_last,2,{f,99},1}.

And dn4:

{function, dn4, 1, 20}.
  {label,19}.
    {line,[{location,"lib/impl.ex",31}]}.
    {func_info,{atom,'Elixir.Impl'},{atom,dn4},1}.
  {label,20}.
    {move,nil,{x,1}}.
    {call_only,2,{f,16}}.

{function, dn4, 2, 22}.
  {label,21}.
    {line,[{location,"lib/impl.ex",32}]}.
    {func_info,{atom,'Elixir.Impl'},{atom,dn4},2}.
  {label,22}.
    {test,bs_start_match2,{f,21},2,[{x,0},0],{x,0}}.
    {test,bs_test_tail2,{f,23},[{x,0},0]}.
    {allocate,0,2}.
    {move,{x,1},{x,0}}.
    {line,[{location,"lib/impl.ex",32}]}.
    {call_ext,1,{extfunc,lists,reverse,1}}.
    {line,[{location,"lib/impl.ex",32}]}.
    {call_ext_last,1,{extfunc,erlang,iolist_to_binary,1},0}.
  {label,23}.
    {test,bs_get_integer2,
          {f,24},
          2,
          [{x,0},
           {integer,8},
           1,
           {field_flags,[{anno,[33,{file,"lib/impl.ex"}]},unsigned,big]}],
          {x,2}}.
    {'%',{bin_opt,[33,{file,"lib/impl.ex"}]}}.
    {allocate,2,3}.
    {move,{x,0},{y,1}}.
    {move,{x,2},{x,0}}.
    {move,{x,1},{y,0}}.
    {line,[{location,"lib/impl.ex",34}]}.
    {call,1,{f,90}}.
    {test_heap,2,1}.
    {put_list,{x,0},{y,0},{x,1}}.
    {move,{y,1},{x,0}}.
    {call_last,2,{f,22},2}.
  {label,24}.
    {bs_context_to_binary,{x,0}}.
    {jump,{f,21}}.

Hope that saves someone a bit of work.

josevalim commented 6 years ago

Thanks @aerosol! Note that your dn4 and dn5 are simply calling dn3. Not sure if that's true in your actual implementation too. Given your benchmarks, here is what I would try:

  1. Try a larger string. Calling String.duplicate(..., 100) on the current input should be enough.

  2. Try a large string where there is no conversion. For example, the string we have today but where everything is already downcased.

  3. Try combinations of your winning implementation. I would try at least two:

a. one that inlines the check:

  def dn4(x), do: dn4(x, [])
  def dn4(<<x, rest::bits>>, acc) when x >= ?A and x <= ?Z, do: dn4(rest, [x - 32 | acc])
  def dn4(<<x, rest::bits>>, acc), do: dn4(rest, [x | acc])
  def dn4(<<>>, acc), do: IO.iodata_to_binary(:lists.reverse(acc))

b. and one that is not tail recursive:

  def dn4(x), do:  IO.iodata_to_binary(do_dn4(x))
  def do_dn4(<<x, rest::bits>>) when x >= ?A and x <= ?Z, do: [x - 32 | do_dn4(rest)]
  def do_dn4(<<x, rest::bits>>), do: [x | do_dn4(rest)]
  def do_dn4(<<>>), do: []
michalmuskala commented 6 years ago

It's surprising that binary comprehensions are slower than building a list and then calling IO.iodata_to_binary/1 on it. I'm almost tempted to report this as a bug in Erlang - it seems very counter-intuitive.
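For anyone following along, here is a minimal side-by-side sketch of the two styles being compared (illustrative only, not the actual stdlib code; `AsciiDown` is a made-up module name):

```elixir
defmodule AsciiDown do
  # Style 1: binary comprehension collecting directly into a binary,
  # roughly the shape the stdlib's :ascii path used at the time.
  def via_comprehension(str) when is_binary(str) do
    for <<x <- str>>, into: "" do
      if x in ?A..?Z, do: <<x + 32>>, else: <<x>>
    end
  end

  # Style 2: build a plain list of bytes, convert to a binary once at the end.
  def via_iodata(str) when is_binary(str) do
    bytes = for <<x <- str>>, do: if(x in ?A..?Z, do: x + 32, else: x)
    IO.iodata_to_binary(bytes)
  end
end
```

Both return the same binary for any input; the benchmarks in this thread suggest the second form was roughly twice as fast on this workload.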

tckb commented 6 years ago

Well, that's exactly my understanding too. Does that mean I'd have to review the places where there are comprehensions? But if this has been the case since the beginning, why would someone document it as faster than the :default case?

aerosol commented 6 years ago

Note that your dn4 and dn5 are simply calling dn3

I am terrible at copy-pasta, excellent catch

It's surprising binary comprehensions are slower than building a list and then calling IO.iodata_to_binary on it.

And reverse!

I'm almost tempted to report this as a bug in Erlang - it seems very counter-intuitive.

Uneducated guess, but is it possible it has anything to do with bs_init_writable (binary preallocation)?

aerosol commented 6 years ago

@josevalim thanks for all the pointers, I'll look into applying your suggestions

tckb commented 6 years ago

@josevalim @aerosol I did a quick bench of the suggestions:

def dn4_1(x), do: dn4_1(x, [])
def dn4_1(<<x, rest::bits>>, acc) when x >= ?A and x <= ?Z, do: dn4_1(rest, [x + 32 | acc])
def dn4_1(<<x, rest::bits>>, acc), do: dn4_1(rest, [x | acc])
def dn4_1(<<>>, acc), do: IO.iodata_to_binary(:lists.reverse(acc))

def dn4_2(x), do:  IO.iodata_to_binary(do_dn4(x))
def do_dn4(<<x, rest::bits>>) when x >= ?A and x <= ?Z, do: [x + 32 | do_dn4(rest)]
def do_dn4(<<x, rest::bits>>), do: [x | do_dn4(rest)]
def do_dn4(<<>>), do: []
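
As a quick sanity check, both variants agree on mixed input (IEx session, assuming the clauses above are compiled into the current module):

```elixir
iex> dn4_1("Hello, WORLD! 123")
"hello, world! 123"
iex> dn4_2("Hello, WORLD! 123")
"hello, world! 123"
```

Note that both functions work byte-wise and only shift the range ?A..?Z, so multi-byte UTF-8 codepoints pass through untouched, which matches the :ascii contract.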

Turns out the first combo beats the rest:

Name                            ips        average  deviation         median         99th %
string downcase dn4_1        1.90 M        0.53 μs  ±1353.28%        0.50 μs        0.90 μs
string downcase dn4_2        1.84 M        0.54 μs  ±1104.37%        0.50 μs           1 μs
string downcase              1.44 M        0.70 μs   ±775.59%        0.60 μs        1.20 μs
string downcase ascii        0.68 M        1.46 μs   ±274.65%        1.30 μs        2.70 μs

Comparison:
string downcase dn4_1        1.90 M
string downcase dn4_2        1.84 M - 1.03x slower
string downcase              1.44 M - 1.32x slower
string downcase ascii        0.68 M - 2.78x slower

also minor nit-pick, the downcasing is +32 and upcasing is -32 :)

aerosol commented 6 years ago

also minor nit-pick, the downcasing is +32 and upcasing is -32 :)

blame ASCII :)
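
(For reference, ASCII places each lowercase letter exactly 32 code points above its uppercase partner, i.e. they differ only in bit 0x20, which is where the +32/-32 comes from:)

```elixir
# ?A..?Z is 65..90, ?a..?z is 97..122, so the offset is always 32
?a - ?A        #=> 32
<<?A + 32>>    #=> "a"  (downcase adds 32)
<<?z - 32>>    #=> "Z"  (upcase subtracts 32)
```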

aerosol commented 6 years ago

does that mean, I'd have to review the places where there are comprehensions?

binary comprehensions, yes, if you really care about that level of performance

josevalim commented 6 years ago

Thanks everybody! Can somebody please send a PR updating both upcase and downcase code to dn4_1? Thank you!

tckb commented 6 years ago

Sure @josevalim, I'm preparing the PR for this.

michalmuskala commented 6 years ago

I think we should go with dn4_2 - even if it's marginally less performant it should create half as much garbage and pay off with less GC in the long term.

bbense commented 6 years ago

I think it would be wise to test performance on a wider selection of inputs, given all the unintuitive results so far. We want to make sure we aren't optimizing for a single use case.


tckb commented 6 years ago

@bbense already working on this, testing with bigger inputs. I'm still a bit puzzled by the performance difference between dn4_1 and dn4_2, small as it may be. Is this because of GC overhead?

josevalim commented 6 years ago

Closing in favor of the PR.