Closed tckb closed 6 years ago
I checked myself that this is the case. I don't understand yet, how that's possible, though.
Ah @michalmuskala , I was wondering if I have to test it further on another platform. thanks for the confirmation
I get similar results on OS X.
Name                    ips        average    deviation    median    99th %
string downcase         843.38 K   1.19 μs    ±1433.66%    0.70 μs   3.80 μs
string downcase ascii   260.78 K   3.83 μs    ±3878.60%    2 μs      7 μs
But the deviations are so high I wonder what else is going on.
Does not seem to be platform related (I'm on Linux), but compiler related (I'm not sure which part exactly).
I'm sorry I'm not qualified enough to tell what's exactly the culprit here.
A plain old recursive accumulator seems to be the fastest (and as fast as :default in practice).
defmodule Impl do
def downcase_ascii_stdlib(string) when is_binary(string) do
for <<x <- string>>,
do: if(x >= ?A and x <= ?Z, do: <<x + 32>>, else: <<x>>),
into: ""
end
for code <- ?A..?Z do
def down_ascii(unquote(code)), do: <<unquote(code+32)>>
end
def down_ascii(other), do: <<other>>
def down_ascii2(x) when x >= ?A and x <= ?Z, do: <<x + 32>>
def down_ascii2(x), do: <<x>>
def dn1(str) do
for <<x <- str>>, do: down_ascii(x), into: ""
end
def dn2(str) do
for <<x <- str>>, do: down_ascii2(x), into: ""
end
def dn3(x), do: dn3(x, [])
def dn3(<<>>, acc), do: IO.iodata_to_binary(:lists.reverse(acc))
def dn3(<<x, rest::bits>>, acc) do
dn3(rest, [down_ascii(x) | acc])
end
# The Winner
def dn4(x), do: dn3(x, [])
def dn4(<<>>, acc), do: IO.iodata_to_binary(:lists.reverse(acc))
def dn4(<<x, rest::bits>>, acc) do
dn4(rest, [down_ascii2(x) | acc])
end
def dn5(x), do: dn3(x, [])
def dn5(<<>>, acc), do: IO.iodata_to_binary(:lists.reverse(acc))
for x <- ?A..?Z do
def dn5(<<unquote(x), rest::bits>>, acc) do
dn5(rest, [unquote(x+32) | acc])
end
end
def dn5(<<x, rest::bits>>, acc) do
dn5(rest, [x | acc])
end
def dn6(str) do
IO.iodata_to_binary(
(for <<x <- str>>, do: down_ascii2(x))
)
end
end
Benchee.run(
%{
"string downcase ascii" => fn -> String.downcase("ABCDEFGHIJKLMNO", :ascii) end,
"string downcase default" => fn -> String.downcase("ABCDEFGHIJKLMNO") end,
"dn1 (for with per char pattern matching, precomputed)" => fn -> dn1("ABCDEFGHIJKLMNO") end,
"dn2 (for with guarded pattern matching)" => fn -> dn2("ABCDEFGHIJKLMNO") end,
"dn3 (recursion with accumulator and per char pattern matching)" => fn -> dn3("ABCDEFGHIJKLMNO") end,
"dn4 (recursion with accumulator and guarded pattern matching)" => fn -> dn4("ABCDEFGHIJKLMNO") end,
"dn5 (recursion with accumulator and per char pattern matching II)" => fn -> dn5("ABCDEFGHIJKLMNO") end,
"dn6 (for into list and iodata to binary)" => fn -> dn6("ABCDEFGHIJKLMNO") end
},
warmup: 5,
time: 10,
formatter_options: %{
console: %{
extended_statistics: true
}
}
)
Name ips average deviation median 99th %
dn4 (recursion with accumulator and guarded pattern matching) 1.84 M 0.54 μs ±505.61% 0.50 μs 0.90 μs
dn3 (recursion with accumulator and per char pattern matching) 1.83 M 0.55 μs ±531.24% 0.50 μs 0.90 μs
dn5 (recursion with accumulator and per char pattern matching II) 1.81 M 0.55 μs ±469.04% 0.50 μs 1.10 μs
dn6 (for into list and iodata to binary) 1.80 M 0.56 μs ±515.69% 0.50 μs 1 μs
string downcase default 1.79 M 0.56 μs ±480.32% 0.50 μs 1 μs
dn1 (for with per char pattern matching, precomputed) 1.02 M 0.98 μs ±445.10% 0.80 μs 2 μs
string downcase ascii 0.77 M 1.31 μs ±261.11% 1.10 μs 2.70 μs
dn2 (for with guarded pattern matching) 0.73 M 1.37 μs ±245.79% 1.20 μs 2.90 μs
Comparison:
dn4 (recursion with accumulator and guarded pattern matching) 1.84 M
dn3 (recursion with accumulator and per char pattern matching) 1.83 M - 1.00x slower
dn5 (recursion with accumulator and per char pattern matching II) 1.81 M - 1.02x slower
dn6 (for into list and iodata to binary) 1.80 M - 1.02x slower
string downcase default 1.79 M - 1.03x slower
dn1 (for with per char pattern matching, precomputed) 1.02 M - 1.81x slower
string downcase ascii 0.77 M - 2.41x slower
dn2 (for with guarded pattern matching) 0.73 M - 2.51x slower
Extended statistics:
Name minimum maximum sample size mode
dn4 (recursion with accumulator and guarded pattern matching) 0.30 μs 1337.10 μs 921.96 K 0.50 μs
dn3 (recursion with accumulator and per char pattern matching) 0.30 μs 1322.90 μs 919.63 K 0.50 μs
dn5 (recursion with accumulator and per char pattern matching II) 0.30 μs 1149.50 μs 911.51 K 0.50 μs
dn6 (for into list and iodata to binary) 0.30 μs 1602.90 μs 909.02 K 0.50 μs
string downcase default 0.40 μs 1366 μs 905.28 K 0.50 μs
dn1 (for with per char pattern matching, precomputed) 0.60 μs 1076.30 μs 653.05 K 0.80 μs
string downcase ascii 0.90 μs 1013.40 μs 535.08 K 1.10 μs
dn2 (for with guarded pattern matching) 0.90 μs 798.30 μs 518.04 K 1.10 μs
What the stdlib implementation does is:
{function, downcase_ascii_stdlib, 1, 93}.
{label,92}.
{line,[{location,"lib/impl.ex",3}]}.
{func_info,{atom,'Elixir.Impl'},{atom,downcase_ascii_stdlib},1}.
{label,93}.
{test,is_binary,{f,92},[{x,0}]}.
{allocate,1,1}.
{move,{x,0},{y,0}}.
{move,{integer,256},{x,0}}.
bs_init_writable.
{move,{x,0},{x,1}}.
{move,{y,0},{x,0}}.
{call_last,2,{f,99},1}.
And dn4:
{function, dn4, 1, 20}.
{label,19}.
{line,[{location,"lib/impl.ex",31}]}.
{func_info,{atom,'Elixir.Impl'},{atom,dn4},1}.
{label,20}.
{move,nil,{x,1}}.
{call_only,2,{f,16}}.
{function, dn4, 2, 22}.
{label,21}.
{line,[{location,"lib/impl.ex",32}]}.
{func_info,{atom,'Elixir.Impl'},{atom,dn4},2}.
{label,22}.
{test,bs_start_match2,{f,21},2,[{x,0},0],{x,0}}.
{test,bs_test_tail2,{f,23},[{x,0},0]}.
{allocate,0,2}.
{move,{x,1},{x,0}}.
{line,[{location,"lib/impl.ex",32}]}.
{call_ext,1,{extfunc,lists,reverse,1}}.
{line,[{location,"lib/impl.ex",32}]}.
{call_ext_last,1,{extfunc,erlang,iolist_to_binary,1},0}.
{label,23}.
{test,bs_get_integer2,
{f,24},
2,
[{x,0},
{integer,8},
1,
{field_flags,[{anno,[33,{file,"lib/impl.ex"}]},unsigned,big]}],
{x,2}}.
{'%',{bin_opt,[33,{file,"lib/impl.ex"}]}}.
{allocate,2,3}.
{move,{x,0},{y,1}}.
{move,{x,2},{x,0}}.
{move,{x,1},{y,0}}.
{line,[{location,"lib/impl.ex",34}]}.
{call,1,{f,90}}.
{test_heap,2,1}.
{put_list,{x,0},{y,0},{x,1}}.
{move,{y,1},{x,0}}.
{call_last,2,{f,22},2}.
{label,24}.
{bs_context_to_binary,{x,0}}.
{jump,{f,21}}.
Hope that saves someone a bit of work.
Thanks @aerosol! Note that your dn4 and dn5 are simply calling dn3. Not sure if that's true in your actual implementation too. Given your benchmarks, here is what I would try:
1. Try a larger string. Calling String.duplicate(..., 100) on the current input should be enough.
2. Try a large string where there is no conversion. For example, the string we have today but where everything is already downcased.
3. Try combinations of your winning implementation. I would try at least two:
a. one that inlines the check:
def dn4(x), do: dn4(x, [])
def dn4(<<x, rest::bits>>, acc) when x >= ?A and x <= ?Z, do: dn4(rest, [x - 32 | acc])
def dn4(<<x, rest::bits>>, acc), do: dn4(rest, [x | acc])
def dn4(<<>>, acc), do: IO.iodata_to_binary(:lists.reverse(acc))
b. and one that is not tail recursive:
def dn4(x), do: IO.iodata_to_binary(do_dn4(x))
def do_dn4(<<x, rest::bits>>) when x >= ?A and x <= ?Z, do: [x - 32 | do_dn4(rest)]
def do_dn4(<<x, rest::bits>>), do: [x | do_dn4(rest)]
def do_dn4(<<>>), do: []
It's surprising binary comprehensions are slower than building a list and then calling IO.iodata_to_binary
on it. I'm almost tempted to report this as a bug in Erlang - it seems very counter-intuitive.
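For readers following along, the two approaches being compared can be reduced to this minimal sketch (module and function names here are illustrative, not from the stdlib):

```elixir
defmodule DowncaseSketch do
  # Binary comprehension: appends each downcased byte directly into a binary,
  # which the compiler implements via bs_init_writable and binary appends.
  def via_comprehension(str) when is_binary(str) do
    for <<x <- str>>, into: "" do
      if x >= ?A and x <= ?Z, do: <<x + 32>>, else: <<x>>
    end
  end

  # List accumulation: builds a plain list of bytes first, then converts
  # to a binary once at the very end.
  def via_iodata(str) when is_binary(str) do
    IO.iodata_to_binary(
      for <<x <- str>>, do: if(x >= ?A and x <= ?Z, do: x + 32, else: x)
    )
  end
end

IO.inspect(DowncaseSketch.via_comprehension("Hello, World!"))  # "hello, world!"
IO.inspect(DowncaseSketch.via_iodata("Hello, World!"))         # "hello, world!"
```

The surprise in the benchmarks above is that the second form, despite allocating an intermediate list, comes out faster than the first.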
Well, that's exactly what my understanding is. Does that mean I'd have to review the places where there are comprehensions? But if this was the case since the beginning, why would someone document it to be faster than the :default case?
Note that your dn4 and dn5 are simply calling dn3
I am terrible at copy-pasta, excellent catch
It's surprising binary comprehensions are slower than building a list and then calling IO.iodata_to_binary on it.
And reverse!
I'm almost tempted to report this as a bug in Erlang - it seems very counter-intuitive.
Uneducated guess, but is it possible it has anything to do with bs_init_writable (binary preallocation)?
@josevalim thanks for all the pointers, I'll look into applying your suggestions
@josevalim @aerosol did a quick bench on the suggestions,
def dn4_1(x), do: dn4_1(x, [])
def dn4_1(<<x, rest::bits>>, acc) when x >= ?A and x <= ?Z, do: dn4_1(rest, [x + 32 | acc])
def dn4_1(<<x, rest::bits>>, acc), do: dn4_1(rest, [x | acc])
def dn4_1(<<>>, acc), do: IO.iodata_to_binary(:lists.reverse(acc))
def dn4_2(x), do: IO.iodata_to_binary(do_dn4(x))
def do_dn4(<<x, rest::bits>>) when x >= ?A and x <= ?Z, do: [x + 32 | do_dn4(rest)]
def do_dn4(<<x, rest::bits>>), do: [x | do_dn4(rest)]
def do_dn4(<<>>), do: []
Turns out the first combo beats the rest:
Name ips average deviation median 99th %
string downcase dn4_1 1.90 M 0.53 μs ±1353.28% 0.50 μs 0.90 μs
string downcase dn4_2 1.84 M 0.54 μs ±1104.37% 0.50 μs 1 μs
string downcase 1.44 M 0.70 μs ±775.59% 0.60 μs 1.20 μs
string downcase ascii 0.68 M 1.46 μs ±274.65% 1.30 μs 2.70 μs
Comparison:
string downcase dn4_1 1.90 M
string downcase dn4_2 1.84 M - 1.03x slower
string downcase 1.44 M - 1.32x slower
string downcase ascii 0.68 M - 2.78x slower
Also a minor nit-pick: the downcasing is +32 and upcasing is -32 :)
also minor nit-pick, the downcasing is +32 and upcasing is -32 :)
blame ASCII :)
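To spell out the ASCII blame: uppercase letters start at code point 65 (?A) and lowercase at 97 (?a), exactly 32 apart, so downcasing adds 32 and upcasing subtracts it:

```elixir
# ASCII layout: the two letter ranges are offset by 32.
IO.inspect(?A)           # 65
IO.inspect(?a)           # 97
IO.inspect(?a - ?A)      # 32
IO.inspect(<<?A + 32>>)  # "a"  (downcase: add 32)
IO.inspect(<<?a - 32>>)  # "A"  (upcase: subtract 32)
```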
does that mean, I'd have to review the places where there are comprehensions?
binary comprehensions, yes, if you really care about that level of performance
Thanks everybody! Can somebody please send a PR updating both upcase and downcase code to dn4_1? Thank you!
sure @josevalim I'm preparing the PR for this.
I think we should go with dn4_2 - even if it's marginally less performant, it should create half as much garbage and pay off with less GC in the long term.
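The garbage argument in a trivial sketch: the tail-recursive dn4_1 builds the accumulator list in reverse and then :lists.reverse/1 allocates a second, fresh list of the same length, while the body-recursive dn4_2 builds the list in final order once:

```elixir
# dn4_1 style: accumulator comes out reversed, so a second list
# of equal length is allocated just to restore the order.
acc = [?c, ?b, ?a]
IO.inspect(IO.iodata_to_binary(:lists.reverse(acc)))  # "abc"

# dn4_2 style: the body-recursive build already yields final order,
# so no reversal pass (and no second list) is needed.
IO.inspect(IO.iodata_to_binary([?a, ?b, ?c]))         # "abc"
```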
I think it would be wise to test performance on a wider selection of inputs, given all the unintuitive results so far. We want to make sure we aren't optimizing for a single use case.
On Sat, May 12, 2018 at 7:52 AM, Michał Muskała notifications@github.com wrote:
I think we should go with dn4_2 - even if it's marginally less performant it should create half as much garbage and pay off with less GC in the long term.
@bbense I'm already working on this, including bigger inputs and testing on those. I'm still a bit puzzled by the performance difference between dn4_1 and dn4_2, small as it is. Is this because of the GC overhead?
Closing in favor of the PR.
While routinely checking on the performance of String.downcase, I noticed that String.downcase/2 with :ascii performs much worse than the default option. As per the documentation, it should perform better, if not several orders of magnitude better, than the default case. My test input is always an ASCII input without any smell of unicode characters in it.
I have tried it with a simple benchmark script.
Benchmark script
Results
This seems quite surprising to me, even counter-intuitive. Am I missing something here?
Environment
Operating System: macOS CPU Information: Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz Number of Available Cores: 4 Available memory: 8 GB Elixir 1.6.4 Erlang 20.3.4