JuliaLang / julia


`isequal(::String, ::String)` performance #50442

Open chrstphrbrns opened 1 year ago

chrstphrbrns commented 1 year ago
julia> VERSION
v"1.11.0-DEV.0"

julia> using BenchmarkTools

julia> s="abc"
"abc"

julia> s_same="abc"
"abc"

julia> s_different="xyz"
"xyz"

julia> s_longer="xyzz"
"xyzz"

julia> function test(f)
           print("s,s"); @btime $f(s,s);
           print("s,s_same"); @btime $f(s,s_same);
           print("s,s_different"); @btime $f(s,s_different);
           print("s,s_longer"); @btime $f(s,s_longer);
       end
test (generic function with 1 method)

julia> isequal_current(a::String,b::String) = isequal(a,b)
isequal_current (generic function with 1 method)

julia> isequal_simple(a::String,b::String) = codeunits(a) == codeunits(b)
isequal_simple (generic function with 1 method)

julia> isequal_fancy(a::String,b::String) = ncodeunits(a) != ncodeunits(b) ? false : (ncodeunits(a) == 0 && ncodeunits(b) == 0 || (unsafe_load(pointer(a)) == unsafe_load(pointer(b)) && codeunits(a) == codeunits(b)))
isequal_fancy (generic function with 1 method)

julia> test(isequal_current)
s,s  2.589 ns (0 allocations: 0 bytes)
s,s_same  6.569 ns (0 allocations: 0 bytes)
s,s_different  6.570 ns (0 allocations: 0 bytes)
s,s_longer  4.469 ns (0 allocations: 0 bytes)
false

julia> test(isequal_simple)
s,s  4.939 ns (0 allocations: 0 bytes)
s,s_same  5.210 ns (0 allocations: 0 bytes)
s,s_different  4.419 ns (0 allocations: 0 bytes)
s,s_longer  4.000 ns (0 allocations: 0 bytes)
false

julia> test(isequal_fancy)
s,s  5.240 ns (0 allocations: 0 bytes)
s,s_same  5.500 ns (0 allocations: 0 bytes)
s,s_different  2.830 ns (0 allocations: 0 bytes)
s,s_longer  3.079 ns (0 allocations: 0 bytes)
false
| Function | (s, s) | (s, s_same) | (s, s_different) | (s, s_longer) |
| --- | --- | --- | --- | --- |
| isequal_current | 2.589 ns | 6.569 ns | 6.570 ns | 4.469 ns |
| isequal_simple | 4.939 ns | 5.210 ns | 4.419 ns | 4.000 ns |
| isequal_fancy | 5.240 ns | 5.500 ns | 2.830 ns | 3.079 ns |
Seelengrab commented 1 year ago

What exactly is the issue here? String comparisons have to handle many more different cases than just lengths 3 & 4. It's a balancing act between longer & shorter lengths, and if we're one nanosecond slower on such short strings but faster on longer ones, I'd say that's a worthwhile tradeoff.

oscardssmith commented 1 year ago

It does seem like switching the check for egal and length might be worthwhile. It seems very plausible to me that string lengths are different more often than two identical strings are compared. (although obviously the performance difference will be minuscule)

chrstphrbrns commented 1 year ago

That's a fair point. Is there any evidence that the current implementation is optimal (given whatever assumptions)?

Even as I experiment with longer strings, it seems the only case where the current implementation wins is when the strings are identical, and that's presumably just by doing a pointer check, which the others could do as well

oscardssmith commented 1 year ago
julia> using Random

julia> s = randstring(1000)
"Fl75lZFq7eXmrj2vSEkDouDStaFKJrQiyFICA21qMWdhFBguSrPDn8BPDnuY1ZhDM9gEJJygRl0o3edyCiYodYPTqpZ2yg4UYnJl7ZtdOsSI9ri0PbY7t5vqwbOfn6IJulX3xfdSX2UZSxPhMH0TMVa7trpsjfm8wOskcicVjtzC5qByBlUlH82YSeeWVyQF07xN9wTI8MgKIDnMSoMV4pvJeDgCLqiDZBIzDa3hyyeGIZSBvsrjFyiTqJm7Oxb8eWNhrHnO2FchzlaWbtWJAdWbJa5Ep7cEq428fyu4JPvMDkjuVFcMk2hhf4uF4kfvG76fzLGZG9awOlCwAbwC8WiR8Y4gufxwx3CCrPFbDYDQ2hV2Sk4HLWfe4SZLnZ75ov2fDP26aw1IX3fJSTypUdFS0IRhegB9tK" ⋯ 165 bytes ⋯ "Kmgw4w9lJiPB5a0IWCBv6YCx5ysCTg1x7HpWZivLBcCRHjBtMguqqPdLQdl5t1JbLG4qamfTrMShNFvnRLcK4SeICQQ9GFchAPs6gWnhbExiUhoJdK9cREzDC2olztLPmfZfGnNjC8Jwg3a2yklzhYqB3RymXb4qf7J6SlxCrBPIQD1m01Ym4KsKKMh2H3UWRrd5rPlHnjZHGAjtj6oG7kRACXTQ1gGfsgFgZ51A44KVxzu4pyL9ESWXlCCqLyTLij56KQVvKveM1ZEWRDCsGd3heMkWc9oPl7ri2O6tin9SlExNTfD0gxclp93j8zvEPBEH4rjxhLT2wNv2HOqQzLPj52raa5UL3zLl7YyvGGkIzSFvh1tfd0wYPPKoHrY6TQqktHEEVBG5MSuYIgXpQ7DanDOaXQQlX"

julia> s2 = s[1:500]*'a'*s[502:end]
"Fl75lZFq7eXmrj2vSEkDouDStaFKJrQiyFICA21qMWdhFBguSrPDn8BPDnuY1ZhDM9gEJJygRl0o3edyCiYodYPTqpZ2yg4UYnJl7ZtdOsSI9ri0PbY7t5vqwbOfn6IJulX3xfdSX2UZSxPhMH0TMVa7trpsjfm8wOskcicVjtzC5qByBlUlH82YSeeWVyQF07xN9wTI8MgKIDnMSoMV4pvJeDgCLqiDZBIzDa3hyyeGIZSBvsrjFyiTqJm7Oxb8eWNhrHnO2FchzlaWbtWJAdWbJa5Ep7cEq428fyu4JPvMDkjuVFcMk2hhf4uF4kfvG76fzLGZG9awOlCwAbwC8WiR8Y4gufxwx3CCrPFbDYDQ2hV2Sk4HLWfe4SZLnZ75ov2fDP26aw1IX3fJSTypUdFS0IRhegB9tK" ⋯ 165 bytes ⋯ "Kmgw4w9lJiPB5a0IWCBv6YCx5ysCTg1x7HpWZivLBcCRHjBtMguqqPdLQdl5t1JbLG4qamfTrMShNFvnRLcK4SeICQQ9GFchAPs6gWnhbExiUhoJdK9cREzDC2olztLPmfZfGnNjC8Jwg3a2yklzhYqB3RymXb4qf7J6SlxCrBPIQD1m01Ym4KsKKMh2H3UWRrd5rPlHnjZHGAjtj6oG7kRACXTQ1gGfsgFgZ51A44KVxzu4pyL9ESWXlCCqLyTLij56KQVvKveM1ZEWRDCsGd3heMkWc9oPl7ri2O6tin9SlExNTfD0gxclp93j8zvEPBEH4rjxhLT2wNv2HOqQzLPj52raa5UL3zLl7YyvGGkIzSFvh1tfd0wYPPKoHrY6TQqktHEEVBG5MSuYIgXpQ7DanDOaXQQlX"

julia> length(s2)
1000

julia> @btime isequal_simple(s, s2)
  269.955 ns (0 allocations: 0 bytes)
false

julia> @btime isequal(s, s2)
  16.542 ns (0 allocations: 0 bytes)
false

julia> @btime isequal_fancy(s, s2)
  270.984 ns (0 allocations: 0 bytes)
false
oscardssmith commented 1 year ago

It looks like the cutoff is around 20 characters. I think it's probable that most strings are under 20 chars so it might be worth seeing if we could save that nanosecond for short strings while keeping the current speed for long ones. It won't be especially easy because branching on the length will likely eat the nanosecond that you save.
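
A minimal sketch of that kind of cutoff (hypothetical; the 20-code-unit threshold is just the crossover measured above, and the extra branch may well eat the saving):

# Hypothetical hybrid: byte-wise comparison for short strings, Base's
# memcmp-based isequal for longer ones.
function isequal_hybrid(a::String, b::String)
    n = ncodeunits(a)
    n == ncodeunits(b) || return false
    n <= 20 && return codeunits(a) == codeunits(b)  # short-string path
    return isequal(a, b)                            # long-string path (memcmp)
end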

chrstphrbrns commented 1 year ago

@oscardssmith has indeed met my challenge, but it's a very special case of identical length + long common prefix

Modifying isequal_fancy to use isequal as a fallback solves that, making it a tiny bit slower for that special case, but still significantly faster for other cases (if only nanoseconds of absolute time)
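
Something along these lines (a sketch, not necessarily the exact code behind the numbers):

# isequal_fancy with isequal as the fallback: keep the length and first-byte
# fast exits, but let Base's memcmp-based path do the full comparison so a
# long common prefix stays fast.
function isequal_fancy_fallback(a::String, b::String)
    ncodeunits(a) == ncodeunits(b) || return false
    ncodeunits(a) == 0 && return true
    codeunit(a, 1) == codeunit(b, 1) || return false  # cheap first-byte check
    return isequal(a, b)                              # memcmp-based fallback
end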

Seelengrab commented 1 year ago

@oscardssmith which code are you referring to? As far as I can tell, isequal just goes through the usual == and === chain, which should already end up checking the pointers first.

oscardssmith commented 1 year ago

@chrstphrbrns to be clear: the non-identical case is already really fast for our current algorithm, so the only question is to what extent you prioritize a long common prefix (which is the slow case) vs. a different first few chars (the fast case). Julia's current algorithm (which just calls memcmp after checking the lengths and pointers) is really fast for the slow case, but does leave some performance on the table in the somewhat common case of equal-length strings that are short or disagree early on. A really bad example is

s = randstring(100)
s2 = 'a'*s[2:(end)]

julia> @btime isequal(s, s2)
  18.522 ns (0 allocations: 0 bytes)

julia> @btime isequal_fancy(s, s2)
  3.590 ns (0 allocations: 0 bytes)

The best solution to this is probably to add a check of the first 8 bytes before we do the memcmp. This will be somewhat difficult to do because (as @Seelengrab points out) the code here is in C, specifically jl_egal__bitstag in builtins.c and emit_f_is in codegen.cpp.
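
In Julia terms the idea is roughly the following (a sketch only; the real change would have to go into the C code mentioned above, and it assumes unaligned 8-byte loads are cheap, as on x86-64):

# Compare the first 8 bytes as a single UInt64 before calling memcmp.
function isequal_first8(a::String, b::String)
    n = ncodeunits(a)
    n == ncodeunits(b) || return false
    if n >= 8
        prefix_equal = GC.@preserve a b (unsafe_load(Ptr{UInt64}(pointer(a))) ==
                                         unsafe_load(Ptr{UInt64}(pointer(b))))
        prefix_equal || return false
    end
    return 0 == Base._memcmp(a, b, n)
end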

PallHaraldsson commented 1 year ago

> It won't be especially easy because branching on the length will likely eat the nanosecond that you save.

You likely want to branch on lengths longer than 32 (or whatever the cache-line size is), i.e. a branch that is most often not taken, since short strings are common; and branch on the sum of the lengths to avoid one branch.

On my machine for your strings:

julia> @benchmark isequal(s, s2)
BenchmarkTools.Trial: 10000 samples with 991 evaluations.
 Range (min … max):  30.810 ns … 92.918 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     30.882 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   31.916 ns ±  4.367 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

So I'm a bit surprised I can't match it with:

julia> @benchmark Base._memcmp(s, s2, 1000)
BenchmarkTools.Trial: 10000 samples with 993 evaluations.
 Range (min … max):  33.462 ns … 96.616 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     33.829 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   34.612 ns ±  3.219 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

and I've even gone down to 30.459 ns for @btime isequal(s, s2)

I'm even more surprised that I get allocations for the direct ccall (and not for _memcmp, which calls it):

julia> @time ccall(:memcmp, Cint, (Ptr{UInt8}, Ptr{UInt8}, Csize_t), s, s2, 1000)
  0.000014 seconds (2 allocations: 32 bytes)

Most strings are short; and most are unequal in length, so we should definitely optimize for that.

> Modifying isequal_fancy to use isequal as a fallback solves that

isequal uses _memcmp, and that uses libc's memcmp, which may not actually be optimal, nor good in other ways.

One possible alternative:

https://github.com/jart/cosmopolitan/blob/master/libc/str/timingsafe_memcmp.c

> * Running time is independent of the byte sequences compared, making
> * this safe to use for comparing secret values such as cryptographic
> * MACs. In contrast, memcmp() may short-circuit after finding the first
> * differing byte.
>[..]
> *     timingsafe_bcmp ne n=256            27 ps/byte         35,992 mb/s
> *     timingsafe_bcmp eq n=256            27 ps/byte         35,992 mb/s

Her memcmp is the fastest I know of (AVX and SSE optimized):

https://github.com/jart/cosmopolitan/blob/master/libc/intrin/memcmp.c

/**
 * Compares memory byte by byte.
 *
 *     memcmp n=0                         992 picoseconds
 *     memcmp n=1                           1 ns/byte            738 mb/s
 *     memcmp n=2                         661 ps/byte          1,476 mb/s
 *     memcmp n=3                         551 ps/byte          1,771 mb/s
 *     memcmp n=4                         248 ps/byte          3,936 mb/s
 *     memcmp n=5                         198 ps/byte          4,920 mb/s
 *     memcmp n=6                         165 ps/byte          5,904 mb/s
 *     memcmp n=7                         141 ps/byte          6,889 mb/s
 *     memcmp n=8                         124 ps/byte          7,873 mb/s
 *     memcmp n=9                         110 ps/byte          8,857 mb/s
 *     memcmp n=15                         44 ps/byte         22,143 mb/s
 *     memcmp n=16                         41 ps/byte         23,619 mb/s
 *     memcmp n=17                         77 ps/byte         12,547 mb/s
 *     memcmp n=31                         42 ps/byte         22,881 mb/s
 *     memcmp n=32                         41 ps/byte         23,619 mb/s
 *     memcmp n=33                         60 ps/byte         16,238 mb/s
 *     memcmp n=80                         53 ps/byte         18,169 mb/s
 *     memcmp n=128                        38 ps/byte         25,194 mb/s
 *     memcmp n=256                        32 ps/byte         30,233 mb/s
 *     memcmp n=16384                      27 ps/byte         35,885 mb/s
 *     memcmp n=32768                      29 ps/byte         32,851 mb/s
 *     memcmp n=131072                     33 ps/byte         28,983 mb/s

SubStrings are not aligned in memory, but can we rely on the alignment of the beginning of a String (I think yes), and check the end of the string first (it may be more likely to differ, since prefixes of strings are often identical), to get a faster early exit?
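
A sketch of that suffix-first variant (hypothetical, and it ignores alignment, assuming unaligned loads are cheap):

# Compare the last 8 bytes first (suffixes such as file extensions often
# differ even when a long prefix matches), then memcmp the whole string.
function isequal_suffix_first(a::String, b::String)
    n = ncodeunits(a)
    n == ncodeunits(b) || return false
    if n >= 8
        tail_equal = GC.@preserve a b (unsafe_load(Ptr{UInt64}(pointer(a) + n - 8)) ==
                                       unsafe_load(Ptr{UInt64}(pointer(b) + n - 8)))
        tail_equal || return false
    end
    return 0 == Base._memcmp(a, b, n)
end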

oscardssmith commented 1 year ago

One interesting thing to note is that implementing small string optimizations would make this easier to do well, since then small strings would be compared fully with 3 integer comparisons.
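
For example (hypothetical layout, just to show the shape of such a comparison):

# With a small-string optimization, strings of up to 16 bytes could be stored
# inline and compared with three integer comparisons, no memcmp at all.
struct SmallString16
    len::UInt8
    lo::UInt64   # bytes 1..8, zero-padded
    hi::UInt64   # bytes 9..16, zero-padded
end

Base.:(==)(a::SmallString16, b::SmallString16) =
    a.len == b.len && a.lo == b.lo && a.hi == b.hi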

vtjnash commented 1 year ago

This seems to be a recent regression, since the code was fast in v1.9. Looks like the regression was introduced in v1.10 by 56950e27086.

chrstphrbrns commented 1 year ago

> A really bad example is

For some reason, isequal is faster when wrapped in a function, which is why I created isequal_current

julia> s = randstring(100); s2 = 'a'*s[2:(end)];

julia> @btime isequal_current(s,s2)
  5.899 ns (0 allocations: 0 bytes)
false

julia> @btime isequal_fancy(s,s2)
  2.760 ns (0 allocations: 0 bytes)
false
chrstphrbrns commented 1 year ago

#47735

If you have a few bits in the pointer to play with, you could use them as a poor-man's hash for fast exit on equal-length non-equal strings (maybe XOR the first and last byte and take the low bits)
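
For example (hypothetical; String doesn't expose spare pointer bits today, so this just models the tag computation and the fast exit):

# A few tag bits derived from the first and last byte. In the real scheme the
# tag would be computed once at construction and stashed in spare pointer
# bits rather than recomputed on every comparison.
poor_mans_tag(s::String, nbits::Int = 3) =
    isempty(s) ? 0x00 :
    (codeunit(s, 1) ⊻ codeunit(s, ncodeunits(s))) & UInt8((1 << nbits) - 1)

function isequal_tagged(a::String, b::String)
    ncodeunits(a) == ncodeunits(b) || return false
    poor_mans_tag(a) == poor_mans_tag(b) || return false  # fast exit, no byte scan
    return isequal(a, b)
end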

Seelengrab commented 1 year ago

If you only use the first & last byte, you're going to have way too many false positives & negatives. By the pigeonhole principle, using a tiny amount of bits in a tagged pointer style fashion is not going to save you from actually checking the entire string anyway.

oscardssmith commented 1 year ago

It might in a lot of cases. If the strings are roughly randomly distributed, it would tell you the right answer 8191/8192 of the time (although in reality it won't be nearly that good).

Seelengrab commented 1 year ago

That's not really good enough though, is it? This is isequal, not isapprox after all..

chrstphrbrns commented 1 year ago

Even one bit gives you a 50% exit rate for non-equal strings, assuming the low bit is ~random, which is a big win

Another option is to limit String size (which is practically limited anyways), and use the upper bits of the length for a hash

The idea of only using the first and last byte is to not slow down string creation with hash computation
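
Concretely, the length-bits variant could pack a cheap hash (say, first byte XOR last byte) into the top byte of a capped 64-bit length word (hypothetical layout, not how Base stores lengths):

# Lengths capped at 2^56 - 1 code units; the top 8 bits carry the hash, so a
# single integer comparison checks length and hash together.
const LEN_MASK = (UInt64(1) << 56) - 1

pack_len(len::Integer, h8::UInt8) = (UInt64(h8) << 56) | (UInt64(len) & LEN_MASK)
unpack_len(p::UInt64) = Int(p & LEN_MASK)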

Seelengrab commented 1 year ago

The low bit likely isn't random though (I don't think strings are uniformly distributed in practice), and using the first & last byte will thrash caches. I think the suggestion above to fast-path the first 8/16 bytes is going to be better, and should produce much the same effect in terms of inequality checking. Sure, it's a bit biased towards the front, but you can't really avoid some bias either way, and you have to check the whole string in the "equal" case every time. On top of that, shorter strings are more likely than longer strings (IIRC there are already some optimizations in the GC for strings because of that), so biasing to the front seems preferable.

mikmoore commented 1 year ago

This seems related to the example in #48665 where one is attempting to balance the value of an early exit with the value of SIMD and unrolling.

Using dead bits in the pointer to hash strings is one possibility, but anyone else could mangle those bits for a different purpose and cause incorrect comparisons. I don't think it's useful enough for how fragile it is. It would be safer to just add a hash field to the string itself.
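
For instance (a hypothetical wrapper type, not a proposal for Base's String layout):

# Cache a full hash next to the string; equality then rejects mismatches with
# one integer comparison and only scans the bytes when the hashes match.
struct HashedString
    str::String
    h::UInt
    HashedString(s::String) = new(s, hash(s))
end

Base.:(==)(a::HashedString, b::HashedString) =
    a.h == b.h && a.str == b.str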

Different strings are likely to differ early in their data, so in practice one would expect this operation to exit early anyway for different strings. Meanwhile, same-hash strings must be checked exhaustively anyway. Of course one can construct 100k codeunit strings that only differ in the last few characters, but I don't think that's a common use case.

chrstphrbrns commented 1 year ago

This function, which is basically a copy of jl_egal__bitstag logic, is strictly better than isequal, except for very long identical strings (copies, not the same pointer)

julia> isequal_current(a::String,b::String) = isequal(a,b)
isequal_current (generic function with 1 method)

julia> function isequal_extreme(a::String,b::String)
           na = ncodeunits(a)
           nb = ncodeunits(b)
           na != nb && return false
           ccall(:memcmp, Cint, (Ptr{UInt8}, Ptr{UInt8}, Csize_t), pointer(a), pointer(b), na) == 0
       end
isequal_extreme (generic function with 1 method)

julia> s = ["abc", "abcd", "xyz", randstring(50), randstring(50), "longstring"^1000];

julia> function test(f)
           for i in eachindex(s)
               for j in eachindex(s)
                   if i <= j
                       si,sj = (s[i],string(s[j],""))
                       print("(s$i,s$j)", "    ")
                       @btime $f($si,$sj)
                   end
               end
           end
       end
test (generic function with 1 method)
| Inputs | Time (isequal_current) | Time (isequal_extreme) |
| --- | --- | --- |
| (s1,s1) | 6.090 ns | 4.020 ns |
| (s1,s2) | 4.110 ns | 2.840 ns |
| (s1,s3) | 6.120 ns | 4.290 ns |
| (s1,s4) | 4.319 ns | 2.859 ns |
| (s1,s5) | 4.469 ns | 2.859 ns |
| (s1,s6) | 4.230 ns | 2.859 ns |
| (s2,s2) | 5.940 ns | 4.050 ns |
| (s2,s3) | 4.389 ns | 2.859 ns |
| (s2,s4) | 4.340 ns | 2.859 ns |
| (s2,s5) | 4.160 ns | 2.859 ns |
| (s2,s6) | 4.160 ns | 2.840 ns |
| (s3,s3) | 6.190 ns | 4.019 ns |
| (s3,s4) | 4.339 ns | 2.859 ns |
| (s3,s5) | 4.120 ns | 2.840 ns |
| (s3,s6) | 4.130 ns | 2.839 ns |
| (s4,s4) | 5.930 ns | 3.930 ns |
| (s4,s5) | 5.939 ns | 3.960 ns |
| (s4,s6) | 4.139 ns | 2.839 ns |
| (s5,s5) | 5.700 ns | 3.930 ns |
| (s5,s6) | 4.390 ns | 2.839 ns |
| (s6,s6) | 78.192 ns | 79.482 ns |
vtjnash commented 1 year ago

That seems to align with the regression I noted above caused by https://github.com/JuliaLang/julia/commit/56950e270863c0225b860e313451f70defa82c6d

chrstphrbrns commented 1 year ago

> That seems to align with the regression I noted above caused by 56950e2

Here are the results I get from https://github.com/JuliaLang/julia/commit/20b8139ecbc5ae27a005675fcfd6dfd47b34a27c

| Input | Time |
| --- | --- |
| (s1,s1) | 5.709 ns |
| (s1,s2) | 4.090 ns |
| (s1,s3) | 5.670 ns |
| (s1,s4) | 4.230 ns |
| (s1,s5) | 4.060 ns |
| (s1,s6) | 4.070 ns |
| (s2,s2) | 5.530 ns |
| (s2,s3) | 4.250 ns |
| (s2,s4) | 4.119 ns |
| (s2,s5) | 4.090 ns |
| (s2,s6) | 4.090 ns |
| (s3,s3) | 5.709 ns |
| (s3,s4) | 4.069 ns |
| (s3,s5) | 4.110 ns |
| (s3,s6) | 4.080 ns |
| (s4,s4) | 5.610 ns |
| (s4,s5) | 5.440 ns |
| (s4,s6) | 4.080 ns |
| (s5,s5) | 5.660 ns |
| (s5,s6) | 4.279 ns |
| (s6,s6) | 79.349 ns |
vtjnash commented 1 year ago

Interesting. I fail to see how that differs from isequal_extreme, since the code appears identical to me.

chrstphrbrns commented 1 year ago

If you take the https://github.com/JuliaLang/julia/commit/20b8139ecbc5ae27a005675fcfd6dfd47b34a27c code and remove the pointer check, which is relatively expensive for a very special case, you get performance on par with isequal_extreme

function ==(a::String, b::String)
    # pointer_from_objref(a) == pointer_from_objref(b) && return true
    al = sizeof(a)
    return al == sizeof(b) && 0 == _memcmp(a, b, al)
end
julia> test(isequal_current)
(s1,s1)      4.000 ns (0 allocations: 0 bytes)
(s1,s2)      2.830 ns (0 allocations: 0 bytes)
(s1,s3)      4.230 ns (0 allocations: 0 bytes)
(s1,s4)      2.820 ns (0 allocations: 0 bytes)
(s1,s5)      2.820 ns (0 allocations: 0 bytes)
(s1,s6)      2.820 ns (0 allocations: 0 bytes)
(s2,s2)      4.000 ns (0 allocations: 0 bytes)
PallHaraldsson commented 1 year ago

All strings are currently allocated on the heap, which is why pointer_from_objref works. In the future they might not be: I have on my TODO list to make a better string type whose short strings would not live there (InlineStrings does that already and is stack-allocated, in my idea potentially in a register, so there would be no pointer to be had), and only longer strings would use a pointer to the heap, which would be NULL/ignored otherwise.

So I do support dropping the pointer-equality test (at least eventually). While it seems rather useless now, I do note that when the strings are the same object it is fast, much faster than any pointer indirection (I'm just not sure we need that case to be fast; would any real-world code check a string for equality with itself much?). Also, the benchmark is flawed: it will not show the comparison to be much slower, since all the strings will be hot in cache.

I really like the idea of hashing on the length and part(s) of the string (with the hash stored in a few bits of a tagged pointer), i.e. this idea, though slightly modified:

> The idea of only using the first and last byte is to not slow down string creation with hash computation

Some common strings to be aware of are paths like "C:\long\path\file.jpeg": you want to avoid the (possibly) long common prefix when hashing, and also the short file extension (longest about 5 characters?).

julia> almost_end_char(s) = @inbounds s[sizeof(s) - 5]  # This would need to be amended to handle tiny strings, or empty...

julia> @btime almost_end_char(raw"C:\long\path\file.jpeg")
  3.728 ns (0 allocations: 0 bytes)
'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)

It's actually a bit slower than I expected; I wasn't expecting a conditional jump after I eliminated the bounds check, nor this:

    movabsq $j_getindex_continued_1586, %rcx
    movl    %eax, %edx
    callq   *%rcx