chrstphrbrns opened this issue 1 year ago
Function | (s, s) [ns] | (s, s_same) [ns] | (s, s_different) [ns] | (s, s_longer) [ns] |
---|---|---|---|---|
isequal_current | 2.589 | 6.569 | 6.570 | 4.469 |
isequal_simple | 4.939 | 5.210 | 4.419 | 4.000 |
isequal_fancy | 5.240 | 5.500 | 2.830 | 3.079 |
What exactly is the issue here? String comparisons have to handle many more different cases than just lengths 3 & 4. It's a balancing act between longer & shorter lengths, and if we're one nanosecond slower on such short strings but faster on longer ones, I'd say that's a worthwhile tradeoff.
It does seem like switching the check for egal and length might be worthwhile. It seems very plausible to me that string lengths are different more often than two identical strings are compared. (although obviously the performance difference will be minuscule)
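For concreteness, the reordering being discussed might look like this (a sketch only, not Julia's actual implementation; `isequal_lenfirst` is a made-up name, and `Base._memcmp` is the internal helper used elsewhere in this thread):

```julia
# Length check first (cheap, and usually decisive for real-world data),
# egal/pointer check second.
function isequal_lenfirst(a::String, b::String)
    sizeof(a) != sizeof(b) && return false  # most unequal strings exit here
    a === b && return true                  # identical object: skip the scan
    return 0 == Base._memcmp(a, b, sizeof(a))
end
```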
That's a fair point. Is there any evidence that the current implementation is optimal (given whatever assumptions)?
Even as I experiment with longer strings, it seems the only case where the current implementation wins is when the strings are identical, and that's presumably just by doing a pointer check, which the others could do as well
julia> s = randstring(1000)
"Fl75lZFq7eXmrj2vSEkDouDStaFKJrQiyFICA21qMWdhFBguSrPDn8BPDnuY1ZhDM9gEJJygRl0o3edyCiYodYPTqpZ2yg4UYnJl7ZtdOsSI9ri0PbY7t5vqwbOfn6IJulX3xfdSX2UZSxPhMH0TMVa7trpsjfm8wOskcicVjtzC5qByBlUlH82YSeeWVyQF07xN9wTI8MgKIDnMSoMV4pvJeDgCLqiDZBIzDa3hyyeGIZSBvsrjFyiTqJm7Oxb8eWNhrHnO2FchzlaWbtWJAdWbJa5Ep7cEq428fyu4JPvMDkjuVFcMk2hhf4uF4kfvG76fzLGZG9awOlCwAbwC8WiR8Y4gufxwx3CCrPFbDYDQ2hV2Sk4HLWfe4SZLnZ75ov2fDP26aw1IX3fJSTypUdFS0IRhegB9tK" ⋯ 165 bytes ⋯ "Kmgw4w9lJiPB5a0IWCBv6YCx5ysCTg1x7HpWZivLBcCRHjBtMguqqPdLQdl5t1JbLG4qamfTrMShNFvnRLcK4SeICQQ9GFchAPs6gWnhbExiUhoJdK9cREzDC2olztLPmfZfGnNjC8Jwg3a2yklzhYqB3RymXb4qf7J6SlxCrBPIQD1m01Ym4KsKKMh2H3UWRrd5rPlHnjZHGAjtj6oG7kRACXTQ1gGfsgFgZ51A44KVxzu4pyL9ESWXlCCqLyTLij56KQVvKveM1ZEWRDCsGd3heMkWc9oPl7ri2O6tin9SlExNTfD0gxclp93j8zvEPBEH4rjxhLT2wNv2HOqQzLPj52raa5UL3zLl7YyvGGkIzSFvh1tfd0wYPPKoHrY6TQqktHEEVBG5MSuYIgXpQ7DanDOaXQQlX"
julia> s2 = s[1:500]*'a'*s[502:end]
"Fl75lZFq7eXmrj2vSEkDouDStaFKJrQiyFICA21qMWdhFBguSrPDn8BPDnuY1ZhDM9gEJJygRl0o3edyCiYodYPTqpZ2yg4UYnJl7ZtdOsSI9ri0PbY7t5vqwbOfn6IJulX3xfdSX2UZSxPhMH0TMVa7trpsjfm8wOskcicVjtzC5qByBlUlH82YSeeWVyQF07xN9wTI8MgKIDnMSoMV4pvJeDgCLqiDZBIzDa3hyyeGIZSBvsrjFyiTqJm7Oxb8eWNhrHnO2FchzlaWbtWJAdWbJa5Ep7cEq428fyu4JPvMDkjuVFcMk2hhf4uF4kfvG76fzLGZG9awOlCwAbwC8WiR8Y4gufxwx3CCrPFbDYDQ2hV2Sk4HLWfe4SZLnZ75ov2fDP26aw1IX3fJSTypUdFS0IRhegB9tK" ⋯ 165 bytes ⋯ "Kmgw4w9lJiPB5a0IWCBv6YCx5ysCTg1x7HpWZivLBcCRHjBtMguqqPdLQdl5t1JbLG4qamfTrMShNFvnRLcK4SeICQQ9GFchAPs6gWnhbExiUhoJdK9cREzDC2olztLPmfZfGnNjC8Jwg3a2yklzhYqB3RymXb4qf7J6SlxCrBPIQD1m01Ym4KsKKMh2H3UWRrd5rPlHnjZHGAjtj6oG7kRACXTQ1gGfsgFgZ51A44KVxzu4pyL9ESWXlCCqLyTLij56KQVvKveM1ZEWRDCsGd3heMkWc9oPl7ri2O6tin9SlExNTfD0gxclp93j8zvEPBEH4rjxhLT2wNv2HOqQzLPj52raa5UL3zLl7YyvGGkIzSFvh1tfd0wYPPKoHrY6TQqktHEEVBG5MSuYIgXpQ7DanDOaXQQlX"
julia> length(s2)
1000
julia> @btime isequal_simple(s, s2)
269.955 ns (0 allocations: 0 bytes)
false
julia> @btime isequal(s, s2)
16.542 ns (0 allocations: 0 bytes)
false
julia> @btime isequal_fancy(s, s2)
270.984 ns (0 allocations: 0 bytes)
false
It looks like the cutoff is around 20 characters. I think it's probable that most strings are under 20 chars so it might be worth seeing if we could save that nanosecond for short strings while keeping the current speed for long ones. It won't be especially easy because branching on the length will likely eat the nanosecond that you save.
@oscardssmith has indeed met my challenge, but it's a very special case of identical length + long common prefix
Modifying `isequal_fancy` to use `isequal` as a fallback solves that, making it a tiny bit slower for that special case, but still significantly faster for other cases (if only by nanoseconds of absolute time)
@oscardssmith which code are you referring to? As far as I can tell, `isequal` just goes through the usual `==` and `===` chain, which should already end up checking the pointers first.
@chrstphrbrns to be clear: the non-identical case is already really fast for our current algorithm, so the only question is to what extent you prioritize a long common prefix (which is the slow case) vs. a different first few chars (the fast case). Julia's current algorithm (which just calls `memcmp` after checking the lengths and pointers) is really fast for the slow case, but does leave some performance on the table in the somewhat common case of equal-length strings that are short or disagree early on. A really bad example is
julia> s = randstring(100); s2 = 'a'*s[2:end];
julia> @btime isequal(s, s2)
18.522 ns (0 allocations: 0 bytes)
julia> @btime isequal_fancy(s, s2)
3.590 ns (0 allocations: 0 bytes)
The best solution to this is probably to add a check for the first 8 bytes before we do the `memcmp`. This will be somewhat difficult to do because (as @Seelengrab points out) the code here is in C, specifically `jl_egal__bitstag` in builtins.c and `emit_f_is` in codegen.cpp.
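In Julia-level pseudocode the idea could be sketched as follows (illustrative only; the real change would have to go into the C routines named above, and `isequal_prefix8` is a hypothetical name):

```julia
# Sketch of "compare the first 8 bytes before calling memcmp".
# Assumes heap-allocated String data is at least 8-byte aligned;
# otherwise the load below would need to be an unaligned load.
function isequal_prefix8(a::String, b::String)
    n = sizeof(a)
    n == sizeof(b) || return false
    a === b && return true
    if n >= 8
        # GC.@preserve keeps both strings alive across the raw pointer reads
        differ = GC.@preserve a b begin
            unsafe_load(Ptr{UInt64}(pointer(a))) != unsafe_load(Ptr{UInt64}(pointer(b)))
        end
        differ && return false   # strings disagree within the first 8 bytes
    end
    return 0 == Base._memcmp(a, b, n)
end
```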
> It won't be especially easy because branching on the length will likely eat the nanosecond that you save.
You likely want to branch on length longer than 32 (or whatever the cache-line size is), i.e. a most-often-non-taken branch, since short strings are common; and branch on the sum of the lengths to avoid one branch.
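A minimal sketch of that sum-of-lengths trick, assuming 32 bytes as the short/long threshold (the name `isequal_sumlen` is made up):

```julia
# One branch on la + lb classifies "both strings short" (the common,
# well-predicted path) vs "at least one long string" (the rare path),
# instead of branching on each length separately.
function isequal_sumlen(a::String, b::String)
    la, lb = sizeof(a), sizeof(b)
    if la + lb <= 32                 # usually taken: both strings are short
        la != lb && return false
        return 0 == Base._memcmp(a, b, la)
    end
    # rare path: at least one string is longer than the threshold
    return la == lb && (a === b || 0 == Base._memcmp(a, b, la))
end
```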
On my machine for your strings:
julia> @benchmark isequal(s, s2)
BenchmarkTools.Trial: 10000 samples with 991 evaluations.
 Range (min … max):  30.810 ns … 92.918 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     30.882 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   31.916 ns ±  4.367 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
So I'm a bit surprised I can't match it with:
julia> @benchmark Base._memcmp(s, s2, 1000)
BenchmarkTools.Trial: 10000 samples with 993 evaluations.
 Range (min … max):  33.462 ns … 96.616 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     33.829 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   34.612 ns ±  3.219 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
and I've even gone down to 30.459 ns for @btime isequal(s, s2)
I'm even more surprised I get allocations for the direct `ccall` of `memcmp` (and not for `_memcmp`, which calls it):
julia> @time ccall(:memcmp, Cint, (Ptr{UInt8}, Ptr{UInt8}, Csize_t), s, s2, 1000)
0.000014 seconds (2 allocations: 32 bytes)
Most strings are short; and most are unequal in length, so we should definitely optimize for that.
> Modifying isequal_fancy to use isequal as a fallback solves that
`isequal` uses `_memcmp`, and that uses libc's `memcmp`, which may not actually be optimal, nor good in other ways.
One possible alternative:
https://github.com/jart/cosmopolitan/blob/master/libc/str/timingsafe_memcmp.c
> * Running time is independent of the byte sequences compared, making
> * this safe to use for comparing secret values such as cryptographic
> * MACs. In contrast, memcmp() may short-circuit after finding the first
> * differing byte.
>[..]
> * timingsafe_bcmp ne n=256 27 ps/byte 35,992 mb/s
> * timingsafe_bcmp eq n=256 27 ps/byte 35,992 mb/s
Her memcmp is the fastest I know of (AVX and SSE optimized):
https://github.com/jart/cosmopolitan/blob/master/libc/intrin/memcmp.c
/**
* Compares memory byte by byte.
*
* memcmp n=0 992 picoseconds
* memcmp n=1 1 ns/byte 738 mb/s
* memcmp n=2 661 ps/byte 1,476 mb/s
* memcmp n=3 551 ps/byte 1,771 mb/s
* memcmp n=4 248 ps/byte 3,936 mb/s
* memcmp n=5 198 ps/byte 4,920 mb/s
* memcmp n=6 165 ps/byte 5,904 mb/s
* memcmp n=7 141 ps/byte 6,889 mb/s
* memcmp n=8 124 ps/byte 7,873 mb/s
* memcmp n=9 110 ps/byte 8,857 mb/s
* memcmp n=15 44 ps/byte 22,143 mb/s
* memcmp n=16 41 ps/byte 23,619 mb/s
* memcmp n=17 77 ps/byte 12,547 mb/s
* memcmp n=31 42 ps/byte 22,881 mb/s
* memcmp n=32 41 ps/byte 23,619 mb/s
* memcmp n=33 60 ps/byte 16,238 mb/s
* memcmp n=80 53 ps/byte 18,169 mb/s
* memcmp n=128 38 ps/byte 25,194 mb/s
* memcmp n=256 32 ps/byte 30,233 mb/s
* memcmp n=16384 27 ps/byte 35,885 mb/s
* memcmp n=32768 29 ps/byte 32,851 mb/s
* memcmp n=131072 33 ps/byte 28,983 mb/s
`SubString`s are not aligned in memory, but can we rely on alignment of the beginning of a `String` (I think yes), and check the end of the string first (the end may be more likely to differ, since prefixes of strings are often identical), to get a speedup?
One interesting thing to note is that implementing small-string optimizations would make this easier to do well, since then small strings would be compared fully with 3 integer comparisons.
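A rough illustration of the "3 integer comparisons" point, using a hypothetical 16-byte inline layout (`SmallStr` is not a real Julia type; real small-string optimization would live in the runtime):

```julia
# Hypothetical inline small string: a length word plus 16 data bytes,
# zero-padded, stored as two machine words.
struct SmallStr
    len::UInt64
    w1::UInt64   # bytes 1..8
    w2::UInt64   # bytes 9..16
end

function SmallStr(s::String)
    n = sizeof(s)
    n <= 16 || throw(ArgumentError("too long for SmallStr"))
    buf = zeros(UInt8, 16)                    # zero padding makes == exact
    copyto!(buf, 1, codeunits(s), 1, n)
    words = reinterpret(UInt64, buf)
    return SmallStr(UInt64(n), words[1], words[2])
end

# Full equality in exactly three integer comparisons: no pointers, no memcmp.
Base.:(==)(a::SmallStr, b::SmallStr) = a.len == b.len && a.w1 == b.w1 && a.w2 == b.w2
```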
This seems to be a recent regression, since the code was fast in v1.9. Looks like the regression was introduced in v1.10 by 56950e27086.
> A really bad example is
For some reason, `isequal` is faster when wrapped in a function, which is why I created `isequal_current`:
julia> s = randstring(100); s2 = 'a'*s[2:(end)];
julia> @btime isequal_current(s,s2)
5.899 ns (0 allocations: 0 bytes)
false
julia> @btime isequal_fancy(s,s2)
2.760 ns (0 allocations: 0 bytes)
false
If you have a few bits in the pointer to play with, you could use them as a poor-man's hash for fast exit on equal-length non-equal strings (maybe XOR the first and last byte and take the low bits)
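As an illustration of the proposed early exit (hypothetical helper names; in the real proposal the tag would be precomputed and stashed in spare pointer bits at string creation, not recomputed per comparison as it is here):

```julia
# Hypothetical 3-bit tag: XOR of first and last byte, low bits only.
strtag(s::String) = isempty(s) ? 0x00 :
    (codeunit(s, 1) ⊻ codeunit(s, ncodeunits(s))) & 0x07

function isequal_tagged(a::String, b::String)
    sizeof(a) != sizeof(b) && return false
    strtag(a) != strtag(b) && return false   # fast exit without scanning the body
    # tags match: still must compare exhaustively (tags can collide)
    return 0 == Base._memcmp(a, b, sizeof(a))
end
```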
If you only use the first & last byte, you're going to have way too many false positives & negatives. By the pigeonhole principle, using a tiny amount of bits in a tagged pointer style fashion is not going to save you from actually checking the entire string anyway.
It might in a lot of cases. If the strings are roughly randomly distributed, it would tell you the right answer 8191/8192 of the time (although in reality it won't be nearly that good).
That's not really good enough though, is it? This is `isequal`, not `isapprox`, after all…
Even one bit gives you a 50% exit rate for non-equal strings, assuming the low bit is ~random, which is a big win
Another option is to limit String size (which is practically limited anyways), and use the upper bits of the length for a hash
The idea of only using the first and last byte is to not slow down string creation with hash computation
The low bit likely isn't random though (I don't think strings are uniformly distributed in practice), and using first & last will thrash caches. I think the suggestion to fastpath the first 8/16 bytes above is going to be better, and should produce much the same effect in terms of inequality checking. Sure, it's a bit biased towards the front, but you can't really avoid some bias either way, and you have to check the whole string in the "equal" case every time. On top of that, shorter strings are more likely than longer strings (IIRC there are some optimizations in GC for strings because of that already), so biasing to the front seems preferable.
This seems related to the example in #48665 where one is attempting to balance the value of an early exit with the value of SIMD and unrolling.
Using dead bits in the pointer to hash strings is one possibility, but anyone else could mangle those bits for a different purpose and cause incorrect comparisons. I don't think it's useful enough for how fragile it is. It would be safer to just add a hash field to the string itself.
Different strings are likely to differ early in their data, so in practice one would expect this operation to exit early anyway for different strings. Meanwhile, same-hash strings must be checked exhaustively anyway. Of course one can construct 100k codeunit strings that only differ in the last few characters, but I don't think that's a common use case.
This function, which is basically a copy of the `jl_egal__bitstag` logic, is strictly better than `isequal`, except for very long identical strings (copies, not the same pointer):
julia> isequal_current(a::String,b::String) = isequal(a,b)
isequal_current (generic function with 1 method)
julia> function isequal_extreme(a::String,b::String)
na = ncodeunits(a)
nb = ncodeunits(b)
na != nb && return false
ccall(:memcmp, Cint, (Ptr{UInt8}, Ptr{UInt8}, Csize_t), pointer(a), pointer(b), na) == 0
end
isequal_extreme (generic function with 1 method)
julia> s = ["abc", "abcd", "xyz", randstring(50), randstring(50), "longstring"^1000];
julia> function test(f)
for i in eachindex(s)
for j in eachindex(s)
if i <= j
si,sj = (s[i],string(s[j],""))
print("(s$i,s$j)", " ")
@btime $f($si,$sj)
end
end
end
end
test (generic function with 1 method)
Inputs | Time (isequal_current) | Time (isequal_extreme) |
---|---|---|
(s1,s1) | 6.090 ns | 4.020 ns |
(s1,s2) | 4.110 ns | 2.840 ns |
(s1,s3) | 6.120 ns | 4.290 ns |
(s1,s4) | 4.319 ns | 2.859 ns |
(s1,s5) | 4.469 ns | 2.859 ns |
(s1,s6) | 4.230 ns | 2.859 ns |
(s2,s2) | 5.940 ns | 4.050 ns |
(s2,s3) | 4.389 ns | 2.859 ns |
(s2,s4) | 4.340 ns | 2.859 ns |
(s2,s5) | 4.160 ns | 2.859 ns |
(s2,s6) | 4.160 ns | 2.840 ns |
(s3,s3) | 6.190 ns | 4.019 ns |
(s3,s4) | 4.339 ns | 2.859 ns |
(s3,s5) | 4.120 ns | 2.840 ns |
(s3,s6) | 4.130 ns | 2.839 ns |
(s4,s4) | 5.930 ns | 3.930 ns |
(s4,s5) | 5.939 ns | 3.960 ns |
(s4,s6) | 4.139 ns | 2.839 ns |
(s5,s5) | 5.700 ns | 3.930 ns |
(s5,s6) | 4.390 ns | 2.839 ns |
(s6,s6) | 78.192 ns | 79.482 ns |
That seems to align with the regression I noted above caused by https://github.com/JuliaLang/julia/commit/56950e270863c0225b860e313451f70defa82c6d
Here are the results I get from https://github.com/JuliaLang/julia/commit/20b8139ecbc5ae27a005675fcfd6dfd47b34a27c
Input | Time (ns) |
---|---|
(s1,s1) | 5.709 ns |
(s1,s2) | 4.090 ns |
(s1,s3) | 5.670 ns |
(s1,s4) | 4.230 ns |
(s1,s5) | 4.060 ns |
(s1,s6) | 4.070 ns |
(s2,s2) | 5.530 ns |
(s2,s3) | 4.250 ns |
(s2,s4) | 4.119 ns |
(s2,s5) | 4.090 ns |
(s2,s6) | 4.090 ns |
(s3,s3) | 5.709 ns |
(s3,s4) | 4.069 ns |
(s3,s5) | 4.110 ns |
(s3,s6) | 4.080 ns |
(s4,s4) | 5.610 ns |
(s4,s5) | 5.440 ns |
(s4,s6) | 4.080 ns |
(s5,s5) | 5.660 ns |
(s5,s6) | 4.279 ns |
(s6,s6) | 79.349 ns |
Interesting. I fail to see how that is different from `isequal_extreme`, since the code appears identical to me.
If you take the https://github.com/JuliaLang/julia/commit/20b8139ecbc5ae27a005675fcfd6dfd47b34a27c code and remove the pointer check, which is relatively expensive for a very special case, you get performance on par with `isequal_extreme`:
function ==(a::String, b::String)
# pointer_from_objref(a) == pointer_from_objref(b) && return true
al = sizeof(a)
return al == sizeof(b) && 0 == _memcmp(a, b, al)
end
julia> test(isequal_current)
(s1,s1) 4.000 ns (0 allocations: 0 bytes)
(s1,s2) 2.830 ns (0 allocations: 0 bytes)
(s1,s3) 4.230 ns (0 allocations: 0 bytes)
(s1,s4) 2.820 ns (0 allocations: 0 bytes)
(s1,s5) 2.820 ns (0 allocations: 0 bytes)
(s1,s6) 2.820 ns (0 allocations: 0 bytes)
(s2,s2) 4.000 ns (0 allocations: 0 bytes)
All strings are now allocated on the heap, which is why `pointer_from_objref` works. In the future they might not be; I have on my TODO list to make a better string type that would not live there for short strings (`InlineString`s do that already and are stack-allocated, in my idea potentially in a register, so there'd be no pointer to be had), with a pointer to the heap used only for longer strings, NULL/ignored otherwise.
So I do support the pointer equality test going away (at least eventually). While it seems rather useless now, I do note that when strings are the same it works fast (I'm just not sure we need it fast; would any real-world code check equality with self much?), much faster than any pointer indirection. Also, the benchmark is flawed and will not show it to be much slower, since all the strings will be hot in cache.
I really like the idea of hashing (stored in a few bits of a tagged pointer) on the length and part(s) of the string, i.e. this idea, though slightly modified:
> The idea of only using the first and last byte is to not slow down string creation with hash computation
Some common strings to be aware of would be of the type "C:\long\path\file.jpeg", where you want to avoid the (possibly) large common prefix when hashing; the file ending is short, the longest common ones about 5 characters?
julia> almost_end_char(s) = @inbounds s[sizeof(s) - 5] # This would need to be amended to handle tiny strings, or empty...
julia> @btime almost_end_char(raw"C:\long\path\file.jpeg")
3.728 ns (0 allocations: 0 bytes)
'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
It's actually a bit slower than I expected; I wasn't expecting a conditional jump (after I eliminated the bounds check), nor this:
movabsq $j_getindex_continued_1586, %rcx
movl %eax, %edx
callq *%rcx