straight-shoota opened this issue 2 years ago
I think the case of IO is slightly different, because unless we have at least 4 bytes of buffered data the algorithm has to go byte by byte. But if we do have 4 bytes in the peek buffer, something more optimal than what we have right now could be done.
Then, yes, I think we should have a place with an algorithm that takes a slice of bytes and returns the UTF-8 codepoint and how many bytes it consumed, or an error otherwise.
Also, to use that efficient algorithm you linked, we have to have a way to put that table into ROM, but there's no way we can do that right now (we need const [...])
> unless we have at least 4 bytes of buffered data the algorithm has to go byte by byte.
The example algorithm I referenced consumes a single byte, and the status indicates whether there was an error, a sequence was complete, or more bytes need to be consumed. This is necessary when reading bytes from a stream, but also for bounds checking when a string (with potentially invalid UTF-8 encoding) is not null-terminated.
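To illustrate the shape of such a per-byte interface, here is a deliberately simplified sketch (the `decode_step`/`read_char_from` names and the `UTF8_ACCEPT`/`UTF8_REJECT` sentinels are made up for illustration, and unlike the linked table-driven decoder it does not reject overlong encodings or surrogates):

```crystal
UTF8_ACCEPT = 0_u32           # a complete codepoint has been decoded
UTF8_REJECT = 0xFFFF_FFFF_u32 # the byte sequence is invalid

# One DFA step: the state counts the continuation bytes still needed.
def decode_step(state : UInt32, codepoint : UInt32, byte : UInt8) : {UInt32, UInt32}
  if state == UTF8_ACCEPT
    case byte
    when 0x00..0x7F then {UTF8_ACCEPT, byte.to_u32}
    when 0xC0..0xDF then {1_u32, byte.to_u32 & 0x1F_u32}
    when 0xE0..0xEF then {2_u32, byte.to_u32 & 0x0F_u32}
    when 0xF0..0xF7 then {3_u32, byte.to_u32 & 0x07_u32}
    else                 {UTF8_REJECT, 0_u32}
    end
  elsif 0x80 <= byte <= 0xBF
    {state - 1_u32, (codepoint << 6) | (byte.to_u32 & 0x3F_u32)}
  else
    {UTF8_REJECT, 0_u32}
  end
end

# How an IO-based reader could drive it, one byte at a time, no peeking needed.
def read_char_from(io : IO) : Char?
  state = UTF8_ACCEPT
  codepoint = 0_u32
  while byte = io.read_byte
    state, codepoint = decode_step(state, codepoint, byte)
    return codepoint.unsafe_chr if state == UTF8_ACCEPT
    return Char::REPLACEMENT if state == UTF8_REJECT
    # otherwise more continuation bytes are needed, keep reading
  end
  nil # EOF (a truncated trailing sequence is silently dropped here)
end
```

An `IO#read_char_with_bytesize` replacement would additionally have to track how many bytes were consumed; the sketch only shows the control flow.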
Couldn't the lookup tables be put into a class variable on String or something? Or does that have too much overhead?
> It is my understanding there are far more efficient algorithms for UTF-8 decoding.
I looked into this a while ago, and the strategy that is currently used (at least for String UTF-8 validation) is actually supposed to be faster than a strategy based on a single lookup table, which seems to be the one linked in the description - or at least possible to make faster. If I remember correctly, the reason was that the CPU can parallelize the execution of the instructions better, even if it does execute a few more of them. But it was a while ago I read about it, so I may misremember.
However, there are some modern approaches which are supposed to improve things a bit by using two tables instead of one. Recommended references: https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/ and https://twitter.com/pervognsen/status/1410488394535243778 (which then points to https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725 for actual code)
Hmm. I tried profiling the following program:
long_string = "a" * (2**30)
10.times { p long_string.valid_encoding? }
This takes 24 seconds.
With the following monkeypatch, this reduces to 5.4s (compiled with `-release --mcpu znver2`):
```crystal
struct Char::Reader
  @[AlwaysInline]
  def next_char : Char
    @pos &+= @current_char_width
    if @pos > @string.bytesize
      raise IndexError.new
    end
    decode_current_char
  end
end
```
The bulk of the change is due to the `@[AlwaysInline]` line, but the `&+=` also gives a second or two. The `--mcpu` flag is also worth a few seconds - I suppose it enables some vector instructions.
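For context on the `&+=` part: Crystal's plain arithmetic operators check for overflow, while the wrapping variants skip that check, which is presumably where the extra second or two comes from:

```crystal
x = Int32::MAX
p x &+ 1 # => -2147483648 (wrapping add, no overflow check)
# p x + 1 # would raise OverflowError; that check is what `&+=` avoids in the hot loop
```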
That is a quite ridiculous difference. Unfortunately I have no idea what actually happens codegen-wise, as the `@[AlwaysInline]` seems to make the DWARF row number information simply disappear somehow, in a way I haven't seen before with regular compiler-initiated inlining. I don't know if it is due to the annotation or due to the optimizations that happen.
> https://twitter.com/pervognsen/status/1410488394535243778 (which then points to https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725 for actual code)
Here is a rather straightforward Crystal port of this as a new `String#valid_encoding?`, without the loop unrolling:
```crystal
class String
  VALID_ENCODING_TABLE = begin
    x = Array(UInt64).new(256)

    # the same DFA transition table, with state 1 hidden:
    #
    # accepted
    # | 1 continuation byte left
    # | | 2 continuation bytes left
    # | | | E0-?? ??; disallow overlong encodings up to U+07FF
    # | | | | ED-?? ??; disallow surrogate pairs
    # | | | | | F0-?? ?? ??; disallow overlong encodings up to U+FFFF
    # | | | | | | 3 continuation bytes left
    # | | | | | | | F4-?? ?? ??; disallow codepoints above U+10FFFF
    # v v v v v v v v
    #
    #            | 0 2 3 4 5 6 7 8
    # -----------+----------------
    # 0x00..0x7F | 0 _ _ _ _ _ _ _
    # 0x80..0x8F | _ 0 2 _ 2 _ 3 3
    # 0x90..0x9F | _ 0 2 _ 2 3 3 _
    # 0xA0..0xBF | _ 0 2 2 _ 3 3 _
    # 0xC2..0xDF | 2 _ _ _ _ _ _ _
    # 0xE0..0xE0 | 4 _ _ _ _ _ _ _
    # 0xE1..0xEC | 3 _ _ _ _ _ _ _
    # 0xED..0xED | 5 _ _ _ _ _ _ _
    # 0xEE..0xEF | 3 _ _ _ _ _ _ _
    # 0xF0..0xF0 | 6 _ _ _ _ _ _ _
    # 0xF1..0xF3 | 7 _ _ _ _ _ _ _
    # 0xF4..0xF4 | 8 _ _ _ _ _ _ _
    {% for ch in 0x00..0x7F %} put(x, dfa_state(0, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0x80..0x8F %} put(x, dfa_state(1, 1, 0, 2, 1, 2, 1, 3, 3)); {% end %}
    {% for ch in 0x90..0x9F %} put(x, dfa_state(1, 1, 0, 2, 1, 2, 3, 3, 1)); {% end %}
    {% for ch in 0xA0..0xBF %} put(x, dfa_state(1, 1, 0, 2, 2, 1, 3, 3, 1)); {% end %}
    {% for ch in 0xC0..0xC1 %} put(x, dfa_state(1, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xC2..0xDF %} put(x, dfa_state(2, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xE0..0xE0 %} put(x, dfa_state(4, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xE1..0xEC %} put(x, dfa_state(3, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xED..0xED %} put(x, dfa_state(5, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xEE..0xEF %} put(x, dfa_state(3, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xF0..0xF0 %} put(x, dfa_state(6, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xF1..0xF3 %} put(x, dfa_state(7, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xF4..0xF4 %} put(x, dfa_state(8, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xF5..0xFF %} put(x, dfa_state(1, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    x
  end

  private def self.put(array, x)
    array << x
  end

  # packs the 9 transitions into one UInt64, 6 bits per state; each 6-bit slot
  # stores `next_state * 6` so it can be used directly as the shift amount for
  # the next table lookup
  private macro dfa_state(*transitions)
    {% x = 0_u64 %}
    {% for tr, i in transitions %}
      {% x |= (1_u64 << (i * 6)) * tr * 6 %}
    {% end %}
    {{ x }}
  end

  def valid_encoding2? : Bool
    state = 0_u64
    table = VALID_ENCODING_TABLE.to_unsafe
    s = to_unsafe
    e = s + bytesize
    while s < e
      state = table[s.value].unsafe_shr(state & 0x3F)
      return false if state & 0x3F == 6
      s += 1
    end
    state & 0x3F == 0
  end
end
```
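For a quick sanity check, the monkeypatch can be exercised on a few hand-picked byte sequences (arbitrary examples, not taken from any benchmark):

```crystal
"héllo wörld".valid_encoding2?                       # => true
String.new(Bytes[0xE2, 0x28, 0xA1]).valid_encoding2? # => false (0x28 is not a continuation byte)
String.new(Bytes[0xED, 0xA0, 0x80]).valid_encoding2? # => false (UTF-16 surrogate, caught by state 5)
```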
Here is a global fun that does the same thing:
```crystal
@[NoInline]
fun foo_valid_encoding(str : Void*) : Bool
  str = str.as(String*)
  state = 0_u64
  table = String::VALID_ENCODING_TABLE.to_unsafe
  s = str.value.to_unsafe
  e = s + str.value.bytesize
  while s < e
    state = table[s.value].unsafe_shr(state & 0x3F)
    return false if state & 0x3F == 6
    s += 1
  end
  state & 0x3F == 0
end
```
Its generated assembly under `--mcpu=native` is:
```asm
foo_valid_encoding:
        movq (%rdi), %rax
        movslq 4(%rax), %rdx
        testq %rdx, %rdx
        jle .LBB299_1
        movq "String::VALID_ENCODING_TABLE"(%rip), %rcx
        movq 16(%rcx), %rcx
        addq %rax, %rdx
        addq $12, %rdx
        addq $12, %rax
        xorl %esi, %esi
.LBB299_4:
        movzbl (%rax), %edi
        shrxq %rsi, (%rcx,%rdi,8), %rsi
        andl $63, %esi
        cmpq $6, %rsi
        je .LBB299_5
        incq %rax
        cmpq %rdx, %rax
        jb .LBB299_4
        testq %rsi, %rsi
        sete %al
        retq
.LBB299_1:
        movb $1, %al
        retq
.LBB299_5:
        xorl %eax, %eax
        retq
```
It appears to be really fast, since it reproduces the `shrxq %rsi, (%rcx,%rdi,8), %rsi` line. It should not matter whether the DFA table is placed in read-only memory or not. In practice some kind of recovery mechanism is also needed to perform correct backtracking in case of errors.
The unrolling only makes sense in the context of `#valid_encoding?`. If we do anything else within the loop, its performance improvements will most likely diminish.
I don't think the SIMD example is of much relevance, because it is still restricted to `#valid_encoding?` and the compiler offers no SIMD support at the moment.
And here is an optimized version of `#each_char(&)`, borrowing ideas from both DFA examples:
```crystal
class String
  VALID_ENCODING_TABLE = begin
    x = Array(UInt64).new(256)

    # each transition is also associated with a mask that indicates which bits
    # of the current byte are used to form a UTF-8 codepoint; this is possible
    # because every legal UTF-8 byte uniquely determines whether it is a
    # leading or a trailing byte, and in the former case, the number of
    # trailing bytes; thus only bits 54 and 55 remain unused in the DFA
    # representation
    {% for ch in 0x00..0x7F %} put(x, dfa_state(0, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x7F)); {% end %}
    {% for ch in 0x80..0x8F %} put(x, dfa_state(1, 1, 0, 2, 1, 2, 1, 3, 3, mask: 0x3F)); {% end %}
    {% for ch in 0x90..0x9F %} put(x, dfa_state(1, 1, 0, 2, 1, 2, 3, 3, 1, mask: 0x3F)); {% end %}
    {% for ch in 0xA0..0xBF %} put(x, dfa_state(1, 1, 0, 2, 2, 1, 3, 3, 1, mask: 0x3F)); {% end %}
    {% for ch in 0xC0..0xC1 %} put(x, dfa_state(1, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x00)); {% end %}
    {% for ch in 0xC2..0xDF %} put(x, dfa_state(2, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x3F)); {% end %}
    {% for ch in 0xE0..0xE0 %} put(x, dfa_state(4, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x1F)); {% end %}
    {% for ch in 0xE1..0xEC %} put(x, dfa_state(3, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x1F)); {% end %}
    {% for ch in 0xED..0xED %} put(x, dfa_state(5, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x1F)); {% end %}
    {% for ch in 0xEE..0xEF %} put(x, dfa_state(3, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x1F)); {% end %}
    {% for ch in 0xF0..0xF0 %} put(x, dfa_state(6, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x0F)); {% end %}
    {% for ch in 0xF1..0xF3 %} put(x, dfa_state(7, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x0F)); {% end %}
    {% for ch in 0xF4..0xF4 %} put(x, dfa_state(8, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x0F)); {% end %}
    {% for ch in 0xF5..0xFF %} put(x, dfa_state(1, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x00)); {% end %}
    x
  end

  def each_char2(& : Char ->) : Nil
    state = 0_u64
    table = VALID_ENCODING_TABLE.to_unsafe
    ch = 0_u32
    s = to_unsafe
    e = s + bytesize
    prev = s
    while prev < e
      ch = ch.unsafe_shl(6) | (s.value & table[s.value].unsafe_shr(56))
      state = table[s.value].unsafe_shr(state & 0x3F)
      case state & 0x3F
      when 0
        # a complete, valid sequence: emit the decoded character
        s += 1
        yield ch.unsafe_chr
        prev = s
        ch = 0_u32
      when 6
        # invalid sequence: emit U+FFFD and restart one byte after the
        # beginning of the broken sequence
        yield Char::REPLACEMENT
        state = 0_u64
        prev += 1
        ch = 0_u32
        s = prev
      else
        # mid-sequence: keep accumulating continuation bytes
        s += 1
      end
    end
  end
end
```
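The table above calls `dfa_state` with a `mask` argument that the earlier macro does not take; a minimal sketch of the extended macro, assuming macros accept a named parameter after the splat just like regular methods do, could look like this:

```crystal
class String
  # Hypothetical mask-aware variant of the dfa_state macro from the previous
  # snippet: bits 0..53 hold the 9 packed transitions (6 bits each, storing
  # next_state * 6), bits 56..63 hold the payload mask that each_char2 reads
  # back via `table[byte].unsafe_shr(56)`.
  private macro dfa_state(*transitions, mask)
    {% x = (1_u64 << 56) * mask %}
    {% for tr, i in transitions %}
      {% x |= (1_u64 << (i * 6)) * tr * 6 %}
    {% end %}
    {{ x }}
  end
end
```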
From my own tests it is only about twice as fast as the existing implementation. The new bottleneck might be the computation of `ch`.
How are you benchmarking? In my silly benchmarks I get a 100 times worse performance with `each_char2` compared to the original.
`valid_encoding2?` performs really well though. Roughly as fast as the `@[AlwaysInline]` version I posted above, but a lot more stable in performance.
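For reference, a comparison along these lines could be driven by the standard Benchmark module; the input file name is an arbitrary placeholder and both monkeypatches are assumed to be loaded:

```crystal
require "benchmark"

data = File.read("sample_utf8.txt") # any large UTF-8 file, e.g. from the simdutf benchmark set

Benchmark.ips do |bm|
  bm.report("valid_encoding?") { data.valid_encoding? }
  bm.report("valid_encoding2?") { data.valid_encoding2? }
  bm.report("each_char") { data.each_char { |c| } }
  bm.report("each_char2") { data.each_char2 { |c| } }
end
```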
It would be interesting to see a comparison with https://github.com/simdutf/simdutf
I ran it on files from the simdutf benchmark. `valid_encoding2?` is on average 4.78 times faster than `valid_encoding?`, and `simdutf::validate_utf8` is on average 28.38 times faster than `valid_encoding?`.
Several places in stdlib deal with UTF-8 decoding:
- `String.char_bytesize_at`
- `Char::Reader#decode_char_at`
- `IO#read_char_with_bytesize`
These are three different implementations of essentially the same algorithm. The differences stem from their respective domains and are probably insignificant for the UTF-8 decoding algorithm itself.
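To make the overlap concrete, the core that those three methods share looks roughly like this (a simplified sketch with a made-up name, `decode_utf8_at`; error signaling differs between the real methods, and their overlong/surrogate checks are omitted here):

```crystal
# Classify the leading byte to get the sequence length and payload bits,
# then fold in the continuation bytes, checking the 0b10xxxxxx pattern.
# Returns {codepoint, bytes consumed}, with U+FFFD standing in for errors.
def decode_utf8_at(bytes : Bytes, i : Int32) : {Int32, Int32}
  first = bytes[i]
  return {first.to_i32, 1} if first < 0x80

  len = case first
        when 0xC2..0xDF then 2
        when 0xE0..0xEF then 3
        when 0xF0..0xF4 then 4
        else                 return {0xFFFD, 1} # invalid leading byte
        end

  codepoint = first.to_i32 & (0x7F >> len)
  (1...len).each do |k|
    byte = bytes[i + k]?
    return {0xFFFD, 1} if byte.nil?                # truncated sequence
    return {0xFFFD, 1} unless 0x80 <= byte <= 0xBF # not a continuation byte
    codepoint = (codepoint << 6) | (byte.to_i32 & 0x3F)
  end
  {codepoint, len}
end
```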
The algorithm is not very performant, though. It is my understanding there are far more efficient algorithms for UTF-8 decoding. One example is this DFA-based decoder: http://bjoern.hoehrmann.de/utf-8/decoder/dfa I did not conduct a search for other potential options yet. We can consider stand-alone implementations like the one I mentioned, as well as look at what other text libraries do (including the stdlib implementations of other languages).
Considering that UTF-8 decoding plays a major part in string operations, this would certainly have a positive effect on many Crystal programs.