crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0

Optimize UTF-8 decoding #11873

Open straight-shoota opened 2 years ago

straight-shoota commented 2 years ago

Several places in stdlib deal with UTF-8 decoding:

These are three different implementations of essentially the same algorithm. The differences stem from their respective domains and are probably insignificant to the UTF-8 decoding algorithm itself.

The algorithm is not very performant, though. It is my understanding that there are far more efficient algorithms for UTF-8 decoding. One example is Björn Höhrmann's DFA-based decoder: http://bjoern.hoehrmann.de/utf-8/decoder/dfa I have not yet searched for other options. We could consider stand-alone implementations like the one I mentioned, as well as look at what other text libraries do (including the stdlib implementations of other languages).

Considering that UTF-8 decoding plays a major part in string operations, this would certainly have a positive effect on many Crystal programs.

asterite commented 2 years ago

I think the case of IO is slightly different: unless we have at least 4 bytes of buffered data, the algorithm has to go byte by byte. But if we do have 4 bytes in the peek buffer, something more optimal than what we have right now could be done.

Then, yes, I think we should have a single place with an algorithm that takes a slice of bytes and returns the UTF-8 codepoint and how many bytes it consumed, or an error otherwise.
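
Something of this shape, perhaps (a sketch only; the module and method names are hypothetical, and overlong encodings and surrogates are not rejected here, though a real implementation must handle them):

module UTF8Decode
  # Decode the first codepoint of `bytes`, returning the char and how many
  # bytes it consumed, or nil on invalid or truncated input.
  def self.decode(bytes : Bytes) : {Char, Int32}?
    return nil if bytes.empty?
    first = bytes[0]
    len, code =
      case first
      when 0x00..0x7F then {1, first.to_u32}
      when 0xC2..0xDF then {2, (first & 0x1F).to_u32}
      when 0xE0..0xEF then {3, (first & 0x0F).to_u32}
      when 0xF0..0xF4 then {4, (first & 0x07).to_u32}
      else                 return nil
      end
    return nil if bytes.size < len
    (1...len).each do |i|
      byte = bytes[i]
      return nil unless byte.in?(0x80..0xBF)
      code = code.unsafe_shl(6) | (byte & 0x3F)
    end
    {code.unsafe_chr, len}
  end
end

UTF8Decode.decode("é".to_slice) # => {'é', 2}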

asterite commented 2 years ago

Also, to use that efficient algorithm you linked, we would need a way to put that table into ROM, but there's no way to do that right now (we need const [...])

straight-shoota commented 2 years ago

unless we have at least 4 bytes of buffered data the algorithm has to go byte per byte.

The example algorithm I referenced consumes a single byte at a time, and its status indicates whether there was an error, a sequence is complete, or more bytes need to be consumed. This is necessary when reading bytes from a stream, but also for bounds checking when a string (with potentially invalid UTF-8 encoding) is not null-terminated.
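
For illustration, a minimal hand-rolled sketch of that push-style interface (the type and method names are made up for this example, and overlong/surrogate rejection is elided; Höhrmann's DFA handles those with extra states):

class UTF8StreamDecoder
  enum Status
    Accept   # `codepoint` now holds a complete scalar value
    Continue # feed the next byte
    Reject   # invalid sequence; the caller decides how to recover
  end

  getter codepoint = 0_u32
  @remaining = 0

  # Consume one byte and report the decoder status.
  def feed(byte : UInt8) : Status
    if @remaining == 0
      case byte
      when 0x00..0x7F
        @codepoint = byte.to_u32
        return Status::Accept
      when 0xC2..0xDF then @codepoint = (byte & 0x1F).to_u32; @remaining = 1
      when 0xE0..0xEF then @codepoint = (byte & 0x0F).to_u32; @remaining = 2
      when 0xF0..0xF4 then @codepoint = (byte & 0x07).to_u32; @remaining = 3
      else
        return Status::Reject
      end
      Status::Continue
    elsif byte.in?(0x80..0xBF)
      @codepoint = @codepoint.unsafe_shl(6) | (byte & 0x3F)
      @remaining -= 1
      @remaining == 0 ? Status::Accept : Status::Continue
    else
      @remaining = 0
      Status::Reject
    end
  end
end

decoder = UTF8StreamDecoder.new
io = IO::Memory.new("héllo")
while byte = io.read_byte
  case decoder.feed(byte)
  when .accept? then print decoder.codepoint.unsafe_chr
  when .reject? then print Char::REPLACEMENT
  end
end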

yxhuvud commented 2 years ago

Couldn't the lookup tables be put into a class variable on String or something? Or does that have too much overhead?

It is my understanding there are far more efficient algorithms for UTF-8 decoding.

I looked into this a while ago, and the strategy that is currently used (at least for string UTF-8 validation) is actually supposed to be faster than the strategy based on a single lookup table, which seems to be the one linked in the description. Or at least it is possible to make it faster. If I remember correctly, the reason was that the CPU can parallelize the execution of the instructions better, even if it does execute a few more of them. But it was a while ago that I read about it, so I may misremember.

However, there are some modern approaches which are supposed to improve things a bit by using two tables instead of one. Recommended references: https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/ and https://twitter.com/pervognsen/status/1410488394535243778 (which in turn points to https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725 for actual code).

yxhuvud commented 2 years ago

Hmm. I tried profiling the following program:

long_string = "a" * (2**30)
10.times { p long_string.valid_encoding? }

This takes 24 seconds. With the following monkeypatch (compiled with --release --mcpu znver2), it drops to 5.4s:

struct Char::Reader
  @[AlwaysInline]
  def next_char : Char
    @pos &+= @current_char_width # wrapping add: no overflow check in the hot path
    if @pos > @string.bytesize
      raise IndexError.new
    end

    decode_current_char
  end
end

The bulk of the improvement is due to the @[AlwaysInline] line, but the &+= also gains a second or two. The --mcpu flag is worth a few seconds as well; I suppose it enables some vector instructions.
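
(For context: &+ is Crystal's wrapping addition, which emits no overflow check, unlike plain +. A two-line illustration:)

x = UInt8::MAX
x &+ 1 # => 0 (wraps silently; no overflow branch in the generated code)
# x + 1 would raise OverflowError, since regular arithmetic is checked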

That is quite a ridiculous difference. Unfortunately I have no idea what actually happens code-generation-wise, as @[AlwaysInline] seems to make the DWARF line number information simply disappear, in a way I haven't seen before with regular inlining initiated by the compiler. I don't know if it is due to the annotation or due to the optimizations that happen.

HertzDevil commented 2 years ago

https://twitter.com/pervognsen/status/1410488394535243778 (which then points to https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725 for actual code)

Here is a rather straightforward Crystal port of this as a new String#valid_encoding?, without the loop unrolling:

class String
  VALID_ENCODING_TABLE = begin
    x = Array(UInt64).new(256)

    # the same DFA transition table, with state 1 hidden:
    #
    #              accepted
    #              | 1 continuation byte left
    #              | | 2 continuation bytes left
    #              | | | E0-?? ??; disallow overlong encodings up to U+07FF
    #              | | | | ED-?? ??; disallow surrogate pairs
    #              | | | | | F0-?? ?? ??; disallow overlong encodings up to U+FFFF
    #              | | | | | | 3 continuation bytes left
    #              | | | | | | | F4-?? ?? ??; disallow codepoints above U+10FFFF
    #              v v v v v v v v
    #
    #            | 0 2 3 4 5 6 7 8
    # -----------+----------------
    # 0x00..0x7F | 0 _ _ _ _ _ _ _
    # 0x80..0x8F | _ 0 2 _ 2 _ 3 3
    # 0x90..0x9F | _ 0 2 _ 2 3 3 _
    # 0xA0..0xBF | _ 0 2 2 _ 3 3 _
    # 0xC2..0xDF | 2 _ _ _ _ _ _ _
    # 0xE0..0xE0 | 4 _ _ _ _ _ _ _
    # 0xE1..0xEC | 3 _ _ _ _ _ _ _
    # 0xED..0xED | 5 _ _ _ _ _ _ _
    # 0xEE..0xEF | 3 _ _ _ _ _ _ _
    # 0xF0..0xF0 | 6 _ _ _ _ _ _ _
    # 0xF1..0xF3 | 7 _ _ _ _ _ _ _
    # 0xF4..0xF4 | 8 _ _ _ _ _ _ _

    {% for ch in 0x00..0x7F %} put(x, dfa_state(0, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0x80..0x8F %} put(x, dfa_state(1, 1, 0, 2, 1, 2, 1, 3, 3)); {% end %}
    {% for ch in 0x90..0x9F %} put(x, dfa_state(1, 1, 0, 2, 1, 2, 3, 3, 1)); {% end %}
    {% for ch in 0xA0..0xBF %} put(x, dfa_state(1, 1, 0, 2, 2, 1, 3, 3, 1)); {% end %}
    {% for ch in 0xC0..0xC1 %} put(x, dfa_state(1, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xC2..0xDF %} put(x, dfa_state(2, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xE0..0xE0 %} put(x, dfa_state(4, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xE1..0xEC %} put(x, dfa_state(3, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xED..0xED %} put(x, dfa_state(5, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xEE..0xEF %} put(x, dfa_state(3, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xF0..0xF0 %} put(x, dfa_state(6, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xF1..0xF3 %} put(x, dfa_state(7, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xF4..0xF4 %} put(x, dfa_state(8, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xF5..0xFF %} put(x, dfa_state(1, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}

    x
  end

  private def self.put(array, x)
    array << x
  end

  # packs the nine 6-bit transition entries into a UInt64; each entry is
  # pre-multiplied by 6 so that it can be used directly as the shift amount
  # for the next lookup
  private macro dfa_state(*transitions)
    {% x = 0_u64 %}
    {% for tr, i in transitions %}
      {% x |= (1_u64 << (i * 6)) * tr * 6 %}
    {% end %}
    {{ x }}
  end

  def valid_encoding2? : Bool
    state = 0_u64
    table = VALID_ENCODING_TABLE.to_unsafe
    s = to_unsafe
    e = s + bytesize

    while s < e
      # the current state (already times 6) selects the next transition
      # out of the packed table entry for this byte
      state = table[s.value].unsafe_shr(state & 0x3F)
      return false if state & 0x3F == 6 # 6 == error state 1 times 6
      s += 1
    end

    state & 0x3F == 0 # accept only if we ended in state 0
  end
end

Here is a global fun that does the same thing:

@[NoInline]
fun foo_valid_encoding(str : Void*) : Bool
  str = str.as(String*)
  state = 0_u64
  table = String::VALID_ENCODING_TABLE.to_unsafe
  s = str.value.to_unsafe
  e = s + str.value.bytesize

  while s < e
    state = table[s.value].unsafe_shr(state & 0x3F)
    return false if state & 0x3F == 6
    s += 1
  end

  state & 0x3F == 0
end
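
For completeness, calling the fun from Crystal matches the Void* signature above (a usage sketch):

s = "héllo"
p foo_valid_encoding(pointerof(s).as(Void*)) # => true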

Its generated assembly under --mcpu=native is:

foo_valid_encoding:
    movq    (%rdi), %rax
    movslq  4(%rax), %rdx
    testq   %rdx, %rdx
    jle .LBB299_1
    movq    "String::VALID_ENCODING_TABLE"(%rip), %rcx
    movq    16(%rcx), %rcx
    addq    %rax, %rdx
    addq    $12, %rdx
    addq    $12, %rax
    xorl    %esi, %esi
.LBB299_4:
    movzbl  (%rax), %edi
    shrxq   %rsi, (%rcx,%rdi,8), %rsi
    andl    $63, %esi
    cmpq    $6, %rsi
    je  .LBB299_5
    incq    %rax
    cmpq    %rdx, %rax
    jb  .LBB299_4
    testq   %rsi, %rsi
    sete    %al
    retq
.LBB299_1:
    movb    $1, %al
    retq
.LBB299_5:
    xorl    %eax, %eax
    retq

It appears to be really fast, since it reproduces the shrxq %rsi, (%rcx,%rdi,8), %rsi line. It should not matter whether the DFA is placed in read-only memory or not. In practice, some kind of recovery mechanism is also needed to backtrack correctly in case of errors.

The unrolling only makes sense in the context of #valid_encoding?; if we do anything else within the loop, its performance improvements will most likely diminish.

I don't think the SIMD example is of much relevance, because it is still restricted to #valid_encoding? and the compiler offers no SIMD support at the moment.

HertzDevil commented 2 years ago

And here is an optimized version of #each_char(&), borrowing ideas from both DFA examples:

class String
  VALID_ENCODING_TABLE = begin
    x = Array(UInt64).new(256)

    # each transition is also associated with a mask (stored in bits 56..63)
    # that indicates which bits of the current byte contribute to the UTF-8
    # codepoint; this is possible because every legal UTF-8 byte uniquely
    # determines whether it is leading or trailing, and in the former case,
    # the number of trailing bytes. Thus only bits 54 and 55 remain unused
    # in the DFA representation.

    {% for ch in 0x00..0x7F %} put(x, dfa_state(0, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x7F)); {% end %}
    {% for ch in 0x80..0x8F %} put(x, dfa_state(1, 1, 0, 2, 1, 2, 1, 3, 3, mask: 0x3F)); {% end %}
    {% for ch in 0x90..0x9F %} put(x, dfa_state(1, 1, 0, 2, 1, 2, 3, 3, 1, mask: 0x3F)); {% end %}
    {% for ch in 0xA0..0xBF %} put(x, dfa_state(1, 1, 0, 2, 2, 1, 3, 3, 1, mask: 0x3F)); {% end %}
    {% for ch in 0xC0..0xC1 %} put(x, dfa_state(1, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x00)); {% end %}
    {% for ch in 0xC2..0xDF %} put(x, dfa_state(2, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x3F)); {% end %}
    {% for ch in 0xE0..0xE0 %} put(x, dfa_state(4, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x1F)); {% end %}
    {% for ch in 0xE1..0xEC %} put(x, dfa_state(3, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x1F)); {% end %}
    {% for ch in 0xED..0xED %} put(x, dfa_state(5, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x1F)); {% end %}
    {% for ch in 0xEE..0xEF %} put(x, dfa_state(3, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x1F)); {% end %}
    {% for ch in 0xF0..0xF0 %} put(x, dfa_state(6, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x0F)); {% end %}
    {% for ch in 0xF1..0xF3 %} put(x, dfa_state(7, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x0F)); {% end %}
    {% for ch in 0xF4..0xF4 %} put(x, dfa_state(8, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x0F)); {% end %}
    {% for ch in 0xF5..0xFF %} put(x, dfa_state(1, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x00)); {% end %}

    x
  end
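
  # NOTE: this is a reconstruction, not part of the original comment. The
  # snippet reuses `put` from the previous comment, but `dfa_state` must be
  # redefined to accept the named `mask` argument and pack it into bits 56..63:
  private macro dfa_state(*transitions, mask)
    {% x = (1_u64 << 56) * mask %}
    {% for tr, i in transitions %}
      {% x |= (1_u64 << (i * 6)) * tr * 6 %}
    {% end %}
    {{ x }}
  end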

  def each_char2(& : Char ->) : Nil
    state = 0_u64
    table = VALID_ENCODING_TABLE.to_unsafe
    ch = 0_u32
    s = to_unsafe
    e = s + bytesize
    prev = s

    while prev < e
      # accumulate codepoint bits using the per-byte mask in bits 56..63
      ch = ch.unsafe_shl(6) | (s.value & table[s.value].unsafe_shr(56))
      state = table[s.value].unsafe_shr(state & 0x3F)
      case state & 0x3F
      when 0 # accept: a complete codepoint was decoded
        s += 1
        yield ch.unsafe_chr
        prev = s
        ch = 0_u32
      when 6 # error: emit U+FFFD, restart one byte past the start of the bad sequence
        yield Char::REPLACEMENT
        state = 0_u64
        prev += 1
        ch = 0_u32
        s = prev
      else # in the middle of a multi-byte sequence
        s += 1
      end
    end
  end
end

From my own tests it is only about twice as fast as the existing implementation. The new bottleneck might be the computation of ch.

yxhuvud commented 2 years ago

How are you benchmarking? In my silly benchmarks, each_char2 is about 100 times slower than the original.

valid_encoding? performs really well, though: roughly as fast as the @[AlwaysInline] version I posted above, but a lot more stable in performance.
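
For reference, a minimal harness along these lines (a sketch; the exact inputs used in the thread are not shown) would make the numbers comparable:

require "benchmark"

long_string = "a" * (2 ** 30)

Benchmark.ips do |x|
  x.report("valid_encoding?")  { long_string.valid_encoding? }
  x.report("valid_encoding2?") { long_string.valid_encoding2? }
  x.report("each_char")  { long_string.each_char { |_c| } }
  x.report("each_char2") { long_string.each_char2 { |_c| } }
end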

kostya commented 2 years ago

It would be interesting to see a comparison with https://github.com/simdutf/simdutf

kostya commented 2 years ago

I ran this on the files from the simdutf benchmark: valid_encoding2? is on average 4.78 times faster than valid_encoding?, and simdutf::validate_utf8 is on average 28.38 times faster than valid_encoding?.