dart-lang / yaml

A Dart YAML parser.
https://pub.dev/packages/yaml
MIT License
169 stars 58 forks source link

Handle surrogate pairs during scanning #159

Closed tamcy closed 1 month ago

tamcy commented 5 months ago

Fixes #53.

The problem

Issue #53 is about adding support to Emoji characters, and the root cause of this issue is that the current YAML package doesn't handle surrogate pairs when parsing.

Why?

In the YAML spec, a character is a Unicode code point (quote Section 5.2: "All characters mentioned in this specification are Unicode code points. Each such code point is written as one or more bytes depending on the character encoding used. Note that in UTF-16, characters above #xFFFF are written as four bytes, using a surrogate pair."). And in Section 5.1, the spec defines a list of acceptable Unicode characters, with UTF-16 surrogates explicitly excluded.

On the other hand, this package utilizes the string_scanner package to do string parsing or scanning, especially its peekChar() and readChar() methods. And the consumed "character" (in int) will be passed to the _isStandardCharacter() method in scanner.dart to test whether this Unicode character is an allowed one:

https://github.com/dart-lang/yaml/blob/e5984433a2803d5c67ed0abac5891a55040381ee/lib/src/scanner.dart#L1594-L1598

The problem is that a character (or char) returned by peekChar() or readChar() in string_scanner is actually a "code unit", and the "character" expected by _isStandardCharacter() is a "code point". As a result, peekChar() and readChar() can return a surrogate (usually the higher one) that represents a paired code units, and the value will be immediately rejected by _isStandardCharacter().

The impact is that the package will reject all code points that requires a surrogate pairs if the string is not quoted. This doesn't only affect Emojis.

A note on my previous attempt

I had submitted a pull request at https://github.com/dart-lang/string_scanner/pull/69, hoping that it can ease fixing the issue of the yaml package. But later I feel increasing uncomfortable about changing a codepoint-centric method to accept an offset argument that act upon code units. So I asked to make that pull request on hold, and decided to present my idea here for feedback first.

What is in this pull request

This pull request doesn’t need any modifications to the string_scanner package as I’ve copied the required utility functions from it. Basically, I’ve implemented the following changes:

  1. I have replaced all instances of _scanner.readChar() with _scanner.readCodePoint(). Although _scanner.readChar() could still be used in some areas, I’ve chosen to maintain consistency to prevent potential confusion or unintentional bugs in the future.
  2. Instances that invoke _isStandardCharacter() have been modified to call the new method _isStandardCharacterAt(). This new method is responsible for checking and merging surrogate pairs, and then passing the actual codepoint/character to _isStandardCharacter().

Misc.

benchmark.dart output before and after the fix (in JIT mode):

Before

Run #1: 0.124ms ============
Run #3: 0.116ms ===========
Run #4: 0.111ms ===========
Run #5: 0.110ms ===========
Run #8: 0.107ms ==========
Run #9: 0.106ms ==========
Run #10: 0.105ms ==========
Run #11: 0.099ms =========
Run #12: 0.092ms =========
Run #15: 0.091ms =========
Run #17: 0.089ms ========
Run #23: 0.077ms =======
Run #29: 0.065ms ======
Run #30: 0.064ms ======
Run #32: 0.059ms =====
Run #34: 0.057ms =====
Run #40: 0.056ms =====
Best   : 0.056ms =====

After

Run #1: 0.110ms ===========
Run #3: 0.100ms ==========
Run #5: 0.090ms =========
Run #9: 0.082ms ========
Run #10: 0.072ms =======
Run #11: 0.070ms =======
Run #12: 0.064ms ======
Run #14: 0.062ms ======
Run #18: 0.061ms ======
Run #21: 0.059ms =====
Best   : 0.059ms =====
kevmoo commented 5 months ago

@devoncarew ? @natebosch ?

natebosch commented 5 months ago

@lrhn - should we be using package:characters or does this look sufficient?

lrhn commented 5 months ago

The YAML spec seems to be defined in terms of valid code points, which are almost all code points except specifically excluded ranges.

Grapheme clusters shouldn't be necessary to validate that. The excluded code points don't generally combine, so they're unlikely to be part of a larger grapheme cluster.

Code points should be enough.

tamcy commented 5 months ago

In the latest revision I'd changed back to use readChar() whenever possible. Also I have removed the need to decode the surrogate for further checking because a valid surrogate pair itself should be of the range between 0x10000 to 0x10FFFF, so just checking for the validaity of the high and low surrogate is enough.

kevmoo commented 3 months ago

Any interest in moving this forward?

kevmoo commented 1 month ago

Thanks so much, @tamcy !