Closed tamcy closed 1 month ago
@devoncarew ? @natebosch ?
@lrhn - should we be using package:characters
or does this look sufficient?
The YAML spec seems to be defined in terms of valid code points, which are almost all code points except specifically excluded ranges.
Grapheme clusters shouldn't be necessary to validate that. The excluded code points don't generally combine, so they're unlikely to be part of a larger grapheme cluster.
Code points should be enough.
In the latest revision I'd changed back to use readChar()
whenever possible. Also I have removed the need to decode the surrogate for further checking because a valid surrogate pair itself should be of the range between 0x10000
to 0x10FFFF
, so just checking for the validaity of the high and low surrogate is enough.
Any interest in moving this forward?
Thanks so much, @tamcy !
Fixes #53.
The problem
Issue #53 is about adding support to Emoji characters, and the root cause of this issue is that the current YAML package doesn't handle surrogate pairs when parsing.
Why?
In the YAML spec, a character is a Unicode code point (quote Section 5.2: "All characters mentioned in this specification are Unicode code points. Each such code point is written as one or more bytes depending on the character encoding used. Note that in UTF-16, characters above #xFFFF are written as four bytes, using a surrogate pair."). And in Section 5.1, the spec defines a list of acceptable Unicode characters, with UTF-16 surrogates explicitly excluded.
On the other hand, this package utilizes the
string_scanner
package to do string parsing or scanning, especially itspeekChar()
andreadChar()
methods. And the consumed "character" (inint
) will be passed to the_isStandardCharacter()
method inscanner.dart
to test whether this Unicode character is an allowed one:https://github.com/dart-lang/yaml/blob/e5984433a2803d5c67ed0abac5891a55040381ee/lib/src/scanner.dart#L1594-L1598
The problem is that a character (or char) returned by
peekChar()
orreadChar()
instring_scanner
is actually a "code unit", and the "character" expected by_isStandardCharacter()
is a "code point". As a result,peekChar()
andreadChar()
can return a surrogate (usually the higher one) that represents a paired code units, and the value will be immediately rejected by_isStandardCharacter()
.The impact is that the package will reject all code points that requires a surrogate pairs if the string is not quoted. This doesn't only affect Emojis.
A note on my previous attempt
I had submitted a pull request at https://github.com/dart-lang/string_scanner/pull/69, hoping that it can ease fixing the issue of the yaml package. But later I feel increasing uncomfortable about changing a codepoint-centric method to accept an
offset
argument that act upon code units. So I asked to make that pull request on hold, and decided to present my idea here for feedback first.What is in this pull request
This pull request doesn’t need any modifications to the
string_scanner
package as I’ve copied the required utility functions from it. Basically, I’ve implemented the following changes:_scanner.readChar()
with_scanner.readCodePoint()
. Although_scanner.readChar()
could still be used in some areas, I’ve chosen to maintain consistency to prevent potential confusion or unintentional bugs in the future._isStandardCharacter()
have been modified to call the new method_isStandardCharacterAt()
. This new method is responsible for checking and merging surrogate pairs, and then passing the actual codepoint/character to _isStandardCharacter().Misc.
benchmark.dart output before and after the fix (in JIT mode):
Before
After