Open slevithan opened 2 weeks ago
Thank you!
One open question I have is whether all enclosing marks are allowed where...
What do you recommend?
Used interpolation to not repeat the byte subpattern.
I had implemented this the same way before, but some bundlers had trouble tree-shaking it and included the regex in every bundle, whether used or not. So I repeated it. Due to compression, it is probably also better in terms of bundle size.
One open question I have is whether all enclosing marks are allowed where...
What do you recommend?
I recommend using it as written, with \p{Me} instead of \u20E3. It won't create problems and might allow for edge-case emoji I can't think of or future emoji. But I will follow up with additional investigation.
Used interpolation to not repeat the byte subpattern.
I had implemented this the same way before, but some bundlers had trouble tree-shaking it and included the regex in every bundle, whether used or not. So I repeated it. Due to compression, it is probably also better in terms of bundle size.
Fair enough, if bundle size trumps readability. Another solution would be to use the regex package since its subroutines syntax allows subpattern composition without interpolation. But I'm guessing you won't think the dependency (despite being lightweight) is the right tradeoff since you currently advertise as having no dependencies.
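For readers following along, here is a rough sketch of the two styles being discussed: composing the byte subpattern via interpolation versus repeating it inline. The names BYTE, IPV4_COMPOSED, and IPV4_REPEATED are made up for illustration, and the full-address wrapper is an assumption rather than Valibot's actual source. Only the byte subpattern itself is taken from the PR description.

```javascript
// Hypothetical names; only the byte subpattern comes from this PR's description.
const BYTE = String.raw`(?:2[0-4]\d|25[0-5]|1\d{2}|[1-9]?\d)`;

// Style 1: interpolation. The subpattern is written once and reused.
const IPV4_COMPOSED = new RegExp(`^${BYTE}(?:\\.${BYTE}){3}$`, 'u');

// Style 2: repetition. The subpattern is spelled out twice in one literal,
// which keeps the regex a plain constant with no runtime construction.
const IPV4_REPEATED =
  /^(?:2[0-4]\d|25[0-5]|1\d{2}|[1-9]?\d)(?:\.(?:2[0-4]\d|25[0-5]|1\d{2}|[1-9]?\d)){3}$/u;

// Both styles should accept and reject the same inputs.
for (const input of ['192.168.0.1', '255.255.255.255', '256.1.1.1', '1.2.3', '01.2.3.4']) {
  console.log(input, IPV4_COMPOSED.test(input), IPV4_REPEATED.test(input));
}
```

The repeated form is harder to read, but it is a single literal, which is the tradeoff described above.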
Thank you for your feedback! It may take some time to finish this PR as other things have a higher priority at the moment.
I've done a lot more research on the emoji regex and published an improved/fixed version as emoji-regex-xs. I'll update the PR here shortly, and also remove all the use of composition in the regexes since it sounds like you'd prefer to avoid that.
Thank you very much! I will probably review and merge this PR next week.
OK, updated. Passes tests and lint.
In the last-added commit, I updated the comment about RGI_Emoji, pointing out that it's not a better replacement, even in the future when targeting Node.js 20+. The reasons are discussed in the emoji-regex-xs readme, but I'll also describe them here.
To clarify what the emoji regex in this PR matches:
/\p{RGI_Emoji}/v.
On that last point, unfortunately, some common-sense and broadly-supported emoji are not officially in the "RGI" list. Even the Unicode org provides emoji-test.txt that mixes in non-RGI emoji strings to help identify real-world emoji. And some emoji are commonly used in an underqualified or overqualified way (by including or excluding certain invisible Unicode markers) that prevents them from being matched by RGI_Emoji. For example, the iOS emoji keyboard overqualifies certain emoji. So we need something that matches everything in RGI_Emoji, and more.
The regex here allows overqualified and underqualified emoji using a general pattern that matches all Unicode sequences that follow the structure of valid emoji.
Thank you for your research! Is the new emoji regex more strict or accurate than the old one?
Yes. If by "the old one" you meant my own iterations, this is the final version, based on in-depth research, and I've now also published it as its own library (emoji-regex-xs). If there are changes to emoji-regex-xs in the future (e.g., if new versions of Unicode modify the general patterns for emoji), anyone can easily update it by copying the pattern from future versions of emoji-regex-xs and wrapping it in ^(?:…)+$. The wrapper is needed because Valibot's emoji regex matches one or more emoji, as the whole string.
This library shares the API and 3,000+ tests with emoji-regex. emoji-regex is very large (13 KB uncompressed), but it is authoritative (its author helped add things like \p{}, RGI_Emoji, and /v to JS regexes) and extremely broadly used. emoji-regex now also recommends and crosslinks emoji-regex-xs in its readme, for people who want a general pattern (not tied to a specific version of Unicode) that can therefore be much more lightweight.
If by "the old one" you meant the regex currently used in the code that this PR replaces, then yes, this PR is both more strict and more accurate.
Here is the current regex that is being replaced:
/^[\p{Extended_Pictographic}\p{Emoji_Component}]+$/u
This is extremely wrong. I'm assuming you picked it up from Zod, which uses the same thing, and I can find other people posting it online, which is where Zod probably got it from. It presumably has spread virally because:
\p{Emoji_Component} match, or the full variation of different Unicode pattern sequences needed to match the full set of emoji, much less to also deal with real-world data (where some additional variation is added).
\p{RGI_Emoji}, but that has some problems as well, which is why both emoji-regex and emoji-regex-xs match supersets of what it matches.
There are two big problems with the regex that this PR replaces:
\p{Emoji_Component} means it cannot possibly identify where one emoji ends and the next starts. It only works at all because Valibot's requirement for this regex is to match one or more back-to-back emoji.
regex.ts match one or more of their target. I suspect (without evidence) that this anomaly started because this pattern simply can't be modified to accurately match a single, complete emoji, and the authors of Zod couldn't find a reliable yet short alternative for doing so.
Regarding the first problem, I already mentioned some of its false positives in earlier comments, but here are some additional details (not comprehensive):
* (U+002A) and # (U+0023).
0, 1, 234, etc.).
✈ (U+2708).
✈️ (U+2708 U+FE0F).
/[\u{1F1E6}-\u{1F1FF}\u{E0030}-\u{E0039}\u{E0061}-\u{E007A}]/u). These can be composed into emoji flags and (with additional markers) subdivision flags, but they are not emoji on their own.
The emoji regex in this PR fixes all of these issues.
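To make the false positives above easy to reproduce, here is a small sketch that runs the regex being replaced (quoted earlier in this comment) against a few of the inputs mentioned. The constant name OLD_EMOJI_REGEX and the sample list are only for illustration.

```javascript
// The pre-PR regex, as quoted above. The constant name is illustrative.
const OLD_EMOJI_REGEX = /^[\p{Extended_Pictographic}\p{Emoji_Component}]+$/u;

// Strings that are not complete emoji on their own.
const samples = [
  '*',         // U+002A
  '#',         // U+0023
  '0',         // a bare digit
  '234',       // several bare digits
  '\u2708',    // airplane without U+FE0F
  '\u200D',    // bare ZWJ
  '\u{1F1E6}', // a single regional indicator
];

for (const sample of samples) {
  // Each of these is accepted by the old pattern.
  console.log(JSON.stringify(sample), OLD_EMOJI_REGEX.test(sample));
}
```

Because the pattern is a single character class, any run of these code points passes, which is also why it cannot tell where one emoji ends and the next starts.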
Thanks again for your research and detailed answer! I thought it would be the best DX if emoji could be combined with minLength, maxLength and length to control the number of emoji. Unfortunately, this does not work because emoji can have different lengths. An alternative would be to control this via the arguments of emoji and dynamically bake it into the regex. What do you think is the best solution?
I'm not familiar with Valibot's APIs. Could minLength, maxLength, and length not be sensibly used to refer to the number of discrete emoji to match, rather than the number of code points (returned by [...str].length) or code units (str.length) in the matched string? If they could, then both of the APIs you described sound reasonable to me, but I might be more opinionated if you showed code examples of how the two approaches would be used.
Certainly, with the new emoji regex, something like this could easily be done. You'd just need to change the + quantifier in the ^(?:…)+$ wrapper to whatever you want (e.g., {1}, {5}, {2,17}, or {1,}). It cannot be done with the current (pre-PR) emoji regex, for reasons explained in my last comment.
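As a minimal sketch of the quantifier idea, assuming a pattern that matches exactly one emoji is available: the SINGLE_EMOJI string below is a deliberately simplistic stand-in, not the pattern from this PR or from emoji-regex-xs, and emojiCountRegex is a hypothetical helper, not a Valibot API.

```javascript
// Stand-in for a single-emoji pattern. NOT the pattern from this PR;
// it only handles simple pictographs with an optional U+FE0F.
const SINGLE_EMOJI = String.raw`\p{Extended_Pictographic}\uFE0F?`;

// Hypothetical helper: builds the ^(?:…)QUANTIFIER$ wrapper with a chosen count.
function emojiCountRegex(min = 1, max = Infinity) {
  const quantifier = max === Infinity ? `{${min},}` : `{${min},${max}}`;
  return new RegExp(`^(?:${SINGLE_EMOJI})${quantifier}$`, 'u');
}

const grin = '\u{1F600}';
console.log(emojiCountRegex(2, 2).test(grin + grin));        // exactly two
console.log(emojiCountRegex(2, 2).test(grin));               // too few
console.log(emojiCountRegex(1, 2).test(grin + grin + grin)); // too many
console.log(emojiCountRegex().test(grin + grin + grin));     // one or more
```

This only works because the inner pattern matches one complete emoji at a time, which is the property the pre-PR regex lacks.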
But this seems like something for a follow-up issue. I'd prefer to land this PR as is and for new functionality to be added afterward.
PS: The labels for this PR should include bug.
But this seems like something for a follow-up issue. I'd prefer to land this PR as is and for new functionality to be added afterward.
I agree.
[...] be sensibly used to refer to the number of discrete emoji to match
Any ideas on how to implement this? Maybe we could add something like a minChar, maxChar and char action.
Okay, I looked at valibot.dev/api/string/ and valibot.dev/api/emoji/ to understand a bit more what you're referring to.
I agree that minLength, maxLength, and length must continue to refer to JavaScript's UTF-16 code units.
The term "character" is very overloaded so I'd advise against using char, etc. I think the concept you're looking for is "grapheme clusters" or "extended grapheme clusters", which Unicode also describes as "user-perceived characters".
A concrete example is the emoji '👩🏻👩🏻👦🏻👦🏻'. Unicode Name: Family - Woman: Light Skin Tone, Woman: Light Skin Tone, Boy: Light Skin Tone, Boy: Light Skin Tone.
Edit: This might not actually be a great example because it's only currently rendered as a single user-perceived character on Microsoft and Facebook platforms, but if you're not on Windows, imagine it rendered like this.
// Code unit length
'👩🏻👩🏻👦🏻👦🏻'.length;
// → 19
// Each astral code point (above U+FFFF) is divided into high and low surrogates
// Code point length
[...'👩🏻👩🏻👦🏻👦🏻'].length;
// → 11
// These are: U+1F469 U+1F3FB U+200D U+1F469 U+1F3FB U+200D U+1F466 U+1F3FB U+200D U+1F466 U+1F3FB
// Grapheme length
// I don't think there's a native JS method to count this, but there are at least 4 graphemes:
// (U+1F469 U+1F3FB) (U+1F469 U+1F3FB) (U+1F466 U+1F3FB) (U+1F466 U+1F3FB)
// Not sure whether ZWJ (U+200D) also counts as a grapheme, is a non-grapheme, or gets included with graphemes
// Grapheme cluster length
[...new Intl.Segmenter().segment('👩🏻👩🏻👦🏻👦🏻')].length;
// → 1
Intl.Segmenter is in Node.js 16+ (browser support). The JS library orling/grapheme-splitter extends support backward.
Since Intl.Segmenter's documentation uses "grapheme" to refer to grapheme clusters, I think it's okay to use grapheme the same way (even though it actually segments clusters of graphemes), especially if Intl.Segmenter is used as the implementation.
I think minGraphemeLength, maxGraphemeLength, and graphemeLength would be significantly preferable to controlling this via the arguments of emoji, since counting grapheme clusters is broadly useful. E.g. the Spanish letter ñ can be represented as either '\u00F1' or '\u006E\u0303'. So either 1 or 2 code points, but 1 grapheme (and 1 grapheme cluster).
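A quick sketch of the ñ case, in case it helps: countGraphemes is a hypothetical helper written for this comment (not an existing Valibot action), and it assumes Intl.Segmenter is available in the runtime.

```javascript
// Hypothetical helper; assumes a runtime with Intl.Segmenter (e.g. Node.js 16+).
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

function countGraphemes(str) {
  let count = 0;
  for (const _segment of segmenter.segment(str)) count++;
  return count;
}

const precomposed = '\u00F1';      // ñ as a single code point
const decomposed = '\u006E\u0303'; // n followed by a combining tilde

// Code units, code points, grapheme clusters
console.log(precomposed.length, [...precomposed].length, countGraphemes(precomposed));
// → 1 1 1
console.log(decomposed.length, [...decomposed].length, countGraphemes(decomposed));
// → 2 2 1
```

A grapheme-based length check would treat both spellings the same, which the existing code-unit-based length checks cannot do.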
Just realized that my example emoji '👩🏻👩🏻👦🏻👦🏻' currently only renders as a single user-perceived character on Microsoft and Facebook platforms. Since other platforms don't currently include a unique emoji design for it, they might render it as four discrete emoji characters in a row (one for each of the graphemes in the grapheme cluster), possibly while still selecting it as a single character.
I didn't audit all the regexes. Just the specific ones described below.
EMOJI_REGEX
0, etc.), *, bare U+200D (ZWJ), and some symbols like 👁, ✈, 🏳, and ♂ even when they're not followed by U+FE0F (none of which should match). So I fixed it.
\p{Me}, or if it's a more limited set like just U+20E3. The segment in question (\p{Emoji}\uFE0F\p{Me}?) is used to match emoji like 2️⃣ which is made up of 3 code points: U+32 U+FE0F U+20E3.
HEXADECIMAL_REGEX
0h|0x with 0[hx].
IPV4_REGEX
(?:(?:[1-9]|1\d|2[0-4])?\d|25[0-5]) with (?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d). IMO it's easier to read without the nested grouping, and it's the same length. Then replaced \d\d with \d{2} to work around an eslint error that I don't agree with.
IPV6_REGEX
IP_REGEX
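For anyone reviewing the IPV4_REGEX change described above, this is a small sketch that checks the two byte subpatterns accept the same strings. The constant names and the ^…$ anchoring are added here for the comparison and are not how the subpattern appears in regex.ts.

```javascript
// Byte subpatterns from the description above, anchored for standalone testing.
const OCTET_BEFORE = /^(?:(?:[1-9]|1\d|2[0-4])?\d|25[0-5])$/u;
const OCTET_AFTER = /^(?:2[0-4]\d|25[0-5]|1\d{2}|[1-9]?\d)$/u;

// Candidate strings: 0 to 999, plus zero-padded forms such as "07" and "007".
const candidates = new Set();
for (let n = 0; n < 1000; n++) {
  const text = String(n);
  candidates.add(text);
  candidates.add(text.padStart(2, '0'));
  candidates.add(text.padStart(3, '0'));
}

const mismatches = [...candidates].filter(
  (text) => OCTET_BEFORE.test(text) !== OCTET_AFTER.test(text)
);
const accepted = [...candidates].filter((text) => OCTET_AFTER.test(text));

console.log('mismatches:', mismatches.length);
console.log('accepted values:', accepted.length);
```

If the rewrite is behavior-preserving, there should be no mismatches, and the accepted set should be exactly the unpadded values 0 through 255.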