Closed carenas closed 9 months ago
Hi @carenas,
The regex engine is PCRE2, and PHP is one example of a language that embeds it, but I don't think the main focus is to keep up with PHP's own implementation. I looked for \p{hex} in pcre2.txt but I couldn't find it ... yet? Maybe they'll implement that soon.
Probably not the answer you're looking for, though.
"I looked for \p{hex} in pcre2.txt but I couldn't find it."
The list of properties depends on the Unicode data and is therefore not listed there, but you can get it from:
% pcre2test -LP
-------------------------- SUPPORTED PROPERTIES --------------------------
This release of PCRE2 supports Unicode's general category properties such
as Lu (upper case letter), bi-directional properties such as Bidi_Class,
and the following binary (yes/no) properties:
asciihexdigit (ahex)             hexdigit (hex)
alphabetic (alpha)               idcontinue (idc)
ascii                            ideographic (ideo)
bidicontrol (bidic)              idstart (ids)
bidimirrored (bidim)             idsbinaryoperator (idsb)
cased                            idstrinaryoperator (idst)
caseignorable (ci)               joincontrol (joinc)
changeswhencasefolded (cwcf)     logicalorderexception (loe)
changeswhencasemapped (cwcm)     lowercase (lower)
changeswhenlowercased (cwl)      math
changeswhentitlecased (cwt)      noncharactercodepoint (nchar)
changeswhenuppercased (cwu)      patternsyntax (patsyn)
dash                             patternwhitespace (patws)
defaultignorablecodepoint (di)   prependedconcatenationmark (pcm)
deprecated (dep)                 quotationmark (qmark)
diacritic (dia)                  radical
emojimodifierbase (ebase)        regionalindicator (ri)
emojicomponent (ecomp)           softdotted (sd)
emojimodifier (emod)             sentenceterminal (sterm)
emoji                            whitespace (space, wspace)
emojipresentation (epres)        terminalpunctuation (term)
extender (ext)                   unifiedideograph (uideo)
extendedpictographic (extpict)   uppercase (upper)
graphemebase (grbase)            variationselector (vs)
graphemeextend (grext)           xidcontinue (xidc)
graphemelink (grlink)            xidstart (xids)
As for PHP's internal PCRE2 version, as I mentioned earlier, I made sure that it includes that property and can therefore use it, which is why I was puzzled by the error message.
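In case it helps whoever updates the site's validation, here is a rough Python sketch of turning the -LP listing above into a lookup table of accepted names. The LISTING excerpt, the ENTRY pattern, and parse_properties are all hypothetical names made up for this illustration, and the parsing assumes the two-column "name (alias, alias)" layout shown above; it is not how the site or PCRE2 itself does it.

```python
import re

# A few rows copied from the `pcre2test -LP` listing above (not the full output).
LISTING = """\
asciihexdigit (ahex)             hexdigit (hex)
ascii                            ideographic (ideo)
changeswhenlowercased (cwl)      math
emoji                            whitespace (space, wspace)
"""

# One entry is a lowercase name, optionally followed by "(alias, alias...)".
ENTRY = re.compile(r"([a-z]+)(?:\s*\(([^)]*)\))?")


def parse_properties(listing):
    """Map every property name and alias to its long property name."""
    names = {}
    for name, aliases in ENTRY.findall(listing):
        names[name] = name
        for alias in aliases.split(","):
            alias = alias.strip()
            if alias:  # entries without parentheses have no aliases
                names[alias] = name
    return names


props = parse_properties(LISTING)
print(props.get("hex"))   # hexdigit
print("wspace" in props)  # True
```

Feeding it the full listing from whatever PCRE2 version the site ships would show directly whether a name like hex is expected to work there.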
Got you. Yeah, it seems PCRE2 10.42 knows about \p{hex}, but 10.34, for example, does not. We might need to update the PCRE2 engine the site uses.
PCRE2 version 10.42 2022-12-11
re> /\p{hex}+/g
------------------------------------------------------------------
  0   7 Bra
  3     prop Hexdigit ++
  7   7 Ket
 10     End
------------------------------------------------------------------
Capture group count = 0
Subject length lower bound = 1
data> abcdef
0: abcdef
PCRE2 version 10.34 2019-11-21
re> /\p{hex}+/g
Failed: error 147 at offset 7: unknown property name after \P or \p
re>
What I meant to say is that languages sometimes implement extra things on top of a regex engine, even though they're based on a specific standard. So PHP could implement some other oddity around PCRE2 that the official PCRE2 engine does not know about. Not the case here, though.
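For anyone stuck on an engine that rejects \p{hex} like 10.34 does above, an explicit character class is one possible stand-in. The sketch below uses Python's re module (not PCRE2) purely to illustrate the idea; HEX_DIGIT_CLASS and find_hex_runs are hypothetical names, and the ranges are the code points that Unicode's Hex_Digit property covers: the ASCII hex digits plus their fullwidth forms.

```python
import re

# Code points with Unicode's Hex_Digit property: ASCII 0-9, A-F, a-f and
# their fullwidth counterparts (U+FF10-FF19, U+FF21-FF26, U+FF41-FF46).
HEX_DIGIT_CLASS = "[0-9A-Fa-f\uFF10-\uFF19\uFF21-\uFF26\uFF41-\uFF46]"

# Stand-in for /\p{hex}+/g on an engine that lacks the property.
HEX_RUN = re.compile(HEX_DIGIT_CLASS + "+")


def find_hex_runs(subject):
    """Return every maximal run of hex digits in the subject string."""
    return HEX_RUN.findall(subject)


print(find_hex_runs("abcdef"))  # ['abcdef']
print(find_hex_runs("xyz"))     # []
```

The first call mirrors the "abcdef" subject from the 10.42 session above. In PCRE2 syntax the same class would presumably be spelled with \x{FF10}-style escapes in UTF mode, but I haven't run that against 10.34.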
Thanks, I'll add support for these scripts.
Bug Description
PCRE2 allows specifying a broad set of Unicode properties, but the website seems to imply that only scripts are valid, showing the following error:
\p{hex} This script is unknown/invalid
Reproduction steps
\p{hex}
Expected Outcome
A correct match; verified to work at least with PHP 8.2.
Browser
any
OS
any