firasdib / Regex101

This repository is currently only used for issue tracking for www.regex101.com

\p{} doesn't support all available properties by PCRE2 when testing through PHP #2155

Closed carenas closed 9 months ago

carenas commented 9 months ago

Bug Description

PCRE2 accepts a broad set of Unicode properties in \p{}, but the website seems to treat only script names as valid and shows the following error:

\p{hex} This script is unknown/invalid

Reproduction steps

\p{hex}

Expected Outcome

A correct match; verified to work with at least PHP 8.2.

Browser

any

OS

any

working-name commented 9 months ago

Hi @carenas,

The site's regex engine is PCRE2. PHP is one example of an implementation built on it, but I don't think the main focus is to keep up with PHP's own implementation. I looked for \p{hex} in pcre2.txt but I couldn't find it ... yet? Maybe they'll implement it soon.

Probably not the answer you're looking for, though.

carenas commented 9 months ago

> I looked for \p{hex} in pcre2.txt but I couldn't find it.

The list of properties depends on the Unicode data and is therefore not included in that file, but you can get it from:

% pcre2test -LP   
-------------------------- SUPPORTED PROPERTIES --------------------------

This release of PCRE2 supports Unicode's general category properties such
as Lu (upper case letter), bi-directional properties such as Bidi_Class,
and the following binary (yes/no) properties:

asciihexdigit (ahex)                    hexdigit (hex)
alphabetic (alpha)                      idcontinue (idc)
ascii                                   ideographic (ideo)
bidicontrol (bidic)                     idstart (ids)
bidimirrored (bidim)                    idsbinaryoperator (idsb)
cased                                   idstrinaryoperator (idst)
caseignorable (ci)                      joincontrol (joinc)
changeswhencasefolded (cwcf)            logicalorderexception (loe)
changeswhencasemapped (cwcm)            lowercase (lower)
changeswhenlowercased (cwl)             math
changeswhentitlecased (cwt)             noncharactercodepoint (nchar)
changeswhenuppercased (cwu)             patternsyntax (patsyn)
dash                                    patternwhitespace (patws)
defaultignorablecodepoint (di)          prependedconcatenationmark (pcm)
deprecated (dep)                        quotationmark (qmark)
diacritic (dia)                         radical
emojimodifierbase (ebase)               regionalindicator (ri)
emojicomponent (ecomp)                  softdotted (sd)
emojimodifier (emod)                    sentenceterminal (sterm)
emoji                                   whitespace (space, wspace)
emojipresentation (epres)               terminalpunctuation (term)
extender (ext)                          unifiedideograph (uideo)
extendedpictographic (extpict)          uppercase (upper)
graphemebase (grbase)                   variationselector (vs)
graphemeextend (grext)                  xidcontinue (xidc)
graphemelink (grlink)                   xidstart (xids)
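
The two-column listing above can be turned into a lookup table to check whether a given name or alias (such as hex) is known. The following is only a minimal Python sketch: it assumes the layout shown above (entries separated by two or more spaces, aliases in parentheses), and `SAMPLE`, `ENTRY`, and `parse_properties` are hypothetical names chosen for illustration. `SAMPLE` holds a few lines copied from the listing.

```python
import re

# A few lines copied from the listing above; real input would be the
# full text printed by `pcre2test -LP`.
SAMPLE = """\
asciihexdigit (ahex)                    hexdigit (hex)
alphabetic (alpha)                      idcontinue (idc)
ascii                                   ideographic (ideo)
emoji                                   whitespace (space, wspace)
"""

# One entry: a lowercase long name, optionally followed by "(alias, ...)".
ENTRY = re.compile(r"([a-z]+)(?:\s+\(([^)]*)\))?")


def parse_properties(listing):
    """Map every long name and alias to the property's long name."""
    table = {}
    for line in listing.splitlines():
        # Assumption: columns are separated by at least two spaces.
        for cell in re.split(r"\s{2,}", line.strip()):
            match = ENTRY.fullmatch(cell)
            if match is None:
                continue  # skip blank lines, headers, and prose
            name, aliases = match.group(1), match.group(2)
            table[name] = name
            if aliases:
                for alias in aliases.split(","):
                    table[alias.strip()] = name
    return table


table = parse_properties(SAMPLE)
print(table.get("hex"))
```

Looking up `hex` in the resulting table yields the long name `hexdigit`, while a name that is absent from the listing is simply missing from the table.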

As for PHP's internal PCRE2 version: as I mentioned earlier, I had made sure that it includes that property and can therefore use it, which is why I was puzzled by the error message.

working-name commented 9 months ago

Got you. Yeah, it seems PCRE2 10.42 knows about \p{hex} but 10.34, for example, does not. We might need to update the PCRE2 engine the site uses.

PCRE2 version 10.42 2022-12-11
  re> /\p{hex}+/g
------------------------------------------------------------------
  0   7 Bra
  3     prop Hexdigit ++
  7   7 Ket
 10     End
------------------------------------------------------------------
Capture group count = 0
Subject length lower bound = 1
data> abcdef
 0: abcdef

PCRE2 version 10.34 2019-11-21
  re> /\p{hex}+/g
Failed: error 147 at offset 7: unknown property name after \P or \p
  re>

What I meant to say is that languages sometimes add their own extensions on top of a regex engine, even though they're based on a specific standard. So PHP could implement some oddity around PCRE2 that the official PCRE2 engine does not know about. Not the case here, though.
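
For readers wondering why the 10.42 session above matches all of abcdef: the compiled pattern shows `prop Hexdigit`, which presumably corresponds to Unicode's Hex_Digit property. Python's built-in `re` module has no `\p{...}` support, so the sketch below emulates the property with an explicit character class built from the Hex_Digit ranges in Unicode's PropList.txt. It is an approximation for illustration, not how PCRE2 implements it, and `HEX_DIGIT_RANGES`, `hex_class`, and `hex_runs` are hypothetical names.

```python
import re

# Hex_Digit code point ranges as listed in Unicode's PropList.txt:
# ASCII 0-9, A-F, a-f and their fullwidth counterparts.
HEX_DIGIT_RANGES = [
    (0x0030, 0x0039), (0x0041, 0x0046), (0x0061, 0x0066),
    (0xFF10, 0xFF19), (0xFF21, 0xFF26), (0xFF41, 0xFF46),
]


def hex_class():
    """Build a character class that stands in for \\p{hex}."""
    parts = "".join(f"\\u{lo:04X}-\\u{hi:04X}" for lo, hi in HEX_DIGIT_RANGES)
    return f"[{parts}]"


# Rough equivalent of /\p{hex}+/g from the session above.
HEX_RUN = re.compile(hex_class() + "+")


def hex_runs(subject):
    """Return every maximal run of hex digits in the subject."""
    return HEX_RUN.findall(subject)


print(hex_runs("abcdef"))
```

Every character of abcdef falls in the a-f range, so the whole subject is a single run, consistent with the `0: abcdef` line in the 10.42 output.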

firasdib commented 9 months ago

Thanks, I'll add support for these scripts.