Regex bug: Unicode hex ranges not supported

ProvocaTeach commented 2 years ago

Trying to form Unicode hex ranges in a regular expression causes a LoadError:

julia> r"[\x{00A0}-\x{10FFFD}]"

yields

ERROR: LoadError: PCRE compilation error: range out of order in character class at offset 11
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] compile(pattern::String, options::UInt32)
   @ Base.PCRE ./pcre.jl:155
 [3] compile(regex::Regex)
   @ Base ./regex.jl:82
 [4] Regex(pattern::String, compile_options::UInt32, match_options::UInt32)
   @ Base ./regex.jl:47
 [5] Regex(pattern::String)
   @ Base ./regex.jl:70
 [6] var"@r_str"(__source__::LineNumberNode, __module__::Module, pattern::Any, flags::Vararg{Any})
   @ Base ./regex.jl:119
in expression starting at REPL[45]:1

The result should be a regex that matches all Unicode codepoints from U+00A0 to U+10FFFD. Julia version: 1.7.3

fredrikekre commented 2 years ago

man pcre says the code points have to be valid Unicode, but that range has a bunch of invalid ones:

julia> count(x -> !isvalid(Char(x)), 0x00A0:0x10FFFD)
2048

In any case, if this is indeed a bug it is a bug in PCRE2 and not Julia.
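
For context on the count above, here is a minimal sketch that checks which code points are being counted. The variable names are illustrative and not from the thread; the check itself only uses `isvalid` and `Char` as in the snippet above.

```julia
# Code points in the reported range for which isvalid(Char(x)) is false.
codepoints = 0x00A0:0x10FFFD
invalid = [x for x in codepoints if !isvalid(Char(x))]

# The UTF-16 surrogate block, which cannot be encoded as valid characters.
surrogates = 0xD800:0xDFFF

println(length(invalid))                  # 2048, the count reported above
println(invalid == collect(surrogates))   # true: the invalid ones are the surrogates
```

So the only invalid code points in that range are the surrogates U+D800 to U+DFFF; both endpoints of the range are themselves valid.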

ProvocaTeach commented 2 years ago

Unfortunately, the bug applies to all Unicode ranges, not just ones with invalid characters. Even simply typing

julia> r"[\x{00A0}-\x{00A5}]"

throws a LoadError:

ERROR: LoadError: PCRE compilation error: range out of order in character class at offset 11
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] compile(pattern::String, options::UInt32)
   @ Base.PCRE ./pcre.jl:155
 [3] compile(regex::Regex)
   @ Base ./regex.jl:82
 [4] Regex(pattern::String, compile_options::UInt32, match_options::UInt32)
   @ Base ./regex.jl:47
 [5] Regex(pattern::String)
   @ Base ./regex.jl:70
 [6] var"@r_str"(__source__::LineNumberNode, __module__::Module, pattern::Any, flags::Vararg{Any})
   @ Base ./regex.jl:119
in expression starting at REPL[94]:1

Please reopen this issue?

fredrikekre commented 2 years ago

It is still an error from PCRE, but this one doesn't show up in other environments (e.g. https://regex101.com/) so perhaps there is some compile setting that is different.

inkydragon commented 2 years ago

Workaround: \N{U+XXXX}

The escape sequence \N{U+<hex digits>} is recognized as another way of specifying a Unicode character by code point in a UTF mode. https://www.pcre.org/current/doc/html/pcre2unicode.html

julia> '和'
'和': Unicode U+548C (category Lo: Letter, other)

julia> '平'
'平': Unicode U+5E73 (category Lo: Letter, other)

julia> contains("和", r"[\N{U+548C}-\N{U+5E73}]")
true

julia> contains("平", r"[\N{U+548C}-\N{U+5E73}]")
true

julia> contains("aaa", r"[\N{U+548C}-\N{U+5E73}]")
false

julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
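
As a sketch of how the workaround above might be applied to the range from the original report: the pattern below only swaps the \x{...} escapes for \N{U+...}. It assumes the Julia 1.7.3 defaults discussed in this thread and has not been checked against other versions; the sample strings are illustrative.

```julia
# The range from the original report, rewritten with \N{U+...} escapes.
re = r"[\N{U+00A0}-\N{U+10FFFD}]"

println(contains("é", re))     # U+00E9 lies inside the range
println(contains("和", re))    # U+548C lies inside the range
println(contains("abc", re))   # ASCII only, outside the range
```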

mgkuhn commented 2 years ago

SOLUTION: The command

julia> r"[\x{00A0}-\x{10FFFD}]"

is short for

julia> using Base.PCRE
julia> Regex(raw"[\x{00A0}-\x{10FFFD}]",
             PCRE.UTF | PCRE.MATCH_INVALID_UTF | PCRE.ALT_BSUX | PCRE.UCP,
             PCRE.NO_UTF_CHECK)
ERROR: PCRE compilation error: range out of order in character class at offset 11
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] compile(pattern::String, options::UInt32)
   @ Base.PCRE ./pcre.jl:155
 [3] compile(regex::Regex)
   @ Base ./regex.jl:82
 [4] Regex(pattern::String, compile_options::UInt32, match_options::UInt32)
   @ Base ./regex.jl:47
 [5] top-level scope
   @ REPL[4]:1

In the latter form, we can play with the compile and match option flags that are passed to the PCRE2 library to specify exactly which flavour of regular-expression behaviour we want.

Doing that, I quickly found that dropping the PCRE.ALT_BSUX compile option suppresses this compilation error:

julia> Regex(raw"[\x{00A0}-\x{10FFFD}]",
                    PCRE.UTF | PCRE.MATCH_INVALID_UTF | PCRE.UCP,
                    PCRE.NO_UTF_CHECK)
Regex("[\\x{00A0}-\\x{10FFFD}]",0x040a0000)

Now it is time to actually read the PCRE2 documentation:

man pcre2
man pcre2api

There we do indeed find the answer:

         PCRE2_ALT_BSUX

       This  option  request  alternative  handling of three escape sequences,
       which makes PCRE2's behaviour more like  ECMAscript  (aka  JavaScript).
       When it is set:

       (1) \U matches an upper case "U" character; by default \U causes a com‐
       pile time error (Perl uses \U to upper case subsequent characters).

       (2) \u matches a lower case "u" character unless it is followed by four
       hexadecimal  digits,  in  which case the hexadecimal number defines the
       code point to match. By default, \u causes a compile time  error  (Perl
       uses it to upper case the following character).

       (3)  \x matches a lower case "x" character unless it is followed by two
       hexadecimal digits, in which case the hexadecimal  number  defines  the
       code  point  to  match. By default, as in Perl, a hexadecimal number is
       always expected after \x, but it may have zero, one, or two digits (so,
       for example, \xz matches a binary zero character followed by z).

       ECMAscript 6 added additional functionality to \u. This can be accessed
       using the PCRE2_EXTRA_ALT_BSUX extra option  (see  "Extra  compile  op‐
       tions" below).  Note that this alternative escape handling applies only
       to patterns. Neither of these options affects  the  processing  of  re‐
       placement strings passed to pcre2_substitute().

In other words, Julia asks PCRE2 to implement a slightly more JavaScript-compatible version of regular expressions than the more Perl-compatible flavor it would have given us by default. The man page doesn't say so explicitly, but the way I read it, \x{xxxx} is not part of the ECMAscript syntax: since \x is not followed directly by two hexadecimal digits, it just matches a literal x, and the pattern is therefore identical to x{xxxx}. So you get the same error with

julia> r"[x{00A0}-x{10FFFD}]"
ERROR: LoadError: PCRE compilation error: range out of order in character class at offset 9

And it suddenly all makes sense, because }-x is indeed an out-of-order range.

I guess that choice in favour of ECMAscript syntax for \u, \U and \x deserves to be examined, justified, and documented. (Ideally, I think the Julia manual should contain a self-contained reference of the regular-expression syntax supported.)

So this is clearly not a bug in the PCRE2 C library, but at least an omission in the Julia manual.
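
The effect of the flag can be compared side by side with a small sketch like the one below. `make_regex` is a hypothetical helper introduced only for this illustration, and `Base.PCRE` appears to be an internal module rather than documented API, so treat this as an experiment under the Julia 1.7.3 behaviour described above, not as a recommended interface.

```julia
using Base.PCRE

# Compile options quoted above for r"...", with and without ALT_BSUX.
with_bsux    = PCRE.UTF | PCRE.MATCH_INVALID_UTF | PCRE.ALT_BSUX | PCRE.UCP
without_bsux = PCRE.UTF | PCRE.MATCH_INVALID_UTF | PCRE.UCP

# Hypothetical helper: build a regex with explicit compile options.
make_regex(pattern, opts) = Regex(pattern, opts, PCRE.NO_UTF_CHECK)

pattern = raw"[\x{00A0}-\x{00A5}]"

# Without ALT_BSUX, \x{...} is a code point escape and the range compiles.
perl_style = make_regex(pattern, without_bsux)
println(occursin(perl_style, "¥"))   # U+00A5, the upper end of the range
println(occursin(perl_style, "x"))   # a plain "x" is not in the range

# With ALT_BSUX, the same pattern is rejected at compile time.
try
    make_regex(pattern, with_bsux)
catch err
    println("compile failed: ", err)
end
```

Passing the options explicitly keeps the comparison independent of whatever defaults the r"..." macro uses in a given Julia version.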

mgkuhn commented 2 years ago

Digging through the commit history of where the choice of JavaScript-compatible \x\u\U in Julia regular expressions via PCRE.ALT_BSUX came from:

afa14048a5551fec9ee79f9472a16fb10e260b56 in Jan 2015 replaced PCRE compile option PCRE.JAVASCRIPT_COMPAT with PCRE2 option PCRE.ALT_BSUX while upgrading from PCRE to PCRE2, i.e. this seems to be just adjusting to the new API
7909e3de17f682d934c1d418a98111e357515400 in Mar 2013 added PCRE.JAVASCRIPT_COMPAT to “fix r"\u2220" bug mentioned in #107”

The latter commit was made by @nolta as a “band-aid”.

String literals, macro/raw string literals and the resulting differences in quote and backslash escaping clearly had a rather tortuous history in the evolution of Julia. Note that at no point in issue #107 is there any discussion about whether Julia's flavour of PCRE should be more like Perl or more like JavaScript. The choice of the JavaScript variant just happened to cause one error message in one example to disappear, if I understood that discussion correctly.

They wanted match(r"\u2200", "\u2200") to match, whereas in Perl-compatible regular-expression syntax it would have had to be match(r"\x{2200}", "\u2200"), because in Perl RE, \u means “uppercase the next letter”. Note that in this example, the first \u is interpreted by PCRE2, whereas the second is part of Julia's string literal syntax. They are not the same syntax, but just happen to overlap in this particular example, whereas e.g. a slight variant such as match(r"\U102200", "\U102200") does not match.
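
The overlap and the mismatch described above can be reproduced with explicit compile options, roughly as follows. The helper names `js_style` and `perl_style` are hypothetical and exist only for this sketch; the option constants are the ones quoted earlier in the thread, and the literal-"U" case is inferred from the PCRE2_ALT_BSUX man page text rather than stated in the discussion.

```julia
using Base.PCRE

# Hypothetical helpers: JavaScript-like (ALT_BSUX) vs. Perl-like escape handling.
js_style(p)   = Regex(p, PCRE.UTF | PCRE.ALT_BSUX | PCRE.UCP, PCRE.NO_UTF_CHECK)
perl_style(p) = Regex(p, PCRE.UTF | PCRE.UCP, PCRE.NO_UTF_CHECK)

# \u followed by four hex digits is a code point escape under ALT_BSUX.
println(occursin(js_style(raw"\u2200"), "\u2200"))

# \U is taken as a literal "U" under ALT_BSUX, so the six-digit form does not
# match the character U+102200 but does match the ASCII text "U102200".
println(occursin(js_style(raw"\U102200"), "\U102200"))
println(occursin(js_style(raw"\U102200"), "U102200"))

# The Perl-style spelling uses \x{...} for any code point.
println(occursin(perl_style(raw"\x{2200}"), "\u2200"))
println(occursin(perl_style(raw"\x{102200}"), "\U102200"))
```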

JuliaLang / julia

Regex bug: Unicode hex ranges not supported #46137