Open ProvocaTeach opened 2 years ago
man pcre says it has to be valid Unicode code points, but that range has a bunch of invalid ones:
julia> count(x -> !isvalid(Char(x)), 0x00A0:0x10FFFD)
2048
In any case, if this is indeed a bug it is a bug in PCRE2 and not Julia.
Unfortunately, the bug applies to all Unicode ranges, not just ones with invalid characters. Even simply typing
julia> r"[\x{00A0}-\x{00A5}]"
throws a LoadError:
ERROR: LoadError: PCRE compilation error: range out of order in character class at offset 11
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:33
[2] compile(pattern::String, options::UInt32)
@ Base.PCRE ./pcre.jl:155
[3] compile(regex::Regex)
@ Base ./regex.jl:82
[4] Regex(pattern::String, compile_options::UInt32, match_options::UInt32)
@ Base ./regex.jl:47
[5] Regex(pattern::String)
@ Base ./regex.jl:70
[6] var"@r_str"(__source__::LineNumberNode, __module__::Module, pattern::Any, flags::Vararg{Any})
@ Base ./regex.jl:119
in expression starting at REPL[94]:1
Please reopen this issue?
It is still an error from PCRE, but this one doesn't show up in other environments (e.g. https://regex101.com/) so perhaps there is some compile setting that is different.
Workaround: \N{U+XXXX}
The escape sequence \N{U+<hex digits>} is recognized as another way of specifying a Unicode character by code point in a UTF mode. https://www.pcre.org/current/doc/html/pcre2unicode.html
julia> '和'
'和': Unicode U+548C (category Lo: Letter, other)
julia> '平'
'平': Unicode U+5E73 (category Lo: Letter, other)
julia> contains("和", r"[\N{U+548C}-\N{U+5E73}]")
true
julia> contains("平", r"[\N{U+548C}-\N{U+5E73}]")
true
julia> contains("aaa", r"[\N{U+548C}-\N{U+5E73}]")
false
julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
SOLUTION: The command
julia> r"[\x{00A0}-\x{10FFFD}]"
is short for
julia> using Base.PCRE
julia> Regex(raw"[\x{00A0}-\x{10FFFD}]",
PCRE.UTF | PCRE.MATCH_INVALID_UTF | PCRE.ALT_BSUX | PCRE.UCP,
PCRE.NO_UTF_CHECK)
ERROR: PCRE compilation error: range out of order in character class at offset 11
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:33
[2] compile(pattern::String, options::UInt32)
@ Base.PCRE ./pcre.jl:155
[3] compile(regex::Regex)
@ Base ./regex.jl:82
[4] Regex(pattern::String, compile_options::UInt32, match_options::UInt32)
@ Base ./regex.jl:47
[5] top-level scope
@ REPL[4]:1
In the latter form, we can play with the compile and match option flags that are passed to the PCRE2 library to specify what flavour of regular-expression behaviour exactly we want.
Doing that, I quickly found that dropping the PCRE.ALT_BSUX compile option suppresses this compilation error:
julia> Regex(raw"[\x{00A0}-\x{10FFFD}]",
PCRE.UTF | PCRE.MATCH_INVALID_UTF | PCRE.UCP,
PCRE.NO_UTF_CHECK)
Regex("[\\x{00A0}-\\x{10FFFD}]",0x040a0000)
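If the Perl-style \x{...} syntax is needed in more than one place, the flag combination above could be wrapped in a small helper. This is only a sketch: perl_style_regex is a hypothetical name, not part of Base, and it assumes the flag set shown above (Julia's defaults in 1.7.3 minus PCRE.ALT_BSUX).

```julia
using Base.PCRE

# Hypothetical helper: same options as the r"..." macro in Julia 1.7.3,
# but without PCRE.ALT_BSUX, so \x{...} keeps its Perl meaning.
perl_style_regex(pattern::AbstractString) =
    Regex(pattern, PCRE.UTF | PCRE.MATCH_INVALID_UTF | PCRE.UCP, PCRE.NO_UTF_CHECK)

re = perl_style_regex(raw"[\x{00A0}-\x{10FFFD}]")

println(occursin(re, "和"))   # U+548C is inside the range
println(occursin(re, "aaa"))  # ASCII letters are below U+00A0
```

Note the raw"..." literal: it passes the backslashes through to PCRE2 unchanged, just as r"..." does.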
Now it is time to actually read the PCRE2 documentation:
man pcre2
man pcre2api
There we find indeed the answer:
PCRE2_ALT_BSUX
This option request alternative handling of three escape sequences,
which makes PCRE2's behaviour more like ECMAscript (aka JavaScript).
When it is set:
(1) \U matches an upper case "U" character; by default \U causes a com‐
pile time error (Perl uses \U to upper case subsequent characters).
(2) \u matches a lower case "u" character unless it is followed by four
hexadecimal digits, in which case the hexadecimal number defines the
code point to match. By default, \u causes a compile time error (Perl
uses it to upper case the following character).
(3) \x matches a lower case "x" character unless it is followed by two
hexadecimal digits, in which case the hexadecimal number defines the
code point to match. By default, as in Perl, a hexadecimal number is
always expected after \x, but it may have zero, one, or two digits (so,
for example, \xz matches a binary zero character followed by z).
ECMAscript 6 added additional functionality to \u. This can be accessed
using the PCRE2_EXTRA_ALT_BSUX extra option (see "Extra compile op‐
tions" below). Note that this alternative escape handling applies only
to patterns. Neither of these options affects the processing of re‐
placement strings passed to pcre2_substitute().
In other words, Julia asks PCRE2 to implement a slightly more JavaScript-compatible version of regular expressions than the more Perl-compatible flavor it would have given us by default. The man page doesn't explicitly say so, but the way I read it, \x{xxxx} is not part of the ECMAscript syntax that this option enables, and is therefore treated exactly like plain x{xxxx}. So you get the same error with
julia> r"[x{00A0}-x{10FFFD}]"
ERROR: LoadError: PCRE compilation error: range out of order in character class at offset 9
And it suddenly all makes sense, because }-x is indeed an out-of-order range: } is U+007D, which sorts after x at U+0078.
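The ordering can be checked directly. In the sketch below, compiles is a hypothetical helper that calls the Regex constructor (which compiles eagerly, as the stack traces above show) so that the PCRE2 error can be caught at run time rather than at macro-expansion time.

```julia
# '}' is U+007D and 'x' is U+0078, so "}-x" runs from a higher to a lower code point.
println(UInt32('}'), " > ", UInt32('x'))

# Hypothetical helper: true if PCRE2 accepts the pattern, false if compilation fails.
function compiles(pattern::AbstractString)
    try
        Regex(pattern)
        return true
    catch
        return false
    end
end

println(compiles("[}-x]"))                # the out-of-order range on its own
println(compiles("[x-}]"))                # the same endpoints in ascending order
println(compiles("[x{00A0}-x{10FFFD}]"))  # the pattern without backslashes
```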
I guess that choice in favour of ECMAscript syntax for \u, \U and \x deserves to be examined, justified, and documented. (Ideally, I think the Julia manual should contain a self-contained reference of the regular-expression syntax supported.)
So this is clearly not a bug in the PCRE2 C library, but at least an omission in the Julia manual.
Digging through the commit history of where the choice of JavaScript-compatible \x, \u and \U in Julia regular expressions via PCRE.ALT_BSUX came from:
- One commit replaced PCRE.JAVASCRIPT_COMPAT with the PCRE2 option PCRE.ALT_BSUX while upgrading from PCRE to PCRE2, i.e. this seems to be just adjusting to the new API.
- An earlier commit added PCRE.JAVASCRIPT_COMPAT to “fix r"\u2220" bug mentioned in #107”.
The latter commit was made by @nolta as a “band-aid”.
String literals, macro/raw string literals and the resulting differences in quote and backslash escaping clearly had a rather tortuous history in the evolution of Julia. Note that at no point in issue #107 is there any discussion about whether Julia's flavour of PCRE should be more like Perl or more like JavaScript. The choice of the JavaScript variant just happened to cause one error message in one example to disappear, if I understood that discussion correctly.
They wanted match(r"\u2200", "\u2200") to match, whereas in Perl-compatible regular-expression syntax it would have had to be match(r"\x{2200}", "\u2200") because in Perl RE, \u means “uppercase the next letter”. Note that in this example, the first \u is interpreted by PCRE2, whereas the second is part of Julia's string literal syntax. They are not the same syntax, but just happen to overlap in this particular example, whereas e.g. a slight variant such as match(r"\U102200", "\U102200") does not match.
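The difference between the two \u and \U syntaxes can be made visible with a short sketch. To avoid depending on whatever defaults a given Julia version uses, it passes the option flags quoted earlier in this thread explicitly; bsux_regex is a hypothetical helper name.

```julia
using Base.PCRE

# Hypothetical helper: compile with PCRE.ALT_BSUX set explicitly,
# using the same flags the r"..." macro is shown to use above.
bsux_regex(pattern::AbstractString) =
    Regex(pattern,
          PCRE.UTF | PCRE.MATCH_INVALID_UTF | PCRE.ALT_BSUX | PCRE.UCP,
          PCRE.NO_UTF_CHECK)

# \u plus four hex digits: PCRE2 decodes the code point itself, so this
# agrees with Julia's "\u2200" string escape.
println(occursin(bsux_regex(raw"\u2200"), "\u2200"))

# \U matches a literal "U" under ALT_BSUX, so the pattern is the plain
# text "U102200" and not the code point U+102200.
println(occursin(bsux_regex(raw"\U102200"), "\U102200"))
println(occursin(bsux_regex(raw"\U102200"), "U102200"))
```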
Trying to form Unicode hex ranges in a regular expression causes a LoadError.
The result should be a regex that matches all Unicode code points from U+00A0 to U+10FFFD.
Julia version: 1.7.3