PCRE2Project / pcre2

PCRE2 development is now based here.

Support W3C xpath spec. weird newline behavior regarding to DOTALL and MULTILINE #409

cohomology closed 6 months ago

cohomology commented 6 months ago

The XPATH and XSD specifications contain a regular expression grammar. It is difficult but possible to parse those expressions and transform them into PCRE syntax (they have some features that PCRE lacks, for example character class subtraction). We have a C++ library which does exactly that. To match the regular expressions we use PCRE2 at runtime.

XPATH provides the matches() and replace() functions to apply those expressions. Both also take a "flags" argument.

See https://www.w3.org/TR/xpath-functions-30/#flags

It is stated there:

s: If present, the match operates in "dot-all" mode. (Perl calls this the single-line mode.) If the s flag is not specified, the meta-character . matches any character except a newline (#x0A) or carriage return (#x0D) character. ...

m: If present, the match operates in multi-line mode ... newline here means the character #x0A only.

So, as I understand it, in XPATH "." by default matches everything except carriage return and line feed, but in multi-line mode "$" would not match before CR.

So the W3C uses different sets of "newlines" depending on whether dot-all or multi-line behaviour is concerned. For "." they count both carriage return and line feed; for multi-line mode they count only line feed.

Do you have a clue why the W3C made this choice? Is it possible to implement two different sets of newlines for PCRE2_DOTALL and PCRE2_MULTILINE?

I would also contribute some code, but wanted to know what you think about this beforehand.

W3C seems important enough to think about including their behavior in PCRE2 ;-).

PhilipHazel commented 6 months ago

PCRE2 already supports different "newlines", but the choice does not change when PCRE2_MULTILINE is set. You can choose between CR (only), LF (only), CR+LF (i.e. two characters), any of the previous, any Unicode newline sequence, or NUL. A default can be set when PCRE2 is built, but this can be overridden by a function call, and this in turn can be overridden within the pattern string. If you were to set ANYCRLF as the newline, it would almost agree with your "not s" mode, except that a CR followed by a LF would count as just one newline, not two. It sounds as if you have full control over the regex. In that case, when you are going to set the "m" option, you could also set LF as the only newline. So my suggestion is:

Default: start the pattern with (*ANYCRLF) which will give you correct "." behaviour, that is, "." will not match CR or LF.

If the "s" option is wanted, start the pattern with (?s) and "." will match any character.

If the "m" option is wanted, start the pattern with (*LF)(?m) and "." will match any character except LF.

If both options are wanted, start with (*LF)(?ms).

That seems to me to give you the wanted behaviour, except that in the default case CRLF counts as just one newline. Making PCRE2 recognize either CR or LF as a newline, but treat CRLF as two newlines would require a new newline mode.
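The four cases above can be sketched as a small lookup. This is only an illustration in Python (the thread itself contains no code): the function names `suggested_prefix` and `with_prefix` are hypothetical, and the returned strings are exactly the pattern prefixes listed above.

```python
def suggested_prefix(flags: str) -> str:
    """Return the pattern prefix suggested above for a set of XPath flags.

    Only the "s" and "m" flags are considered; other flags are ignored here.
    """
    dot_all = "s" in flags
    multi_line = "m" in flags
    if multi_line and dot_all:
        return "(*LF)(?ms)"
    if multi_line:
        return "(*LF)(?m)"
    if dot_all:
        return "(?s)"
    return "(*ANYCRLF)"


def with_prefix(pattern: str, flags: str) -> str:
    # The newline verb has to come at the very start of the pattern,
    # so the prefix is always placed before the translated regex.
    return suggested_prefix(flags) + pattern


print(with_prefix("^a.b$", "m"))
```

As the comment itself notes, this mapping leaves CRLF counted as a single newline in the default case.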

cohomology commented 6 months ago

The problem with your approach is that in "m" mode without "s" you would set (*LF)(?m), so a single dot will match CR, which it shouldn't according to the standard. The problem you mentioned at the end is also present.

My colleague suggested:

Always set (*LF)

Transform the regex so that "." is never generated; instead:

no "s" mode: dot is transformed to [^\r\n]

in "s" mode: dot is transformed to [\s\S]

Would that work?
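A minimal sketch of that dot rewrite, written in Python purely for illustration. The function name `replace_dots` is hypothetical, and Python's `re` module is used only as a stand-in for PCRE2 with (*LF), on the grounds that `re` also treats LF as the only line terminator. The scanner is deliberately simple: it skips escaped characters and dots inside a character class, and ignores edge cases such as a "]" placed first in a class.

```python
import re


def replace_dots(pattern: str, dotall: bool) -> str:
    """Rewrite every unescaped "." outside a character class."""
    replacement = r"[\s\S]" if dotall else r"[^\r\n]"
    out = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Copy an escape sequence such as \. unchanged.
            out.append(pattern[i:i + 2])
            i += 2
            continue
        if ch == "[":
            in_class = True
        elif ch == "]":
            in_class = False
        if ch == "." and not in_class:
            out.append(replacement)
        else:
            out.append(ch)
        i += 1
    return "".join(out)


pattern = replace_dots("a.b", dotall=False)
print(pattern)
print(re.fullmatch(pattern, "a\rb"))             # None: the rewritten dot rejects CR
print(re.search(r"a$", "a\r\nb", re.MULTILINE))  # None: "$" is not recognized before CR
```

With this rewrite the dot behaviour no longer depends on the newline setting, which is what lets "." and "$" use different sets of characters.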

PhilipHazel commented 6 months ago

It's somewhat inconsistent to have "." not match CR or LF while at the same time only recognizing LF as newline. However, I think your approach would work, though for "s" mode you could just set PCRE2_DOTALL (or (?s)) which would be more efficient.

cohomology commented 6 months ago

Yeah, thanks!

I really want to know why W3C decided to do it that way. The XML people should be very clever, shouldn't they? Do you have a clue?

The replacement syntax (i.e. substitute) is even more difficult to get used to. They don't accept ${num}, only $num, and if there are only 22 groups, then $223 will be equivalent to PCRE's ${22}3. Replacing by ${2}23 is impossible in this case in XPATH.
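To make the $223 example concrete, here is a hedged Python sketch of how such references could be rewritten into the ${num} form. The function name `xpath_to_pcre2_replacement` is hypothetical. The digit-stripping rule (drop trailing digits until the number fits the group count) is inferred from the example above rather than taken from the spec text. Escapes such as `\$`, and references to groups that do not exist, are not handled.

```python
DIGITS = "0123456789"


def xpath_to_pcre2_replacement(replacement: str, group_count: int) -> str:
    """Rewrite XPATH-style $N references as ${N} plus literal digits."""
    out = []
    i = 0
    n = len(replacement)
    while i < n:
        ch = replacement[i]
        if ch == "$" and i + 1 < n and replacement[i + 1] in DIGITS:
            # Collect the longest run of digits after the dollar sign.
            j = i + 1
            while j < n and replacement[j] in DIGITS:
                j += 1
            digits = replacement[i + 1:j]
            # Assumed rule: trailing digits become literal text until the
            # remaining number is a valid group (or one digit is left).
            keep = len(digits)
            while keep > 1 and int(digits[:keep]) > group_count:
                keep -= 1
            out.append("${" + digits[:keep] + "}" + digits[keep:])
            i = j
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(xpath_to_pcre2_replacement("$223", 22))  # ${22}3
print(xpath_to_pcre2_replacement("$223", 2))   # ${2}23
```

The second call shows why ${2}23 cannot be expressed once the pattern has 22 groups: the same input then always resolves to group 22.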

PhilipHazel commented 6 months ago

Who knows? Reading the doc suggests to me that they thought about "^" and "$" completely separately from "." whereas PCRE ties them all to the concept of a logical "newline". The replacement rules seem totally weird.

cohomology commented 6 months ago

Thanks!