PCRE2Project / pcre2

Quantifier `a{,7}` not supported #422

Closed: david-wahlstedt closed this issue 5 months ago

david-wahlstedt commented 5 months ago

I have tried various examples of patterns containing braces and quantifiers, trying to figure out what is legal and what is not. I noticed that the a{,7} variant is taken as a literal string and just matches itself. But according to the man page, it should match from zero to seven a's.

Here are some examples I tried, and how pcre2test behaves with them (PCRE2 version 10.39 2021-10-29, Linux x64, Ubuntu 22.04):

  |---------+-----------|
  | pattern | match     |
  |---------+-----------|
  | a{7,}   | aaaaaaa   |
  | a{,7}   | literally |
  |---------+-----------|
  | a{b,c}  | literally |
  | }a{5},{ | }aaaaa,{  |
  | ,{,     | literally |
  | a{7, 8} | literally |
  | a{,}    | literally |
  |---------+-----------|

Should fail (and does): {|{5}

 pcre2test
 PCRE2 version 10.39 2021-10-29
 /{|{5}/debug
 Failed: error 109 at offset 4: quantifier does not follow a repeatable item

Should pass (no syntax error, but I can't match anything with it): {\|{5}. Shouldn't this match literally? Is it a bug?

PCRE2 version 10.39 2021-10-29
/{\|{5}/debug
------------------------------------------------------------------
  0   9 Bra
  3     {
  5     |{5}
  9   9 Ket
 12     End
------------------------------------------------------------------
Capture group count = 0
First code unit = '{'
Last code unit = '|'
Subject length lower bound = 6
{|{5}
No match
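One reading of the listing above, offered as a sketch rather than an authoritative answer: the pattern compiles to a literal `{` followed by `|` repeated exactly five times, which is consistent with the reported subject length lower bound of 6. Under that reading the subject `{|{5}` cannot match, while `{|||||` should. The snippet below uses Python's `re` module purely as a stand-in (an assumption: it is not PCRE2), and the opening brace is escaped so the example does not depend on how `re` handles a bare `{`. The helper name `matches` is hypothetical.

```python
import re

# Stand-in for the PCRE2 pattern /{\|{5}/: a literal "{" followed by
# exactly five "|" characters. Python's re is used only for illustration.
PATTERN = re.compile(r"\{\|{5}")


def matches(subject: str) -> bool:
    """Return True if the pattern matches anywhere in the subject."""
    return PATTERN.search(subject) is not None


if __name__ == "__main__":
    print(matches("{|{5}"))   # the subject tried in the report
    print(matches("{|||||"))  # "{" followed by five "|"
```

If this reading is right, the quantifier `{5}` applies to the escaped `|`, so nothing in the pattern is left to match the characters `{5}` literally.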

Now my question, on top of the fact that a{,7} should be supported according to the man page, concerns patterns with braces in general: when parsing them as quantifiers fails, they are sometimes treated as a literal string and sometimes give an error.

I am trying to find an accurate description of the syntax, but I haven't found one. The reason is that I am trying to write a parser for (a decent fragment of) PCRE2 in Haskell, and I want to base the parser on a grammar description, preferably some form of BNF. I have found one grammar for ANTLRv4, namely https://github.com/bkiers/pcre-parser. It is an approximation and a work in progress, but it supports most of the syntax.

I guess pcre2test should be the reference, since it is the implementation. But a formal grammar would be nice, and I am willing to contribute to one, in one way or another, if it is of interest.

Best regards, David

carenas commented 5 months ago

> I noticed that the a{,7} variant is taken as a literal string and just matches itself. But according to the man page, it should match from zero to seven a's.

Versions before 10.43 treated this as a literal; it was changed in 10.43 to match Perl (see #298).

The documentation includes pcre2pattern and pcre2syntax as guides, but ultimately the definition of what is to be expected (especially in edge cases) comes from Perl.
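The difference between the two behaviours can be sketched as a small recognizer. This is a simplified model under stated assumptions, not PCRE2's actual parser: the names `parse_quantifier` and `allow_missing_min` are hypothetical, whitespace inside the braces is not modelled, out-of-order bounds are not checked, and `{,}` is assumed to stay a literal under both rule sets. A return value of `None` means "not a quantifier, treat the text literally".

```python
import re
from typing import Optional, Tuple

# Simplified model only; not taken from the PCRE2 source.
_EXACT = re.compile(r"\{(\d+)\}")         # {n}
_RANGE = re.compile(r"\{(\d*),(\d*)\}")   # {n,m}, {n,}, {,m}, {,}

Bounds = Tuple[int, Optional[int]]  # (minimum, maximum); None = unbounded


def parse_quantifier(text: str, allow_missing_min: bool) -> Optional[Bounds]:
    """Return (min, max) if text is a brace quantifier, else None.

    allow_missing_min=False models the older behaviour described above,
    where {,m} is a literal; True models the Perl-compatible behaviour.
    """
    exact = _EXACT.fullmatch(text)
    if exact:
        n = int(exact.group(1))
        return (n, n)
    rng = _RANGE.fullmatch(text)
    if rng is None:
        return None
    lo, hi = rng.groups()
    if not lo and not hi:
        return None  # {,} assumed literal in both modes
    if not lo:
        # {,m}: only a quantifier when the missing minimum is allowed
        return (0, int(hi)) if allow_missing_min else None
    return (int(lo), int(hi) if hi else None)


if __name__ == "__main__":
    for text in ["{7,}", "{,7}", "{b,c}", "{,}", "{5}"]:
        print(text, parse_quantifier(text, False), parse_quantifier(text, True))
```

A model like this only covers whether a brace sequence is recognized as a quantifier. It says nothing about when a recognized quantifier is an error, such as one that does not follow a repeatable item.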

david-wahlstedt commented 5 months ago

Thanks! OK, I see. What I find most difficult now is telling what should be a literal and what should be an error.

David