PCRE2Project / pcre2

PCRE2 development is now based here.

`pcre2grep -M` with anchored pattern matches not on the whole input - or only once #363

Closed calestyo closed 11 months ago

calestyo commented 11 months ago

Hey.

Not sure if this is a bug or I'm just misunderstanding something: Consider e.g.:

$ printf 'a\nb\nc\na\nb\nc\n' | pcre2grep -M '^a\nb\nc\n$'
a
b
c
$ 

I would have expected -M '^a\nb\nc\n$' either to match only input that consists of exactly a\nb\nc\n, or at least to match this "sequence of lines" repeatedly - but in principle I'd expect the former.

Doing the same with GNU grep (with PCRE compiled in):

$ printf 'a\nb\nc\na\nb\nc\n' | grep -P -z '^a\nb\nc\n$'
$ 

indeed produces no match, however, e.g.:

$ printf 'a\nb\nc\na\nb\nc\n' | grep -P -z '^a\nb\nc\n'
a   <-- coloured
b   <-- coloured
c   <-- coloured
a
b
c

with the first three lines being coloured in order to represent the match.

Any ideas?

Thanks. Chris.

PS: using pcre2-utils 10.42-4 from Debian.

PhilipHazel commented 11 months ago

-M sets the PCRE2_MULTILINE option, which behaves like Perl's /m option. This means that $ can match at internal newlines (and ^ can match after them), which is why you get just the one match, at the start of the string. If you replace $ with \z you get a match at the end of the string. If you also replace ^ with \A you get no match. I will try to add some words to that effect to the documentation for pcre2grep.

calestyo commented 11 months ago

Ah I see.

So this is actually because -M is not like -z with grep, where the latter causes the whole input (if it doesn’t contain any 0x0) to be treated as the matched subject/line, whereas -M just matches across multiple lines.

Well in principle it might be already clear enough in the documentation and the problem was just that PCRE is so overwhelmingly complex ;-)

What IMO could actually help is clarifying the following from the CHARACTERS AND METACHARACTERS:

         ^      assert start of string (or line, in multiline mode)
         $      assert end of string (or line, in multiline mode)

I mean, in retrospect it's clear to me that, because of -M, it's really lines of a string and not the string (as in: the whole input), but perhaps it would help if there was a reference to \A and \Z/\z, indicating that these are most likely what is wanted if one has an input with multiple lines and wants to match against all of them.

But TBH, I'd also be okay if you change nothing and close this issue.

btw: What's the best way to simply match against the whole input (regardless of whether it has multiple lines and/or contains 0x0)? I would now assume it is in fact -M just with \A and \z?

PhilipHazel commented 11 months ago

Thanks for the comment. I will close this issue once I get round to doing some edits on the documentation. Yes, if you use -M and put your regex between \A and \z you might match the whole input, with the caveat that in pcre2grep the input must be short enough to fit within the buffer size (see --buffer-size). However, I say 'might' because experiment shows that it depends on the regex. I naively tried \A.*\z against 'abc\nxyz' and it just matched xyz. This is because the dot (.) doesn't match newlines by default, so the first match failed and pcre2grep, being really a line-based thing, started again at the second line. If you include (?s) in the pattern then you do get the whole string in this example.

calestyo commented 11 months ago

Yes, that with the s option was clear.

Does it at least error out if the buffer size ain't enough?

PhilipHazel commented 11 months ago

I haven't got around to trying, but I expect it would just give no matches.

carenas commented 11 months ago

> Does it at least error out if the buffer size ain't enough?

No, but you are right that we "might" need to report that back to the user somehow, as it would otherwise result in confusion as shown by #250.

FWIW, grep -P and pcre2grep are not equivalent, and one of the main differences is that GNU grep doesn't do multiline at all, and is indeed much slower as a result.

PhilipHazel commented 11 months ago

I have now done some checking. You said: "What IMO could actually help is clarifying the following from the CHARACTERS AND METACHARACTERS". This already exists in the pcre2pattern document in a section entitled "CIRCUMFLEX AND DOLLAR", which follows the BACKSLASH section (admittedly that one is quite long).

I have added some words to the pcre2grep.1 man page. The bottom line is that it is not possible to get pcre2grep to match the entire contents of a file and nothing else. This is because pcre2grep is basically line-based. The multiline feature doesn't change this. Without -M the action is "search line 1, search line 2, search line 3, ..." whereas with -M the action is "search starting with line 1, search starting with line 2, search starting with line 3, ..." and \A matches at the start of each search. By contrast, with -M, \z can match only at the end of the file. So \A...\z matches one or more lines at the end of the file.

Having said that, there is a hack that can make it work. Use the --newline option to set a newline type that your file does not use. For example --newline=NUL. Then pcre2grep will treat your file as consisting of just one line.
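
A minimal sketch of that hack, assuming pcre2grep is installed and the input contains no NUL bytes. Only --newline=NUL comes from the comment above; the wrapper name whole_input_matches, the use of -q to suppress output, and the expectation that \A...\z then has to span the whole input are assumptions of this example.

```shell
# Sketch only: assumes pcre2grep is installed and that the input contains
# no NUL bytes, so the whole input counts as a single "line".
whole_input_matches() {
    if ! command -v pcre2grep >/dev/null 2>&1; then
        echo "pcre2grep not found" >&2
        return 2
    fi
    # -q suppresses output; the exit status says whether there was a match.
    pcre2grep -q --newline=NUL "$1"
}

if printf 'a\nb\nc\n' | whole_input_matches '\Aa\nb\nc\n\z'; then
    echo "whole input matched"
else
    echo "no match (or pcre2grep unavailable)"
fi
```

The buffer-size caveat mentioned earlier in the thread still applies, since the single "line" has to fit in the buffer.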

As far as the buffering goes, at present there is an error if any one line is too long for the buffer, but that is all. I'm not sure what else can be done.