PCRE2Project / pcre2

PCRE2 development is now based here.
Other
910 stars 191 forks source link

New feature: substring matching assertion #442

Closed zherczeg closed 2 months ago

zherczeg commented 2 months ago

There is a feature I have been thinking for some time. I often need to extract a string from an input (e.g. a log file), and needs to find something inside. The patterns to do this are often very complex. It would be great, if an assertion would be available, which performs a match on a substring. Probably the backtracking variant (similar to napla) would be enough.

Example: /(\w+)(*match:(1).*xy)/

It searches words, which contains the xy substring. I know in this simple case, there is no need such a feature, but I wanted to show a simple example.

The name of the assertion would be *match, and its argument is a capturing bracket. The argument follows the same style as argument for conditionals.

From implementation perspective, the original match start/end would be saved before the execution of *match, and a new start/end would be set based on the capturing bracket. Unset capturing brackets fails this assertion before matching anything. If the matching is completed, the original start/end is restored. The start/end is saved when the assertion is executed, so it is set to the same value on backtrack, even if the capturing bracket is changed. Changing start/end affects some constructs, such as \z. This is expected.

Philip, what do you think of this feature? Is it difficult to do it in the interpreter? In JIT I would simply save the original and new start/end onto the stack, and use these values.

PhilipHazel commented 2 months ago

It will only be straightforwardly possible in the pcre2_match() interpreter if there is enough room in the backtracking frame structure to save all that must be saved. I would not want to increase the size of the backtracking frames. I have had a quick look and the answer seems to be "maybe". More study needed to be sure. As for pcre2_dfa_match(), it can't happen because that interpreter does not support capturing. Maybe this is another exciting project for Alex. He was wanting to learn about the interpreter. I will as him.

alexdowad commented 2 months ago

Hmm. Thanks very much to @zherczeg and @PhilipHazel for their comments.

PCRE stands for "Perl Compatible Regular Expressions", so do we want to add features which are not present in Perl? Maybe PCRE has already gone that way for some time?

Aside from that concern, if @PhilipHazel thinks that it is a good idea to support this feature, I am happy to work on the implementation.

PhilipHazel commented 2 months ago

In the past, PCRE has added features that were not in Perl - recursive subroutine calls being a good example - and the later Perl has caught up (sometimes an adjustment had to happen to align with how Perl did things). Also, there are a number of options for non-Perlish things, so I am not really worried about that aspect of it. I have always tried, however, not to be incompatible with Perl, though the pcre2compat man page does list some differences. As far as this feature goes, Zoltan clearly has a need it, so I think it is worth investigating the possibility.

zherczeg commented 2 months ago

Started sketching some code: https://github.com/zherczeg/pcre2/commit/74634f9e1aea18d564512a4a429c8eb5d68a5216

I called it nala to follow the assertion naming. That stands for Non-atomic lookahead. We could use any name though.

Usually a lot of things needs to be updated when a new opcode is added. That is the hard part. Anyway the first step is adding the new construct to the parser. It seems META_COND_NUMBER assumes the opcode is OP_COND, so that needs some changes. That is the next step. I am open to any suggestions, I will likely forget to update many things.

@alexdowad if you want to work on it, that is also ok for me. Feel free to use the code above.

alexdowad commented 2 months ago

Just a thought... do you think the Perl people would have any useful input on the best syntax and semantics for this new feature?

zherczeg commented 2 months ago

You can try it.

PhilipHazel commented 2 months ago

I don't like "nala" as the name; too confusing with the existing "napla" and also it isn't really the same kind of assertion. I think that "group" should somehow get included. A long name might be "groupmatch" and it could be abbreviated to "gm", though that doesn't of itself suggest an assertion. Or how about "capturematch" == "cm"? Or something else......

zherczeg commented 2 months ago

I have already changed it to nasla, as substring lookahead. I would like to emphasize that it is a lookahead assertion, so you can expect the same behaviour, except it works on a substring. Assertions are complex enough with their custom rules.

zherczeg commented 2 months ago
  re> /(\w+)(*nasla:(1)xy)/fullbincode
------------------------------------------------------------------
  0  26 Bra
  3   7 CBra 1
  8     \w+
 10   7 Ket
 13  10 Non-atomic substring look ahead
 16   1 Cond ref
 19     xy
 23  10 Ket
 26  26 Ket
 29     End
------------------------------------------------------------------
data>
  re> /(?<AA>\w+)(*nasla:(<AA>)xy)/fullbincode
------------------------------------------------------------------
  0  26 Bra
  3   7 CBra 1
  8     \w+
 10   7 Ket
 13  10 Non-atomic substring look ahead
 16   1 Cond ref
 19     xy
 23  10 Ket
 26  26 Ket
 29     End
------------------------------------------------------------------

Unfortunately I know little about the parser, so the code might have full of bugs. I tried to follow the parsing of conditionals. https://github.com/zherczeg/pcre2/tree/nala

I am still unsure about the name. I don't like match (which refers to exact match), or search (which refers to finding something). I would prefer it to work like an assertion, since it restores the string pointer like other assertions, and it does not introduce a new behaviour. Regexes are complex enough.

PhilipHazel commented 2 months ago

I'm happy with "substring" (similar to my "capture") but I still don't like "lookahead" because it isn't a lookahead. How about nass (non-atomic-substring-scan)?

On Fri, 23 Aug 2024 at 10:28, Zoltan Herczeg @.***> wrote:

I have already changed it to nasla, as substring lookahead. I would like to emphasize that it is a lookahead assertion, so you can expect the same behaviour, except it works on a substring. Assertions are complex enough with their custom rules.

— Reply to this email directly, view it on GitHub https://github.com/PCRE2Project/pcre2/issues/442#issuecomment-2306684178, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG4QUACU5BSJDRULZIO3Q7TZS36FPAVCNFSM6AAAAABM3UEXBKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMBWGY4DIMJXHA . You are receiving this because you were mentioned.Message ID: @.***>

PhilipHazel commented 2 months ago

The parsing output you show looks reasonable, and I guess the use of "cond ref" is ok because it's a reference like a condition, though it isn't a condition. But is it worth inventing a new opcode just to make that tidy? I'm not sure.

zherczeg commented 2 months ago

I just realized that the C in OP_CREF means "cond" not "capture". Because in OP_RREF there is no C. I always thought these are independent from conditionals, and they can be used elsewhere in the future. They simply represent arguments, for me at least. Could we simply change the debug info to "Capture ref"?

I am ok with nass. If somebody wants an atomic version in the future, would it be 'ss'? What about 'nascan' and scan? Or 'scan' for the non-atomic, and ascan for the atomic version. I realized we don't need to follow the naming convention of la assertions. From technical perspective deciding a name is not too interesting, but it cannot be changed, so it is still crucial.

PhilipHazel commented 2 months ago

"Capture ref" is a good thought, yes, let's go with that. Choosing names for things is always tedious. "The naming of cats is a difficult matter" wrote T.S. Eliot. Hmm. Using "ass" for "atomic substring scan" is probably not a good idea, but in practice, would we ever implement it when writing (*nass:(1)(?>...)) would provide the facility? Actually, the more I think about it, I'm not sure we need anything about atomic in the name. It really is just "substring scan" isn't it? Or "scan substring", which could be abbreviated to "scansub".

zherczeg commented 2 months ago

For me scansub is a bit long. What about scang as scan group, or scans as scan substring?

PhilipHazel commented 2 months ago

All the special (*things) have long and short names, so scansub could be scs or scangroup could be scg.

On Fri, 23 Aug 2024, 17:55 Zoltan Herczeg, @.***> wrote:

For me scansub is a bit long. What about scang as scan group, or scans as scan substring?

— Reply to this email directly, view it on GitHub https://github.com/PCRE2Project/pcre2/issues/442#issuecomment-2307463965, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG4QUAABSSTFV7EP4IWH3X3ZS5SOZAVCNFSM6AAAAABM3UEXBKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMBXGQ3DGOJWGU . You are receiving this because you were mentioned.Message ID: @.***>

zherczeg commented 2 months ago

Ok. Which one then?

PhilipHazel commented 2 months ago

In keeping with the style of non_atomic_positive_lookahead how about scan_substring with synonm *scs?

zherczeg commented 2 months ago

I like this suggestion. I will change the name, then. Hopefully the last time :)

zherczeg commented 2 months ago

@PhilipHazel I made some progress on this work, and opened a pull request for it: #445

Unfortunately I know little about the parser/interpreter, so I likely missed to update many necessary tables / switches. I plan to do the jit support in another patch, and I would be grateful if you could help me with the documentation.

PhilipHazel commented 2 months ago

I will do the documentation, and I hope to find time tomorrow to look at this.

On Mon, 26 Aug 2024, 13:38 Zoltan Herczeg, @.***> wrote:

@PhilipHazel https://github.com/PhilipHazel I made some progress on this work, and opened a pull request for it: #445 https://github.com/PCRE2Project/pcre2/pull/445

Unfortunately I know little about the parser/interpreter, so I likely missed to update many necessary tables / switches. I plan to do the jit support in another patch, and I would be grateful if you could help me with the documentation.

— Reply to this email directly, view it on GitHub https://github.com/PCRE2Project/pcre2/issues/442#issuecomment-2310113101, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG4QUACQAYHQ2OWIYP54YFDZTMOU7AVCNFSM6AAAAABM3UEXBKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMJQGEYTGMJQGE . You are receiving this because you were mentioned.Message ID: @.***>

zherczeg commented 2 months ago

I found an interesting test case:

jit (correct)

  re> /([a-z]{2})[a-z](*scs:(1)(.*?))\2$/
data> abcabc
 0: bcabc
 1: bc
 2: bc

interpreter (incorrect)

  re> /([a-z]{2})[a-z](*scs:(1)(.*?))\2$/
data> abcabc
 0: abc
 1: ab
 2:

I suspect the problem is here: https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_match.c#L6163

This code sets the correct end, but never set it back when the engine backtracks into the scan substring assertion. Probably we should do another RMATCH here, and RRETURN(rrc) whatever happens. We might be able to preserve the N block, or need to allocate a new one (if allocation size can be a problem). If you have any idea, please let me know.