Closed zherczeg closed 2 months ago
It will only be straightforwardly possible in the pcre2_match() interpreter if there is enough room in the backtracking frame structure to save all that must be saved. I would not want to increase the size of the backtracking frames. I have had a quick look and the answer seems to be "maybe". More study needed to be sure. As for pcre2_dfa_match(), it can't happen because that interpreter does not support capturing. Maybe this is another exciting project for Alex. He was wanting to learn about the interpreter. I will as him.
Hmm. Thanks very much to @zherczeg and @PhilipHazel for their comments.
PCRE stands for "Perl Compatible Regular Expressions", so do we want to add features which are not present in Perl? Maybe PCRE has already gone that way for some time?
Aside from that concern, if @PhilipHazel thinks that it is a good idea to support this feature, I am happy to work on the implementation.
In the past, PCRE has added features that were not in Perl - recursive subroutine calls being a good example - and the later Perl has caught up (sometimes an adjustment had to happen to align with how Perl did things). Also, there are a number of options for non-Perlish things, so I am not really worried about that aspect of it. I have always tried, however, not to be incompatible with Perl, though the pcre2compat man page does list some differences. As far as this feature goes, Zoltan clearly has a need it, so I think it is worth investigating the possibility.
Started sketching some code: https://github.com/zherczeg/pcre2/commit/74634f9e1aea18d564512a4a429c8eb5d68a5216
I called it nala
to follow the assertion naming. That stands for Non-atomic lookahead
. We could use any name though.
Usually a lot of things needs to be updated when a new opcode is added. That is the hard part. Anyway the first step is adding the new construct to the parser. It seems META_COND_NUMBER
assumes the opcode is OP_COND, so that needs some changes. That is the next step. I am open to any suggestions, I will likely forget to update many things.
@alexdowad if you want to work on it, that is also ok for me. Feel free to use the code above.
Just a thought... do you think the Perl people would have any useful input on the best syntax and semantics for this new feature?
You can try it.
I don't like "nala" as the name; too confusing with the existing "napla" and also it isn't really the same kind of assertion. I think that "group" should somehow get included. A long name might be "groupmatch" and it could be abbreviated to "gm", though that doesn't of itself suggest an assertion. Or how about "capturematch" == "cm"? Or something else......
I have already changed it to nasla
, as substring lookahead. I would like to emphasize that it is a lookahead assertion, so you can expect the same behaviour, except it works on a substring. Assertions are complex enough with their custom rules.
re> /(\w+)(*nasla:(1)xy)/fullbincode
------------------------------------------------------------------
0 26 Bra
3 7 CBra 1
8 \w+
10 7 Ket
13 10 Non-atomic substring look ahead
16 1 Cond ref
19 xy
23 10 Ket
26 26 Ket
29 End
------------------------------------------------------------------
data>
re> /(?<AA>\w+)(*nasla:(<AA>)xy)/fullbincode
------------------------------------------------------------------
0 26 Bra
3 7 CBra 1
8 \w+
10 7 Ket
13 10 Non-atomic substring look ahead
16 1 Cond ref
19 xy
23 10 Ket
26 26 Ket
29 End
------------------------------------------------------------------
Unfortunately I know little about the parser, so the code might have full of bugs. I tried to follow the parsing of conditionals. https://github.com/zherczeg/pcre2/tree/nala
I am still unsure about the name. I don't like match (which refers to exact match), or search (which refers to finding something). I would prefer it to work like an assertion, since it restores the string pointer like other assertions, and it does not introduce a new behaviour. Regexes are complex enough.
I'm happy with "substring" (similar to my "capture") but I still don't like "lookahead" because it isn't a lookahead. How about nass (non-atomic-substring-scan)?
On Fri, 23 Aug 2024 at 10:28, Zoltan Herczeg @.***> wrote:
I have already changed it to nasla, as substring lookahead. I would like to emphasize that it is a lookahead assertion, so you can expect the same behaviour, except it works on a substring. Assertions are complex enough with their custom rules.
— Reply to this email directly, view it on GitHub https://github.com/PCRE2Project/pcre2/issues/442#issuecomment-2306684178, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG4QUACU5BSJDRULZIO3Q7TZS36FPAVCNFSM6AAAAABM3UEXBKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMBWGY4DIMJXHA . You are receiving this because you were mentioned.Message ID: @.***>
The parsing output you show looks reasonable, and I guess the use of "cond ref" is ok because it's a reference like a condition, though it isn't a condition. But is it worth inventing a new opcode just to make that tidy? I'm not sure.
I just realized that the C in OP_CREF means "cond" not "capture". Because in OP_RREF there is no C. I always thought these are independent from conditionals, and they can be used elsewhere in the future. They simply represent arguments, for me at least. Could we simply change the debug info to "Capture ref"?
I am ok with nass
. If somebody wants an atomic version in the future, would it be 'ss'? What about 'nascan' and scan
? Or 'scan' for the non-atomic, and ascan
for the atomic version. I realized we don't need to follow the naming convention of la
assertions. From technical perspective deciding a name is not too interesting, but it cannot be changed, so it is still crucial.
"Capture ref" is a good thought, yes, let's go with that. Choosing names for things is always tedious. "The naming of cats is a difficult matter" wrote T.S. Eliot. Hmm. Using "ass" for "atomic substring scan" is probably not a good idea, but in practice, would we ever implement it when writing (*nass:(1)(?>...)) would provide the facility? Actually, the more I think about it, I'm not sure we need anything about atomic in the name. It really is just "substring scan" isn't it? Or "scan substring", which could be abbreviated to "scansub".
For me scansub
is a bit long. What about scang
as scan group, or scans
as scan substring?
All the special (*things) have long and short names, so scansub could be scs or scangroup could be scg.
On Fri, 23 Aug 2024, 17:55 Zoltan Herczeg, @.***> wrote:
For me scansub is a bit long. What about scang as scan group, or scans as scan substring?
— Reply to this email directly, view it on GitHub https://github.com/PCRE2Project/pcre2/issues/442#issuecomment-2307463965, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG4QUAABSSTFV7EP4IWH3X3ZS5SOZAVCNFSM6AAAAABM3UEXBKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMBXGQ3DGOJWGU . You are receiving this because you were mentioned.Message ID: @.***>
Ok. Which one then?
In keeping with the style of non_atomic_positive_lookahead how about scan_substring with synonm *scs?
I like this suggestion. I will change the name, then. Hopefully the last time :)
@PhilipHazel I made some progress on this work, and opened a pull request for it: #445
Unfortunately I know little about the parser/interpreter, so I likely missed to update many necessary tables / switches. I plan to do the jit support in another patch, and I would be grateful if you could help me with the documentation.
I will do the documentation, and I hope to find time tomorrow to look at this.
On Mon, 26 Aug 2024, 13:38 Zoltan Herczeg, @.***> wrote:
@PhilipHazel https://github.com/PhilipHazel I made some progress on this work, and opened a pull request for it: #445 https://github.com/PCRE2Project/pcre2/pull/445
Unfortunately I know little about the parser/interpreter, so I likely missed to update many necessary tables / switches. I plan to do the jit support in another patch, and I would be grateful if you could help me with the documentation.
— Reply to this email directly, view it on GitHub https://github.com/PCRE2Project/pcre2/issues/442#issuecomment-2310113101, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG4QUACQAYHQ2OWIYP54YFDZTMOU7AVCNFSM6AAAAABM3UEXBKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMJQGEYTGMJQGE . You are receiving this because you were mentioned.Message ID: @.***>
I found an interesting test case:
jit (correct)
re> /([a-z]{2})[a-z](*scs:(1)(.*?))\2$/
data> abcabc
0: bcabc
1: bc
2: bc
interpreter (incorrect)
re> /([a-z]{2})[a-z](*scs:(1)(.*?))\2$/
data> abcabc
0: abc
1: ab
2:
I suspect the problem is here: https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_match.c#L6163
This code sets the correct end, but never set it back when the engine backtracks into the scan substring assertion.
Probably we should do another RMATCH
here, and RRETURN(rrc) whatever happens. We might be able to preserve the N block, or need to allocate a new one (if allocation size can be a problem). If you have any idea, please let me know.
There is a feature I have been thinking for some time. I often need to extract a string from an input (e.g. a log file), and needs to find something inside. The patterns to do this are often very complex. It would be great, if an assertion would be available, which performs a match on a substring. Probably the backtracking variant (similar to napla) would be enough.
Example:
/(\w+)(*match:(1).*xy)/
It searches words, which contains the
xy
substring. I know in this simple case, there is no need such a feature, but I wanted to show a simple example.The name of the assertion would be
*match
, and its argument is a capturing bracket. The argument follows the same style as argument for conditionals.From implementation perspective, the original match start/end would be saved before the execution of
*match
, and a new start/end would be set based on the capturing bracket. Unset capturing brackets fails this assertion before matching anything. If the matching is completed, the original start/end is restored. The start/end is saved when the assertion is executed, so it is set to the same value on backtrack, even if the capturing bracket is changed. Changing start/end affects some constructs, such as\z
. This is expected.Philip, what do you think of this feature? Is it difficult to do it in the interpreter? In JIT I would simply save the original and new start/end onto the stack, and use these values.