PCRE2Project / pcre2

PCRE2 development is now based here.
Other
919 stars 191 forks source link

Support a new experimental feature called scan substring #445

Closed zherczeg closed 3 months ago

zherczeg commented 3 months ago

This patch contains the core code of a new experimental feature called scan substring. It allows anchored matching the content of a capture group to a subpattern.

The parser (compiler) and interpreter is updated to support it. Dfa interpreter will not support it. JIT will support it in the future. Documenation is missing, I hope @PhilipHazel can help me with that. Several tests were added.

zherczeg commented 3 months ago

I have noticed that \z does not work, because it uses true_end_subject. Interestingly, \Z uses end_subject instead, and it seems jit does not support this at all.

zherczeg commented 3 months ago

Some comments besides the \z.

What is better: scan_substring or scan_string? I am not sure we need to emphasize the sub part, it does not add much value in this particular case.

The OP_CREF comment (Cond Ref -> Capture Ref) will be done in a different patch. I suspect it will affect a lot of lines, and I don't want to complicate this work with that.

PhilipHazel commented 3 months ago

I prefer scan_substring because that is more precise.

On Mon, 26 Aug 2024, 18:55 Zoltan Herczeg, @.***> wrote:

Some comments besides the \z.

What is better: scan_substring or scan_string? I am not sure we need to emphasize the sub part, it does not add much value in this particular case.

The OP_CREF comment (Cond Ref -> Capture Ref) will be done in a different patch. I suspect it will affect a lot of lines, and I don't want to complicate this work with that.

— Reply to this email directly, view it on GitHub https://github.com/PCRE2Project/pcre2/pull/445#issuecomment-2310754261, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG4QUADDTYZOYCCMH7WQWLDZTNTZZAVCNFSM6AAAAABND5ILV6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMJQG42TIMRWGE . You are receiving this because you were mentioned.Message ID: @.***>

zherczeg commented 3 months ago

We could talk about the syntax as well. We could use: (*scs:1:...) - this requires one less character, but looks unusal in regex world (*scs(1)...) - this requires one less character, but could looks confusing

I prefer the current one, but it is worth to discuss the alternatives.

PhilipHazel commented 3 months ago

I prefer parentheses.

On Mon, 26 Aug 2024, 19:00 Zoltan Herczeg, @.***> wrote:

We could talk about the syntax as well. We could use: (scs:1:...) - this requires one less character, but looks unusal in regex world (scs(1)...) - this requires one less character, but could looks confusing

I prefer the current one, but it is worth to discuss the alternatives.

— Reply to this email directly, view it on GitHub https://github.com/PCRE2Project/pcre2/pull/445#issuecomment-2310763262, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG4QUAAGNHFWPWS3T4E4EJTZTNUMRAVCNFSM6AAAAABND5ILV6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMJQG43DGMRWGI . You are receiving this because you were mentioned.Message ID: @.***>

PhilipHazel commented 3 months ago

As you can see, I've done the merge. All looks good and I will work on the documentation - maybe try some tests of my own. There is one thing: around line 685 of pcre2_compile.c you have this:

   {  3, META_SCS_NUMBER        }, /* placeholder, updated later */
  { 14, META_SCS_NUMBER        }, /* placeholder, updated later */

It looks as if that is a typo and that the second line should be NAME, not NUMBER. I see from the later code that it doesn't matter, but I think this should be changed because it is a trap. It would need one extra "case" later on, but I think that's all. I'm happy to make the change if you agree.

zherczeg commented 3 months ago

Actually both can be NAME/NUMBER. The first represents the *scs, and the second represents the *sub_string identifiers. Both work with NAME/NUMBER. The comment wants to suggest this. Probably not the best comment.

PhilipHazel commented 3 months ago

Two things: (1) Something needs to be done about \z vs \Z, as you pointed out. In the documentation for invalid UTF, where the subject is broken into substrings, it does state that these do not match the end of a substring. So I think this is a bug in \Z handling in this case and I will test for this and fix it. However, I do feel that these should match in the scs case, but that would require a "depth of scs" count in order to handle nested scs, but this ought to be straightforward. I will look at that too.

(2) While inventing some extra tests, I have found that an scs assertion doesn't work as a condition. I think it should, and I will work on that as well (should be easy, no need for both of us to be working on the same thing).

Then I think I have enough information to work on the docs!

zherczeg commented 3 months ago

(1) Ok for me. I don't know why there was a difference, I just mentioned it.

(2) What does "condition" means here? Conditional block? It is not an atomic assertion, so it cannot work there.

PhilipHazel commented 3 months ago

Oh, yes, of course, condition assertions must be atomic, but I got an odd error when I tried it. I will check again and make sure the error is useful.

On Tue, 27 Aug 2024 at 19:08, Zoltan Herczeg @.***> wrote:

(1) Ok for me. I don't know why there was a difference, I just mentioned it.

(2) What does "condition" means here? Conditional block? It is not an atomic assertion, so it cannot work there.

— Reply to this email directly, view it on GitHub https://github.com/PCRE2Project/pcre2/pull/445#issuecomment-2313206917, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG4QUAH62VCRKP7H7SJGFFTZTS6B3AVCNFSM6AAAAABND5ILV6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMJTGIYDMOJRG4 . You are receiving this because you modified the open/close state.Message ID: @.***>

PhilipHazel commented 2 months ago

I have committed my patches, which make \z, \Z, and $ match at the end of the captured string. However, this is anomalous because \A and ^ do not match at the start. For uniformity, \z etc should not match (as I think you originally suggested) unless it really is the end of the original subject but I do think it could be useful to match the end of the capture. The start doesn't matter because the match only takes place at the start - it's effectively anchored - so I propose leaving it as it is and documenting this carefully.

I also fixed a missing label in pcre2_study(). I will now start on the documentation.