Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

[EXPERIMENT] variable-length look-behind #18756

Open rjbs opened 3 years ago

rjbs commented 3 years ago

Limited variable-length look-behind was first released in perl v5.30.0 as an experimental feature. This issue tracks its progress toward the end of its experimental phase.

rjbs commented 3 years ago

I have proposed on p5p that we mark this a success.

jkeenan commented 3 years ago

> I have proposed on p5p that we mark this a success.

Do we have any criteria for determining whether an experiment has been successful or not (other than "people haven't complained about it")?

hvds commented 3 years ago

The main issue in my mind is whether the semantics are sane.

If I understand it correctly, when we determine the lookbehind expression has width in the range {m, n}, we attempt to match the whole expression on the anchored substring s[-n:-1]; if that fails, we try again on s[-n+1:-1] and repeat up to s[-m:-1]. That effectively means we prefer the longest match, unlike the rest of the regexp engine.

Thus for example a?? will ignore the requested minimality and ($FOO|$BAR) will prefer the longer rather than the first of the alternates.

This won't usually affect whether something matches, but it can affect captures, and could defeat attempts to optimize a pattern using the normal rules. Not sure what other issues it could cause.

% perl -E 'no warnings qw{experimental}; "abfoo" =~ /(?=foo)(?<=(a??b))/ and say $1'
ab
% perl -E 'no warnings qw{experimental}; "abfoo" =~ /(?=foo)(?<=(b|ab))/ and say $1'
ab
%

(I had a typo in the above first time I tried it, which led to #19168.)

I don't have a solution to offer, it may be that these semantics are ideal, or that they are more nearly ideal than any other. However I don't recall to what extent this was discussed when VLB was first implemented, and it's worth considering before we bless them as final.

hvds commented 2 years ago

@demerphq has now looked at #19168, which showed that VLB currently has a major bug; that led to PR #19442. I think the PR has a good chance of making it for the upcoming release (needs more tests, and more eyes), but this case further convinces me that VLB as a whole is not ready to be marked a success.

I still hope someone will one day comment on the issues mentioned in my previous comment here.

demerphq commented 2 years ago

On Mon, 4 Oct 2021, 06:22 Hugo van der Sanden, @.***> wrote:

> The main issue in my mind is whether the semantics are sane.
>
> If I understand it correctly, when we determine the lookbehind expression has width in the range {m, n}, we attempt to match the whole expression on the anchored substring s[-n:-1]; if that fails, we try again on s[-n+1:-1] and repeat up to s[-m:-1]. That effectively means we prefer the longest match, unlike the rest of the regexp engine.

The rule is "leftmost longest", so this seems fine to me. Am I missing a subtlety here?

The fact that the range is bounded is more problematic to me in the sense that it will cause surprise.

> Thus for example a?? will ignore the requested minimality

But again, the rule is leftmost longest. So we will attempt to match at the leftmost position, we will try to match the empty string, fail because that can't match at that position, then try "a", assuming we fail we will advance the cursor and try again. I don't see a problem here, in theory anyway.

In practice I'm not 100% certain whether (?<=) will match with my patch (I'm writing this on my phone). I suspect it won't, as we require the cursor to line up with where it started after matching. We may have to refine the patch a touch.

> and ($FOO|$BAR) will prefer the longer rather than the first of the alternates.

But again, the rule is leftmost longest. So we will try the leftmost alternation at the leftmost position. So I don't see a problem here.

I get the feeling you are expecting the semantics in a lookbehind to behave as though we are matching right to left, and thus with "mirrored" semantics relative to left to right. That is one of the reasonable possibilities, I suppose, especially if you don't think of the regex engine as simulating a DFA, although it is not the one I would have expected myself, and not the one we have implemented. If you think of this as a DFA, however, it doesn't make sense.

Consider the pattern

/[a-z]+(?<!m+)/

I would expect this to be formally equivalent to

/[a-ln-z]+/

In a DFA construction. The mirror interpretation would be

/[a-z]*[a-ln-z]+/

The length restrictions make it something else yet again.

Frankly lookbehind is full of ambiguity no matter how you slice it. It sounds like it should be well defined in a formal sense when you discuss simple cases, but IMO it is not well defined at the formal level of the mathematics of "regular expressions". Never mind that Perl's regex engine strictly speaking doesn't implement mathematical "regular expressions"; we still strive to be as close as possible and use them as a model to understand what is happening.

> This won't usually affect whether something matches, but it can affect captures, and could defeat attempts to optimize a pattern using the normal rules. Not sure what other issues it could cause.
>
> % perl -E 'no warnings qw{experimental}; "abfoo" =~ /(?=foo)(?<=(a??b))/ and say $1'
> ab
> % perl -E 'no warnings qw{experimental}; "abfoo" =~ /(?=foo)(?<=(b|ab))/ and say $1'
> ab

Both of these match as I would expect based on my understanding of "leftmost longest".

> (I had a typo in the above first time I tried it, which led to #19168.)
>
> I don't have a solution to offer, it may be that these semantics are ideal, or that they are more ideal than any other. However I don't recall to what extent this was discussed when VLB was first implemented, and it's worth considering before we bless them as final.

I think this is not ready to be declared non experimental. Perhaps there are implementation details I am unaware of at this time that would change my mind, but based on what I know right now I have serious doubts.

The length restriction is to me hugely problematic. Consider how the length restriction affects the cases above, and consider something like this:

/\w+(?<!a+b+c+)/

I'd expect that to match a sequence of word characters excluding 'a' followed by a sequence of word characters excluding 'b', followed by a sequence of word characters excluding 'c'. But with the length restriction it will match something quite different. I am pretty sure that with a bit of thought I could come up with a case where it matched something in direct contradiction to the "correct" interpretation.

Possibly we can save the situation and make it an error to put something in a lookbehind that might match more than 256 characters.

Cheers Yves

rjbs commented 2 years ago

Thanks, you two. I have been pursuing this exiting experimental only because it seemed to be not moving and to work. I am not in a rush, and will back off until post-5.36!

demerphq commented 2 years ago

On Mon, 21 Feb 2022, 12:11 Ricardo Signes, @.***> wrote:

> Thanks, you two. I have been pursuing this exiting experimental only because it seemed to be not moving and to work. I am not in a rush, and will back off until post-5.36!

I have had a chance to review this in more detail and I am very pleased to say I was very wrong. Yesterday I could have sworn unlimited quantifiers were allowed in lookbehind, but I guess I tested something else, as I said in the commit message my brain was mush by the end of the day. Every case I expected to be broken seems to be covered. (Yay Karl!) Provided my PR is applied and we add a bunch more tests and no further surprises are revealed I think we will be good for 5.36 contrary to what I said earlier.

I apologize to Karl for doubting him. He did a great job with the max length checks.

I will follow up with more tests and maybe a slight optimization but I think we are good!

Yves

khwilliamson commented 2 years ago

Thanks, but I'm not so sure it's ready to be de-experimentalized.

Are we sure we have the correct semantics? I am now thinking it should be a mirror of the lookahead assertions, starting at 0, then -1, -2, ... That would be the most intuitive, and would mean no real performance penalty for long lookbehinds.

So, some background information.

A few of the world's language scripts have upper/lower case; mostly those derived from ancient Greek. Of the relatively few characters in Unicode that have case, about 10% can match under /i a sequence of characters. In modern Western European languages, this is notably the German ß character whose traditional upper case is the sequence SS.

Perl did not handle this situation very well, and I started fixing various areas where it failed. In doing so, this broke innocent code that was using lookbehind.

Technically, in Unicode password and paßword should match under /i. This means that if you have a lookbehind assertion that matches 'ss', it also should match the single character ß. Hence it is variable length. Before I fixed things, perl simply ignored the ß possibility. But after I did, it would complain about it being variable length.

Rather than moving the language backwards, the obvious solution was to allow some, at least limited, form of variable length lookbehind. As I wrote the patch, I didn't see a clear place as to how to allow things like this, but forbid more general cases. Besides, ISTR people had been complaining about the fixed-length restriction anyway.

I did not invent the 255 byte length limit. I inherited that. That limit has always been the case AFAIK. Almost certainly it stems from a single byte in the C structure for a regnode being available for use. I have not seen any field complaints about that number being too small. But we could create new regnodes which occupy more bytes so as to increase the limit.

What I did was merely change the limit from a fixed size to a maximum size. I was trying to be the least disruptive as I could of quite obtuse code. I think I did find a bug or two along the way in the existing implementation.

I didn't consider at the time what the semantics should be. But now I'm thinking that ideally lookbehind should act the same as lookahead but with the sign of the directionality changed from positive to negative. That might be too hard to achieve, or may turn out to have other drawbacks, but until we think about it and make some determination, we shouldn't de-experimentalize the feature.

demerphq commented 2 years ago

On Tue, 22 Feb 2022 at 15:05, Karl Williamson @.***> wrote:

> I did not invent the 255 byte length limit. I inherited that. That limit has always been the case AFAIK. Almost certainly it stems from a single byte in the C structure for a regnode being available for use. I have not seen any field complaints about that number being too small. But we could create new regnodes which occupy more bytes so as to increase the limit.

Indeed.

> What I did was merely change the limit from a fixed size to a maximum size. I was trying to be the least disruptive as I could of quite obtuse code. I think I did find a bug or two along the way in the existing implementation.

Or five? :-)

> I didn't consider at the time what the semantics should be. But now I'm thinking that ideally lookbehind should act the same as lookahead but with the sign of the directionality changed from positive to negative. That might be too hard to achieve, or may turn out to have other drawbacks, but until we think about it and make some determination, we shouldn't de-experimentalize the feature.

I am curious on what grounds you say "ideally" here? This isn't a "regular" construct, I do not think you can convert lookbehind into a true DFA construction (eg, one that moves left to right and inspects each byte only once), therefore there doesn't seem to be an ideal here at all. This is demonstrated by the various ways that lookbehind is implemented in other regex engines. See

https://www.regular-expressions.info/lookaround.html

where there is a pretty good summary of the different implementations and meanings for variable length lookbehind. Unfortunately there is not a consensus. Some treat lookbehind as atomic (as far as I can tell we do not[1]), some match truly right to left, (we do not), some match shortest to longest, some match longest to shortest like we do. Some match in "alternation order". PCRE matches in alternation order but atomically. So whatever we do we are aligned with some other regex engines and not aligned with others.

It seems to me that whatever choice we make with the current implementation we violate the expectation that alternatives should match in the order they are specified.

Consider:

"aafoo"=~/(?=foo)(?<=(a|aa))/

With our current max-left to min-left with left to right semantics model $1 will end up as "aa". This violates the expectation that it matches "a" as it is the first alternation.

But if we change it as you say to min-left to max-left with left-to-right semantics we would break this:

"aaafoo"=~/(?=foo)(?<=(aa|a))/

and we would match "a" first, which would violate the expectation that we match "aa". What this says to me is that as long as we match with "normal left-to-right" semantics, as we currently do, we are going to do the "wrong" thing sometimes with positive lookbehind (negative lookbehind doesn't capture, so these questions are irrelevant there).

Arguably we should be converting both into an alternation of lookbehinds (thanks to Hugo for this observation), which would then behave as expected. This is, AFAIK, how PCRE would match (except that it would treat the construct as atomic). In each pair below the first pattern is the original and the second is the conversion:

"aafoo"=~/(?=foo)(?<=(aa|a))/
"aafoo"=~/(?=foo)(?|(?<=(aa))|(?<=(a)))/

"aafoo"=~/(?=foo)(?<=(a|aa))/
"aafoo"=~/(?=foo)(?|(?<=(a))|(?<=(aa)))/

You may have noticed that I had to use (?| ... ) in this conversion. That is because ((?<=a)) does not capture anything, so I could not convert

(?<=(a|aa))

into

((?<=a)|(?<=aa))

as it would not capture the contents. And converting it into

(?:(?<=(a))|(?<=(aa)))

would have meant two capturing buffers not one. (?| ...) resolves that problem. However it demonstrates the issues and subtleties that come up with considering alternative implementations.

Notice that the translation for (?<! ...) would be different. In that case we can ignore capturing buffers (if their contents matches then the pattern fails, so the capture buffer doesn't get populated) and turn

/(?=foo)(?<!a|aa)/

into

/(?=foo)(?<!a)(?<!aa)/

So really we only care about this fine point of the semantics with positive lookbehind which captures, if there is no capture then it doesn't matter which we match.

Where this gets interesting, sort of, I think, is the scenario that got you started on this, which is where case-insensitive matches can be implicitly variable length. So for instance /(?=\xDF)/i should be equivalent to /(?=\xDF|[sS][sS])/. The reason I said "sort of" is that it seems to me these questions only matter when the alternatives which would match overlap (e.g. could match each other), like (a|aa). Just guessing, I would assume that with Unicode case folding there aren't any such cases. If that is true then we can forget those cases; if it isn't, e.g. if there is some character which when folded can match "X" and "XX", then we would have to make some decisions.

Anyway, what I am saying here is that unless we are going to leave this marked experimental until we totally change how lookbehind is implemented, changing it from going to max-left to min-left or min-left to max-left is really not going to "save the day". It just moves the bugs around.

But I am of the opinion that it is unlikely that we will do these changes, and if we do I would suggest that some kind of pragma to opt in to the new (or old) semantics would suffice. I also feel that going from 'max-left' to 'min-left' is likely to be more efficient on average, especially if you consider that we could use the AHOCORASICK/TRIE opcode to perform the match efficiently left to right for many cases.

I also think that the current implementation implies the least surprise; the normal rule of thumb is "leftmost longest". Consider the principles laid out in perlretut:

-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----

When a regexp can match a string in several different ways, we can use the principles above to predict which way the regexp will match:

=over 4

=item *

Principle 0: Taken as a whole, any regexp will be matched at the earliest possible position in the string.

=item *

Principle 1: In an alternation C<a|b|c...>, the leftmost alternative that allows a match for the whole regexp will be the one used.

=item *

Principle 2: The maximal matching quantifiers C<'?'>, C<'*'>, C<'+'> and C<{n,m}> will in general match as much of the string as possible while still allowing the whole regexp to match.

=item *

Principle 3: If there are two or more elements in a regexp, the leftmost greedy quantifier, if any, will match as much of the string as possible while still allowing the whole regexp to match. The next leftmost greedy quantifier, if any, will try to match as much of the string remaining available to it as possible, while still allowing the whole regexp to match. And so on, until all the regexp elements are satisfied.

=back

As we have seen above, Principle 0 overrides the others. The regexp will be matched as early as possible, with the other principles determining how the regexp matches at that earliest character position.

-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----

My view is that matching lookbehind from max-left to min-left, as we do, is the most aligned with the principles above. And given we aren't going to implement right to left DFA style matching any time soon, waiting for these inconsistencies to be resolved is going to mean that positive lookbehind stays experimental for a very long time, potentially forever.

Another point I think is relevant, which I mentioned earlier but I would like to call more attention to is that the only place where we should care about this at all is when there is a positive lookbehind which contains a capture buffer, either because it would change the content of the capture buffer, or because that captured text is used later via a backreference or both. If there is no capture buffer inside of the positive lookbehind it doesn't matter what it matches. So if you really felt like we had to keep the door open for changes in the future then I would say we should change the experimental status on this so the vlb warning is ONLY produced when you capture inside of a positive lookbehind. But I feel like it is unnecessary to do so.

Given that there are multiple reasonable interpretations for how lookbehind should match, meaning there is no true "ideal" match order, and given the current implementation is the most compatible with the base principles of how matching works, I am comfortable with signing this off as it is. If people cared about these semantic issues they would have raised them in the last three years. If we ever rewrite the regex engine enough to be able to offer a different implementation we can give people a way to choose which they want. Or even introduce new constructs so they can use both at the same time. Now that you have introduced (*positive_lookbehind:...) we can easily add (*dfa_positive_lookbehind:...) or something like it for the new behavior.

cheers, Yves

[1] This demonstrates that positive lookbehind is not atomic and that we do backtrack into it:

./perl -Ilib -le'print "aaz"=~/(?<=(a|aa))\1z/ ? "yes:$1:$&" : "no"'
yes:a:az
./perl -Ilib -le'print "aaz"=~/(?<=(aa|a))\1z/ ? "yes:$1:$&" : "no"'
yes:a:az

demerphq commented 2 years ago

On Wed, 23 Feb 2022 at 04:17, demerphq @.***> wrote:

Correction. Where I said:

> for instance /(?=\xDF)/i should be equivalent to /(?=\xDF|[sS][sS])/. The reason I said "sort of" is that it seems to me these questions only matter when the

I meant:

for instance /(?<=\xDF)/i should be equivalent to /(?<=\xDF|[sS][sS])/. The reason I said "sort of" is that it seems to me these questions only matter when the

Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"

demerphq commented 2 years ago

On Mon, 21 Feb 2022 at 03:16, demerphq @.***> wrote:

> Consider the pattern
>
> /[a-z]+(?<!m+)/
>
> I would expect this to be formally equivalent to
>
> /[a-ln-z]+/
>
> In a DFA construction. The mirror interpretation would be
>
> /[a-z]*[a-ln-z]+/

I realized after playing with this a bit that I was wrong about these conversions and I now am inclined to think that there is no clear DFA construction for lookbehind.

[a-z]+(?<!m+)

would match "mmmmmmma" so it can't be the same as /[a-ln-z]+/, it would be the same as /[a-z]*[a-ln-z]+/

But I don't think this is a good argument for the mirror interpretation. Consider that the mirror interpretation, that is, matching min-left to max-left, would produce as many errors as max-left to min-left. E.g., with a{1,2} the mirror interpretation would match "a" before it would match "aa", which would be wrong.

> The length restrictions make it something else yet again.

The length restriction is enforced so this concern is resolved for me.

cheers, yves


khwilliamson commented 2 years ago

On 2/22/22 20:17, Yves Orton wrote:

> I am curious on what grounds you say "ideally" here?

I think now I was wrong. I think we have to fill groups L-R, so that $1 corresponds to the group begun by the leftmost left parenthesis; and that means we can't do a mirror image.

So maybe the semantics are currently fine.

demerphq commented 2 years ago

On Tue, 22 Feb 2022 at 15:05, Karl Williamson @.***> wrote:

> Thanks, but I'm not so sure it's ready to be de-experimentalized.

Ok, well I have removed the de-experimentalization patch from https://github.com/Perl/perl5/pull/19442

IMO we need to get that merged in time for 5.36.0 whatever we do. The current implementation is just buggy.

Yves

demerphq commented 2 years ago

On Wed, 23 Feb 2022 at 04:42, Karl Williamson @.***> wrote:

> On 2/22/22 20:17, Yves Orton wrote:
>
> > I am curious on what grounds you say "ideally" here?
>
> I think now I was wrong. I think we have to fill groups L-R, so that $1 corresponds to the group begun by the leftmost left parenthesis; and that means we can't do a mirror image.

We will violate the expectations of alternation with our current base implementation whatever we do. Changing the order we try things doesn't help. With max-left to min-left we will do /(?<=(a|aa))/ and /(?<=(a{1,2}?))/ wrong. With min-left to max-left we will do /(?<=(aa|a))/ and /(?<=(a{1,2}))/ wrong.

We can't fix this problem by changing the order we try things. We have to change things entirely.

If you are really concerned we can make the experimental flag trigger only when there is variable length positive lookbehind that contains a capture buffer. All the other variants of lookbehind don't care.

> So maybe the semantics are currently fine.

I think the current semantics are the least surprising of the options available to us without forcing a complete rewrite, and the ones that are most compliant with the base principles of matching, that we match the leftmost thing first.

So I am fine with us removing the experimental status.

I have removed the de experiment patch from

https://github.com/Perl/perl5/pull/19442

and pushed the de experiment patch as:

https://github.com/Perl/perl5/pull/19454

so that I can start work on a patch to make the experimental warning trigger only when lookbehind is variable length AND it includes capturing. Then we have some options.

Yves

demerphq commented 2 years ago

On Wed, 23 Feb 2022 at 04:54, demerphq @.***> wrote:

> I have removed the de experiment patch from
>
> https://github.com/Perl/perl5/pull/19442

Please review and let me know if you have any objections to it being merged. I have other patches queuing which expect to apply on top of it.

> and pushed the de experiment patch as:
>
> https://github.com/Perl/perl5/pull/19454
>
> so that I can start work on a patch to make the experimental warning trigger only when lookbehind is variable length AND it includes capturing. Then we have some options.

I have pushed the above as:

https://github.com/Perl/perl5/pull/19455

as an alternative to

https://github.com/Perl/perl5/pull/19454

cheers, Yves


hvds commented 2 years ago

My initial concern here was that it didn't feel like there had been much discussion of the semantics. @demerphq has convinced me at least that the current semantics are coherent enough to be viable (though I'm not as completely convinced that they are necessary or ideal).

My newer concern is that a pretty huge issue such as #19168 was not found by people using it in the wild, which suggests to me it has had minimal take-up even for the //i context that motivated @khwilliamson to add it - I'm not sure on what basis we can declare the experiment successful if nobody has used it.

@jkeenan asked at the top of this issue:

> Do we have any criteria for determining whether an experiment has been successful or not (other than "people haven't complained about it")?

.. which I don't think ever got an answer.

demerphq commented 2 years ago

On Wed, 23 Feb 2022 at 12:49, Hugo van der Sanden @.***> wrote:

> My initial concern here was that it didn't feel like there had been much discussion of the semantics. @demerphq has convinced me at least that the current semantics are coherent enough to be viable (though I'm not as completely convinced that they are necessary or ideal).

Fair and reasonable.

> My newer concern is that a pretty huge issue such as #19168 was not found by people using it in the wild, which suggests to me it has had minimal take-up even for the //i context that motivated @khwilliamson to add it - I'm not sure on what basis we can declare the experiment successful if nobody has used it.

I think the /i case is handled differently than the true alternation case. FWIW, people did know, for instance

https://www.regular-expressions.info/lookaround.html

mentions that our implementation is buggy. But they didn't tell us, I guess. :-(

> @jkeenan asked at the top of this issue:
>
> > Do we have any criteria for determining whether an experiment has been successful or not (other than "people haven't complained about it")?
>
> .. which I don't think ever got an answer.

Fair question. I don't know what to say.

Cheers, Yves


khwilliamson commented 2 years ago

On 2/23/22 05:00, Yves Orton wrote:

On Wed, 23 Feb 2022 at 12:49, Hugo van der Sanden @.***> wrote:

> My newer concern is that a pretty huge issue such as #19168 was not found by people using it in the wild, which suggests to me it has had minimal take-up even for the //i context that motivated @khwilliamson to add it - I'm not sure on what basis we can declare the experiment successful if nobody has used it.

Any time one has a sequence 'ss' (in any combination of caps) as part of a lookbehind assertion under /iu rules, one is implicitly using a variable-length lookbehind. Such sequences are very common. Tickets became closable upon the introduction of VLB, and no new ones have since been generated. I argue that this indicates the feature has had significant field testing in that regard.

> I think the /i case is handled differently than the true alternation case. FWIW, people did know, for instance
>
> https://www.regular-expressions.info/lookaround.html
>
> mentions that our implementation is buggy. But they didn't tell us, I guess. :-(

I just sent email to the site asking for failing test cases

> > @jkeenan asked at the top of this issue:
> >
> > > Do we have any criteria for determining whether an experiment has been successful or not (other than "people haven't complained about it")?
> >
> > .. which I don't think ever got an answer.
>
> Fair question. I don't know what to say.

Many people, myself generally included, will avoid using an experimental feature, to avoid being on the bleeding edge.

My view is that at some point one has to declare it as accepted absent negative feedback. This means that we commit to supporting it if buggy.
