Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

End of string + 0-Width assertion oddity #2406

Open p5pRT opened 24 years ago

p5pRT commented 24 years ago

Migrated from rt.perl.org#3762 (status was 'stalled')

Searchable as RT3762$

p5pRT commented 24 years ago

From @btilly

Created by ben_tilly@hotmail.com

I actually know the design decisions that led to this. I still think that this is a bug though:

perl -e '$str = "Hello World\n"; $str =~ s/\r?\n?$/\n/g; print $str;'

Try it. $str winds up with two returns. Once from matching the "\n" at the end, and once from matching a 0-width assertion. In the finest DWIM tradition I think that after matching $ you should not be able to match a zero-width assertion at that point again. Doing otherwise is likely to be unexpected. (Except when done by smartasses such as myself. :-)

Another idea that Tye McQueen tossed at me is that instead of just disallowing 0-width assertions matching twice at the same spot, disallow having two REs match ending at the same position in a /g. That rule may be simpler to implement and likewise removes other surprises, like /x*/ matching both at and after an 'x'.

Perl Info

```
Site configuration information for perl 5.00503:

Configured by tilly at Fri May 28 18:22:31 EDT 1999.

Summary of my perl5 (5.0 patchlevel 5 subversion 3) configuration:
  Platform:
    osname=linux, osvers=2.0.34, archname=i386-linux
    uname='linux mcrubs1305 2.0.34 #1 tue aug 25 19:28:36 edt 1998 i586 unknown '
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef useperlio=undef d_sfio=undef
  Compiler:
    cc='gcc', optimize='-O2', gccversion=2.7.2.3
    cppflags='-Dbool=char -DHAS_BOOL -I/usr/local/include'
    ccflags ='-Dbool=char -DHAS_BOOL -I/usr/local/include'
    stdchar='char', d_stdstdio=define, usevfork=false
    intsize=4, longsize=4, ptrsize=4, doublesize=8
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    alignbytes=4, usemymalloc=n, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lndbm -ldb -ldl -lm -lc -lposix -lcrypt
    libc=, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:

@INC for perl 5.00503:
    /usr/local/lib/perl5/5.00503/i386-linux
    /usr/local/lib/perl5/5.00503
    /usr/local/lib/perl5/site_perl/5.005/i386-linux
    /usr/local/lib/perl5/site_perl/5.005
    .

Environment for perl 5.00503:
    HOME=/home/tilly
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games:.:.
    PERL_BADLANG (unset)
    SHELL=/bin/bash
```

p5pRT commented 24 years ago

From @ysth

In article <LAW2-F130FRaqbC8wSA000014bd@hotmail.com>, "Ben Tilly" <ben_tilly@hotmail.com> wrote:

> perl -e '$str = "Hello World\n"; $str =~ s/\r?\n?$/\n/g; print $str;'
>
> Try it. $str winds up with two returns. Once from matching the "\n" at the end, and once from matching a 0-width assertion. In the finest DWIM tradition I think that after matching $ you should not be able to match a zero-width assertion at that point again. Doing otherwise is likely to be unexpected. (Except when done by smartasses such as myself. :-)
>
> Another idea that Tye McQueen tossed at me is that instead of just disallowing 0-width assertions matching twice at the same spot, disallow having two REs match ending at the same position in a /g. That rule may be simpler to implement and likewise removes other surprises, like /x*/ matching both at and after an 'x'.

I don't understand the problem. It's matching once at pos 11 (with 0 \r's and 1 \n followed by EOL) and once at pos 12 (0 \r's, 0 \n's and EOL).
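For anyone who wants to see those two matches directly, here is a minimal sketch. The variable names are illustrative and not from the thread. It walks the same pattern with a scalar-context `//g` loop and records each match's start and end offsets from the built-in `@-` and `@+` arrays, then runs the substitution from the report.

```perl
use strict;
use warnings;

my $str = "Hello World\n";
my @spans;

# In scalar context, //g returns one match per iteration.
# $-[0] and $+[0] are the start and end offsets of the last match.
while ($str =~ /\r?\n?$/g) {
    push @spans, join("..", $-[0], $+[0]);
}
print join(", ", @spans), "\n";    # 11..12, 12..12

# The substitution from the report replaces both matches with "\n".
(my $copy = $str) =~ s/\r?\n?$/\n/g;
print length($copy), "\n";         # 13, i.e. "Hello World\n\n"
```

The second span is empty (12..12). It is allowed because the match before it was not zero-length.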

The following, on the other hand, does seem a little odd:

perl -Dr -e '$str="Hello World\n"; print $str=~s/\n??$/\n/g," matches!\n"'

p5pRT commented 24 years ago

From @btilly

Yitzchak Scott-Thoennes wrote:

> I don't understand the problem. It's matching once at pos 11 (with 0 \r's and 1 \n followed by EOL) and once at pos 12 (0 \r's, 0 \n's and EOL).

Define problem? It is documented in perlre, in a section which is thoughtfully described as difficult and needing a rewrite. (True both in 5.005_03 and 5.6.0.)

Now why doesn't it find the second match a few dozen more times? No real reason except that an exception has been made for it. I happen to think that the exception as it stands is a little more confusing than it needs to be.

> The following, on the other hand, does seem a little odd:
>
> perl -Dr -e '$str="Hello World\n"; print $str=~s/\n??$/\n/g," matches!\n"'

It works as designed. Match zero times. Try again, can't because of the exception mentioned above. Match one char. Try again, can match a zero-width assertion. Try again, finally fail because of the exception.
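The sequence above can be traced with a short sketch. The helper variables are illustrative. It records the offsets of each match of the non-greedy pattern, then counts the substitutions.

```perl
use strict;
use warnings;

my $str = "Hello World\n";
my @spans;

# Record start..end offsets for each match of the non-greedy pattern.
while ($str =~ /\n??$/g) {
    push @spans, join("..", $-[0], $+[0]);
}
print join(", ", @spans), "\n";    # 11..11, 11..12, 12..12

# s///g returns the number of substitutions it made.
my $copy  = $str;
my $count = ($copy =~ s/\n??$/\n/g);
print "$count matches!\n";         # 3 matches!
```

The first match is empty (before the final newline), the second is forced to be non-empty at the same position, and the third is the empty match at the very end.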

*shrug*

Ben

p5pRT commented 24 years ago

From @ysth

In article <LAW2-F54lLRa8174jDJ0000488b@hotmail.com>, "Ben Tilly" <ben_tilly@hotmail.com> wrote:

> Yitzchak Scott-Thoennes wrote:

>> I don't understand the problem. It's matching once at pos 11 (with 0 \r's and 1 \n followed by EOL) and once at pos 12 (0 \r's, 0 \n's and EOL).
>
> Define problem? It is documented in perlre, in a section which is thoughtfully described as difficult and needing a rewrite. (True both in 5.005_03 and 5.6.0.)

I guess I should have said I don't understand your proposed solution.

> Now why doesn't it find the second match a few dozen more times? No real reason except that an exception has been made for it. I happen to think that the exception as it stands is a little more confusing than it needs to be.

How would you simplify it? From your phrase "after matching $ you should not be able to match a zero-width assertion at that point again" it's not clear to me what you are proposing.

>> The following, on the other hand, does seem a little odd:
>>
>> perl -Dr -e '$str="Hello World\n"; print $str=~s/\n??$/\n/g," matches!\n"'
>
> It works as designed. Match zero times. Try again, can't because of the exception mentioned above. Match one char. Try again, can match a zero-width assertion. Try again, finally fail because of the exception.

Yes, of course. Now I don't see what was confusing me. :) I guess I was just getting too caught up in the -Dr output to really think.

p5pRT commented 24 years ago

From @btilly

Yitzchak Scott-Thoennes wrote:

> In article <LAW2-F54lLRa8174jDJ0000488b@hotmail.com>, "Ben Tilly" <ben_tilly@hotmail.com> wrote:
>
>> Yitzchak Scott-Thoennes wrote:
>> [..]
>> Now why doesn't it find the second match a few dozen more times? No real reason except that an exception has been made for it. I happen to think that the exception as it stands is a little more confusing than it needs to be.
>
> How would you simplify it? From your phrase "after matching $ you should not be able to match a zero-width assertion at that point again" it's not clear to me what you are proposing.

Well my initial thought is that someone who asked to match $ is unlikely to want to match again. This came up because someone was playing around and got confused that /x*$/g matched 'x' twice.

However I believe that a simpler rule is, "two matches cannot end at the same place". That covers the current rule about two zero width assertions, and makes, eg, s/ */ /g more likely to do what most people would expect.
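To make the surprise concrete, here is a small sketch of what current Perl does with `s/ */ /g` and with the `/x*/` case. The sample strings are made up for illustration.

```perl
use strict;
use warnings;

# The run of two spaces is not collapsed to one. " *" first matches the
# run (offsets 1..3), then matches the empty string where that match ended.
(my $spaces = "a  b") =~ s/ */ /g;
print "[$spaces]\n";       # [ a  b ]

# The same effect with the /x*/ case from the report.
(my $xs = "x") =~ s/x*/-/g;
print "[$xs]\n";           # [--]

# Requiring at least one character avoids the zero-length matches.
(my $collapsed = "a  b") =~ s/ +/ /g;
print "[$collapsed]\n";    # [a b]
```

Under the proposed "two matches cannot end at the same place" rule, the first result would presumably come out as `[ a b ]`, because the empty match at offset 3 would be refused.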

>>> The following, on the other hand, does seem a little odd:
>>>
>>> perl -Dr -e '$str="Hello World\n"; print $str=~s/\n??$/\n/g," matches!\n"'
>>
>> It works as designed. Match zero times. Try again, can't because of the exception mentioned above. Match one char. Try again, can match a zero-width assertion. Try again, finally fail because of the exception.
>
> Yes, of course. Now I don't see what was confusing me. :) I guess I was just getting too caught up in the -Dr output to really think.

I ran it on a Perl without debugging. Made it easier. :-)

Cheers, Ben

PS Sorry for the resend, forgot to cc p5p the first time. :-(

p5pRT commented 24 years ago

From @ysth

In article <LAW2-F146PaatKGS7hD00003113@hotmail.com>, "Ben Tilly" <ben_tilly@hotmail.com> wrote:

> However I believe that a simpler rule is, "two matches cannot end at the same place". That covers the current rule about two zero width assertions, and makes, eg, s/ */ /g more likely to do what most people would expect.

Let me make sure I'm understanding you. So you would want this:

[D:\home\sthoenna]perl -wle "print map qq:<$_>:, 'abc'=~/.??/g"
<><a><><b><><c><>

to instead output

<><a><b><c>

?? I'm not sure that's less unexpected.
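The difference between the two outputs can be approximated in current Perl. The sketch below is only an emulation under stated assumptions: `matches_with_distinct_ends` is a hypothetical helper that filters out any match ending where the previously kept match ended. A real engine change would backtrack instead of filtering, so results could differ for other patterns.

```perl
use strict;
use warnings;

# Current behaviour: empty and one-character matches alternate.
my @current = 'abc' =~ /.??/g;
print join('', map { "<$_>" } @current), "\n";     # <><a><><b><><c><>

# Hypothetical helper: keep a match only if it ends at a new offset.
sub matches_with_distinct_ends {
    my ($text, $re) = @_;
    my @kept;
    my $last_end = -1;
    while ($text =~ /$re/g) {
        next if $+[0] == $last_end;    # ends where the last kept match ended
        push @kept, substr($text, $-[0], $+[0] - $-[0]);
        $last_end = $+[0];
    }
    return @kept;
}

my @proposed = matches_with_distinct_ends('abc', qr/.??/);
print join('', map { "<$_>" } @proposed), "\n";    # <><a><b><c>
```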

p5pRT commented 24 years ago

From @btilly

Yitzchak Scott-Thoennes wrote:

> In article <LAW2-F146PaatKGS7hD00003113@hotmail.com>, "Ben Tilly" <ben_tilly@hotmail.com> wrote:
>
>> However I believe that a simpler rule is, "two matches cannot end at the same place". That covers the current rule about two zero width assertions, and makes, eg, s/ */ /g more likely to do what most people would expect.
>
> Let me make sure I'm understanding you. So you would want this:
>
> [D:\home\sthoenna]perl -wle "print map qq:<$_>:, 'abc'=~/.??/g"
> <><a><><b><><c><>
>
> to instead output
>
> <><a><b><c>
>
> ?? I'm not sure that's less unexpected.

That would be correct, but I am dubious that /.??/g has any particularly natural meaning.

Cheers, Ben

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Ben Tilly wrote:

> That would be correct, but I am dubious that /.??/g has any particularly natural meaning.

What about

  while ($text =~ /$token/g) {
      print length($1) if /\G($optional_token)/g;
  }

?

If $optional_token matches "" then this would fail? That doesn't seem as useful as the current rule.

Note that you can't fix this by just resetting the ends-at-the-same-place flag between ops because then this:

  print length($1) while /\G($optional_token)/g;

would loop forever.
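Both situations can be tried under the current rule with a small sketch. `$optional_token` here is a stand-in pattern chosen for illustration, and the matches are bound explicitly to a named variable rather than `$_`.

```perl
use strict;
use warnings;

my $optional_token = qr/x?/;    # hypothetical token that can match ""

# Case 1: an optional token tried right after a non-empty match.
my $text = "ab";
$text =~ /a/g;                                    # pos($text) is now 1
my $matched = $text =~ /\G($optional_token)/g;    # empty match at 1 is allowed
print "matched empty token\n" if $matched && $1 eq '';

# Case 2: looping on the optional token alone. The loop terminates,
# because a second empty match at the same position is refused.
my $run = "xxy";
my @lengths;
push @lengths, length($1) while $run =~ /\G($optional_token)/g;
print "@lengths\n";             # 1 1 0
```

Under a "two matches cannot end at the same place" rule, the empty match in case 1 would be rejected, which is the loss of usefulness being pointed out.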

p5pRT commented 24 years ago

From @btilly

Rick Delaney wrote:

> Ben Tilly wrote:
>
>> [...] That would be correct, but I am dubious that /.??/g has any particularly natural meaning.
>
> What about
>
> while ($text =~ /$token/g) {
>     print length($1) if /\G($optional_token)/g;
> }
>
> ?
>
> If $optional_token matches "" then this would fail? That doesn't seem as useful as the current rule.

What about testing several optional tokens in a row at the same place? The current rule already breaks that!
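A minimal sketch of that breakage under the current rule follows. The two optional tokens `x?` and `y?` are placeholders. The zero-length state is kept with the string, not with the pattern, so a second optional token tried at the same position cannot match empty.

```perl
use strict;
use warnings;

my $text = "ab";
$text =~ /a/g;                      # non-empty match, pos($text) == 1

my $first  = $text =~ /\G(x?)/g;    # succeeds with an empty match at 1
my $second = $text =~ /\G(y?)/g;    # fails: a second empty match at 1 is refused

printf "first=%s second=%s\n",
    ($first  ? "matched" : "failed"),
    ($second ? "matched" : "failed");    # first=matched second=failed
```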

Is it better to break assumptions early\, or late?

> Note that you can't fix this by just resetting the ends-at-the-same-place flag between ops because then this:
>
> print length($1) while /\G($optional_token)/g;
>
> would loop forever.

Yup. Perhaps I should just patch the current explanation to move it up and clarify? Given that the current behaviour is already documented, I am probably in the wrong to have suggested anything else. :-(

p5pRT commented 24 years ago

From @ysth

In article <LAW2-F49HXG2U9iTb8k00004ca5@hotmail.com>, "Ben Tilly" <ben_tilly@hotmail.com> wrote:

> That would be correct, but I am dubious that /.??/g has any particularly natural meaning.

Agreed. It is good for showing people how the exception works, though.

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Ben Tilly wrote:

>> while ($text =~ /$token/g) {
>>     print length($1) if /\G($optional_token)/g;
>> }
>>
>> If $optional_token matches "" then this would fail? That doesn't seem as useful as the current rule.
>
> What about testing several optional tokens in a row at the same place? The current rule already breaks that!

Good point.

> Is it better to break assumptions early, or late?
>
>> Note that you can't fix this by just resetting the ends-at-the-same-place flag between ops because then this:
>>
>> print length($1) while /\G($optional_token)/g;
>>
>> would loop forever.
>
> Yup. Perhaps I should just patch the current explanation to move it up and clarify? Given that the current behaviour is already documented, I am probably in the wrong to have suggested anything else. :-(

You could always suggest a pragma. I can see value in each of the three behaviours mentioned.

p5pRT commented 24 years ago

From @btilly

Rick Delaney wrote:

> Ben Tilly wrote:
>
>>> while ($text =~ /$token/g) {
>>>     print length($1) if /\G($optional_token)/g;
>>> }
>>>
>>> If $optional_token matches "" then this would fail? That doesn't seem as useful as the current rule.
>>
>> What about testing several optional tokens in a row at the same place? The current rule already breaks that!
>
> Good point.
>
> [...]
>
>> Yup. Perhaps I should just patch the current explanation to move it up and clarify? Given that the current behaviour is already documented, I am probably in the wrong to have suggested anything else. :-(
>
> You could always suggest a pragma. I can see value in each of the three behaviours mentioned.

Anyone with enough tuits to use the pragma IMO should be assumed to have enough tuits to assign to pos() or write an RE that doesn't match 0-width where you don't want to.
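Both workarounds can be sketched briefly. The sample data and the `normalize_eol` helper are illustrative assumptions, not code from the thread. The first part relies on the behaviour described in perlre, that assigning to pos() resets the zero-length state. The second rewrites the substitution from the original report so that it cannot match twice.

```perl
use strict;
use warnings;

# 1. Assigning to pos() clears the "last match was zero-length" state,
#    so a second optional token can match empty at the same position.
my $text = "ab";
$text =~ /a/g;
my $first = $text =~ /\G(x?)/gc;     # empty match at offset 1
pos($text) = pos($text);             # same offset, state reset
my $second = $text =~ /\G(y?)/gc;    # now succeeds with an empty match
print "second token matched\n" if $second;

# 2. Hypothetical helper: normalize the line ending with a single,
#    non-global substitution anchored at the true end of string.
sub normalize_eol {
    my ($line) = @_;
    $line =~ s/\r?\n?\z/\n/;
    return $line;
}
print normalize_eol("Hello World\n") eq "Hello World\n" ? "ok\n" : "not ok\n";
```

Dropping /g is enough for the second case: only the first (leftmost) match is replaced, so the trailing empty match never happens.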

OTOH the pragma that I *really* want to see is one to force the RE engine to find how many matches it could have found total, and warn if that number seems excessive. This would be very useful for testing scripts for poorly written REs...

No idea how hard it would be though.

Cheers, Ben

p5pRT commented 24 years ago

From @vanstyn

In <LAW2-F64rkgeSvKHJz6000063e1@hotmail.com>, "Ben Tilly" writes:
:OTOH the pragma that I *really* want to see is one to force the
:RE engine to find how many matches it could have found total, and
:warn if that number seems excessive. This would be very useful
:for testing scripts for poorly written REs...
:
:No idea how hard it would be though.

I'm not sure I understand what you mean by 'how many matches it could have found', but guessing:

The trouble is that it is quite reasonable for the main engine to report 10^10 theoretically possible matches, while the optimiser reports that depending on the data it will quickly throw out anywhere from 0 to 10^10 of them. I'm not sure that it is possible to go from there to reporting a useful number.

Hugo

p5pRT commented 24 years ago

From @btilly

Hugo wrote:

> In <LAW2-F164gxdvzhTLSm00006166@hotmail.com>, "Ben Tilly" writes:
> :>I'm not sure I understand what you mean by 'how many matches it
> :>could have found', but guessing:
>
> I guess I guessed wrong.
>
> :OK, here is an idea of how to do it. Have a pragma that forces
> :any run of the RE engine to do a full trial run then a real run.
> :For the trial run put at the end of the RE a custom escape (see
> :the custom engine stuff in perlre) which always fails but keeps a
> :counter of how many times it was reached, bombing out if it passes
> :a fixed limit.
>
> Something like (?{ ++$cnt > $limit ? die : '' }), then? Though it is unfortunate, particularly in the context, that the code is hit twice for every zero-length match. :(

Yeah, except when it is done go back and try again.

> :Slow, inefficient, etc. But useful for smoking out poorly written
> :REs in test suites. :-)
>
> I had misunderstood: I thought you were talking about the number of comparisons performed, to catch things like exponential failure cases.

That is *exactly* what I am talking about. If you run it with expected data in the mode that I am talking about then you will get a pretty good handle on how slow the failure case would be without having to track down each RE and code up a case that would fail on that RE.

> I would imagine this is not a feature most people would consider using until they already know they have a problem, at which point there is an array of other debugging mechanisms that are probably as, if not more, useful. I'm probably still missing the point.

Unless it was mentioned as a wise test to proactively put into your standard benchmark suite...

The idea is to have an easy way to make Perl search for potentially inefficient REs\, rather than encountering them by trial and error.
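The counting idea can be tried by hand in later Perls (5.10 and up), which have the `(*FAIL)` verb. This is only a rough sketch, not a pragma: the pattern and string follow the `(*FAIL)` example in perlre, `$limit` is a made-up threshold, and the limit is checked after the run rather than aborting part-way, which a real tool would need to do for truly exponential cases.

```perl
use strict;
use warnings;

our $count = 0;
my $limit  = 1000;    # hypothetical threshold for "excessive"

# The code block runs each time the engine reaches the end of the
# pattern; (*FAIL) then forces it to backtrack and try the next path.
'aaab' =~ /a+b?(?{ $count++ })(*FAIL)/;

print "Count=$count\n";    # 9 for this pattern and string
warn "pattern reached its end more than $limit times\n" if $count > $limit;
```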

Cheers, Ben

PS Hugo, sorry for the resend. Forgot to cc p5p.

p5pRT commented 14 years ago

@chorny - Status changed from 'open' to 'stalled'