Perl / perl5

đŸȘ The Perl programming language
https://dev.perl.org/perl5/
Other
1.91k stars 542 forks source link

split() in do{} and eval{} blocks lacks implicit limit #13104

Open p5pRT opened 11 years ago

p5pRT commented 11 years ago

Migrated from rt.perl.org#118879 (status was 'open')

Searchable as RT118879$

p5pRT commented 11 years ago

From agalama@mailfish.de

perldoc split says​: "when assigning to a list\, if LIMIT is omitted (or zero)\, then LIMIT is treated as though it were one larger than the number of variables in the list".

Apparently\, this is not the case within a do{} or eval{} block​:

use Data​::Dumper;

# implicit limit = 3 in accordance with the docs # -> empty trailing field preserved ($a\, $b) = split /\,/\, 'a\,'; print Dumper($b); # empty string

# implicit limit = 0 contrary to the docs # -> empty trailing field stripped ($a\, $b) = do { split /\,/\, 'a\,' }; print Dumper($b); # undef

($a\, $b) = eval { split /\,/\, 'a\,' }; print Dumper($b); # undef

__END__


Flags​:   category=core   severity=low


This perlbug was built using Perl 5.14.4 in the Fedora build system. It is being executed now by Perl 5.14.4 - Thu Apr 11 12​:56​:04 UTC 2013.

Site configuration information for perl 5.14.4​:

Configured by Red Hat\, Inc. at Thu Apr 11 12​:56​:04 UTC 2013.

Summary of my perl5 (revision 5 version 14 subversion 4) configuration​:

  Platform​:   osname=linux\, osvers=2.6.32-358.2.1.el6.x86_64\, archname=x86_64-linux-thread-multi   uname='linux buildvm-11.phx2.fedoraproject.org 2.6.32-358.2.1.el6.x86_64 #1 smp wed feb 20 12​:17​:37 est 2013 x86_64 x86_64 x86_64 gnulinux '   config_args='-des -Doptimize=-O2 -g -pipe -Wall -Wp\,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -Dccdlflags=-Wl\,--enable-new-dtags -Dlddlflags=-shared -O2 -g -pipe -Wall -Wp\,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -Wl\,-z\,relro -DDEBUGGING=-g -Dversion=5.14.4 -Dmyhostname=localhost -Dperladmin=root@​localhost -Dcc=gcc -Dcf_by=Red Hat\, Inc. -Dprefix=/usr -Dvendorprefix=/usr -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl5 -Dsitearch=/usr/local/lib64/perl5 -Dprivlib=/usr/share/perl5 -Dvendorlib=/usr/share/perl5/vendor_perl -Darchlib=/usr/lib64/perl5 -Dvendorarch=/usr/lib64/perl5/vendor_perl -Darchname=x86_64-linux-thread-multi -Dlibpth=/usr/local/lib64 /lib64 /usr/lib64 -Duseshrplib -Dusethreads -Duseithreads -Dusedtrace=/usr/bin/dtrace -Duselargefiles -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl=n -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr -Dd_gethostent_r_proto -Ud_endhostent_r_proto -Ud_sethostent_r_proto -Ud_endprotoent_r_proto -Ud_setprotoent_r_proto -Ud_endservent_r_proto -Ud_setservent_r_proto -Dscriptdir=/usr/bin'   hint=recommended\, useposix=true\, d_sigaction=define   useithreads=define\, usemultiplicity=define   useperlio=define\, d_sfio=undef\, uselargefiles=define\, usesocks=undef   use64bitint=define\, use64bitall=define\, uselongdouble=undef   usemymalloc=n\, bincompat5005=undef   Compiler​:   cc='gcc'\, ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'\,   optimize='-O2 -g -pipe -Wall -Wp\,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic'\,   cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'   ccversion=''\, gccversion='4.7.2 20120921 (Red Hat 4.7.2-2)'\, gccosandvers=''   intsize=4\, longsize=8\, ptrsize=8\, doublesize=8\, byteorder=12345678   d_longlong=define\, longlongsize=8\, d_longdbl=define\, longdblsize=16   ivtype='long'\, ivsize=8\, nvtype='double'\, nvsize=8\, Off_t='off_t'\, lseeksize=8   alignbytes=8\, prototype=define   Linker and Libraries​:   ld='gcc'\, ldflags =' -fstack-protector'   libpth=/usr/local/lib64 /lib64 /usr/lib64   libs=-lresolv -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc -lgdbm_compat   perllibs=-lresolv -lnsl -ldl -lm -lcrypt -lutil -lpthread -lc   libc=\, so=so\, useshrplib=true\, libperl=libperl.so   gnulibc_version='2.15'   Dynamic Linking​:   dlsrc=dl_dlopen.xs\, dlext=so\, d_dlsymun=undef\, ccdlflags='-Wl\,--enable-new-dtags -Wl\,-rpath\,/usr/lib64/perl5/CORE'   cccdlflags='-fPIC'\, lddlflags='-shared -O2 -g -pipe -Wall -Wp\,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -Wl\,-z\,relro '

Locally applied patches​:


@​INC for perl 5.14.4​:   /home/xxx/src/perl   /usr/local/lib64/perl5   /usr/local/share/perl5   /usr/lib64/perl5/vendor_perl   /usr/share/perl5/vendor_perl   /usr/lib64/perl5   /usr/share/perl5   .


Environment for perl 5.14.4​:   HOME=/home/xxx   LANG=en_US.UTF-8   LANGUAGE (unset)   LD_LIBRARY_PATH (unset)   LOGDIR (unset)

PATH=/usr/lib64/qt-3.3/bin​:/usr/local/bin​:/usr/bin​:/usr/local/sbin​:/usr/sbin​:/home/xxx/.local/bin​:/home/xxx/bin   PERL5LIB=/home/xxx/src/perl   PERL_BADLANG (unset)   SHELL=/bin/bash

p5pRT commented 11 years ago

From @jkeenan

On Thu Jul 11 11​:35​:33 2013\, galama wrote​:

perldoc split says​: "when assigning to a list\, if LIMIT is omitted (or zero)\, then LIMIT is treated as though it were one larger than the number of variables in the list".

Apparently\, this is not the case within a do{} or eval{} block​:

use Data​::Dumper;

# implicit limit = 3 in accordance with the docs # -> empty trailing field preserved ($a\, $b) = split /\,/\, 'a\,'; print Dumper($b); # empty string

# implicit limit = 0 contrary to the docs # -> empty trailing field stripped ($a\, $b) = do { split /\,/\, 'a\,' }; print Dumper($b); # undef

($a\, $b) = eval { split /\,/\, 'a\,' }; print Dumper($b); # undef

__END__

Confirmatory code​:

$ perl -MData​::Dumper -E '$Data​::Dumper​::Indent=0;($a\, $b) = split /\,/\, "a\,"; say Dumper ([$a\,$b]);' $VAR1 = ['a'\,''];

$ perl -MData​::Dumper -E '$Data​::Dumper​::Indent=0;($a\, $b) = do { split /\,/\, "a\," }; say Dumper ([$a\,$b]);' $VAR1 = ['a'\,undef];

$ perl -MData​::Dumper -E '$Data​::Dumper​::Indent=0;($a\, $b) = eval { split /\,/\, "a\," }; say Dumper ([$a\,$b]);' $VAR1 = ['a'\,undef];

p5pRT commented 11 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 11 years ago

From @rjbs

* Anand Galama \perlbug\-followup@​perl\.org [2013-07-11T14​:35​:33]

perldoc split says​: "when assigning to a list\, if LIMIT is omitted (or zero)\, then LIMIT is treated as though it were one larger than the number of variables in the list".

Apparently\, this is not the case within a do{} or eval{} block​:

I imagine the easiest and more appropriate fix is to say "when immediately assigning" rather thant o try to address every form of indirection.

-- rjbs

p5pRT commented 11 years ago

From @cpansprout

On Sun Jul 14 05​:44​:46 2013\, perl.p5p@​rjbs.manxome.org wrote​:

* Anand Galama \perlbug\-followup@​perl\.org [2013-07-11T14​:35​:33]

perldoc split says​: "when assigning to a list\, if LIMIT is omitted (or zero)\, then LIMIT is treated as though it were one larger than the number of variables in the list".

Apparently\, this is not the case within a do{} or eval{} block​:

I imagine the easiest and more appropriate fix is to say "when immediately assigning" rather thant o try to address every form of indirection.

I always thought this was just an optimisation and it was only documented so that one could intentionally trigger it.

But as the OP points out\, it can change undef into an empty string.

Do we want to make it so that this optimisation is purely an optimisation?

--

Father Chrysostomos

p5pRT commented 11 years ago

From agalama@mailfish.de

On Sun Jul 14 05​:44​:46 2013\, perl.p5p@​rjbs.manxome.org wrote​:

I imagine the easiest and more appropriate fix is to say "when immediately assigning" rather thant o try to address every form of indirection.

Thanks\, this just lead me to another even simpler example. :-)

($n\, $a\, $b) = (1\, split /\,/\, 'a\,');

This also makes $b undef just like in the do{} and eval{} case but I would still call it an "immediate" assignment.

p5pRT commented 11 years ago

From @ap

* Father Chrysostomos via RT \perlbug\-followup@​perl\.org [2013-07-14 16​:00]​:

Do we want to make it so that this optimisation is purely an optimisation?

I think it should be.

But how to go about it? Do you want to expose this capability in some way so the user can invoke it explicitly? If not\, would you remove the paragraph about optimisation from perlfunc?

This is also a very wide-spread usage\, undoubtedly relied on unwittingly by a *lot* of code. How to detect that? Regular deprecation does not seem the answer\, since this deprecation warning would pour forth from perl like floodwaters.

Off the cuff the only thing I can think of is for `split` to return a special magicalised empty string during the deprecation cycle\, which would trigger the deprecation warning when used in
 certain contexts. (Non-boolean?)

(No doubt this would also clean up code that is inadvertently buggy though.)

* Anand Galama via RT \perlbug\-followup@​perl\.org [2013-07-15 05​:55]​:

($n\, $a\, $b) = (1\, split /\,/\, 'a\,');

This also makes $b undef just like in the do{} and eval{} case but I would still call it an "immediate" assignment.

Does this do it for you?

Inline Patch ```diff diff --git i/pod/perlfunc.pod w/pod/perlfunc.pod index 18ecd40..c108733 100644 --- i/pod/perlfunc.pod +++ w/pod/perlfunc.pod @@ -6959,8 +6959,9 @@ produces the output 'a:b:c', but the following: produces the output 'a:b:c:::'. In time-critical applications, it is worthwhile to avoid splitting -into more fields than necessary. Thus, when assigning to a list, -if LIMIT is omitted (or zero), then LIMIT is treated as though it +into more fields than necessary. Thus, when the right-hand side of +a list assignment consists only of a C, and LIMIT is omitted +(or zero), then LIMIT is treated as though it were one larger than the number of variables in the list; for the following, LIMIT is implicitly 3: ```
p5pRT commented 11 years ago

From @ap

* Aristotle Pagaltzis \pagaltzis@​gmx\.de [2013-07-16 03​:00]​:

(No doubt this would also clean up code that is inadvertently buggy though.)

(By which I mean code that’s currently written to expect it will get undef for missing trailing fields\, but triggers the optimisation and therefore doesn’t.)

p5pRT commented 11 years ago

From @cpansprout

On Mon Jul 15 17​:58​:37 2013\, aristotle wrote​:

* Father Chrysostomos via RT \perlbug\-followup@​perl\.org [2013-07-14 16​:00]​:

Do we want to make it so that this optimisation is purely an optimisation?

I think it should be.

But how to go about it? Do you want to expose this capability in some way so the user can invoke it explicitly?

An explicit third argument does that\, does it not?

If not\, would you remove the paragraph about optimisation from perlfunc?

Instead of explaining how the optimisation works\, just state the circumstance in which it occurs.

Then we can fix #103212 as well.

This is also a very wide-spread usage\, undoubtedly relied on unwittingly by a *lot* of code. How to detect that? Regular deprecation does not seem the answer\, since this deprecation warning would pour forth from perl like floodwaters.

Off the cuff the only thing I can think of is for `split` to return a special magicalised empty string during the deprecation cycle\, which would trigger the deprecation warning when used in
 certain contexts. (Non-boolean?)

If the deprecation warning could be spelt ‘Use of uninitialized value’\, I think we already have such a scalar. :-)

Or do you mean one for which defined($_) returns true\, but which is treated as undef elsewhere? Yes\, that would be possible.

But how much code is really relying on that single empty string?

$ ./perl -Ilib -e '($a\,$b\,$c\,$d\,$e\,$f) = split //\, "abc"; warn "["\,$_//"undef"\,"]" for $a\,$b\,$c\,$d\,$e\,$f' [a] at -e line 1. [b] at -e line 1. [c] at -e line 1. [] at -e line 1. [undef] at -e line 1. [undef] at -e line 1.

(No doubt this would also clean up code that is inadvertently buggy though.)

* Anand Galama via RT \perlbug\-followup@​perl\.org [2013-07-15 05​:55]​:

($n\, $a\, $b) = (1\, split /\,/\, 'a\,');

This also makes $b undef just like in the do{} and eval{} case but I would still call it an "immediate" assignment.

Does this do it for you?

diff --git i/pod/perlfunc.pod w/pod/perlfunc.pod index 18ecd40..c108733 100644 --- i/pod/perlfunc.pod +++ w/pod/perlfunc.pod @​@​ -6959\,8 +6959\,9 @​@​ produces the output 'a​:b​:c'\, but the following​: produces the output 'a​:b​:c​::​:'.

In time-critical applications\, it is worthwhile to avoid splitting -into more fields than necessary. Thus\, when assigning to a list\, -if LIMIT is omitted (or zero)\, then LIMIT is treated as though it +into more fields than necessary. Thus\, when the right-hand side of +a list assignment consists only of a C\\, and LIMIT is omitted +(or zero)\, then LIMIT is treated as though it were one larger than the number of variables in the list; for the following\, LIMIT is implicitly 3​:

--

Father Chrysostomos

p5pRT commented 11 years ago

From @ap

* Father Chrysostomos via RT \perlbug\-followup@​perl\.org [2013-07-16 03​:15]​:

On Mon Jul 15 17​:58​:37 2013\, aristotle wrote​:

* Father Chrysostomos via RT \perlbug\-followup@​perl\.org [2013-07-14 16​:00]​:

Do we want to make it so that this optimisation is purely an optimisation?

I think it should be.

But how to go about it? Do you want to expose this capability in some way so the user can invoke it explicitly?

An explicit third argument does that\, does it not?

No?

Now I’m not sure we’re even talking about the same thing.

An explicit third argument tells `split` to stop splitting at the given number of fields\, *and* return a trailing empty string field if any.

The user can omit the third argument (and trip up the optimisation) to make `split` *not* stop splitting *and* *not* return any trailing empty string field.

With the current interface there is no way for the user to ask for one of these (stop splitting at a certain field) *but not* the other (return a trailing empty string field).

What I understood you to be asking is whether

  perl -E '($n\, $a\, $b) = (1\, split /\,/\, "a\,"); say 1 if defined $b'   perl -E ' ($a\, $b) = (split /\,/\, "a\,"); say 1 if defined $b'

should both be silent (where currently the latter isn’t) while at the same time the `split` on the second line should know to stop splitting at the third field\, unlike the one on the first line (as is currently true).

In other words\, should `split` be able to do something (stop splitting *but not* return a trailing empty field) that the user can’t currently explicitly ask it for?

Now when you deparse `($a\,$b) = split//`\, it actually says this​:

  ($a\, $b) = split(//\, $_\, 3);

If the optimisation were made transparent\, then deparsing the `split` like this would be a lie\, because if the user wrote that\, she would get a subtly different behaviour than if the optimiser makes it like that.

Either the optimisation would have to trigger in some other way (within `pp_split`?) and become invisible at the op tree level – i.e. deparsing that statement would yield this​:

  ($a\, $b) = split(//\, $_);

If that is not the case\, and the optimisation is reflected in the op tree\, then the interface to `split` would have to change such that when the user writes the same code as Deparse outputs\, she gets the exact same behaviour (i.e. it stops splitting *but doesn’t* return a trailing empty string).

This is also a very wide-spread usage\, undoubtedly relied on unwittingly by a *lot* of code. How to detect that? Regular deprecation does not seem the answer\, since this deprecation warning would pour forth from perl like floodwaters.

Off the cuff the only thing I can think of is for `split` to return a special magicalised empty string during the deprecation cycle\, which would trigger the deprecation warning when used in
 certain contexts. (Non-boolean?)

If the deprecation warning could be spelt ‘Use of uninitialized value’\, I think we already have such a scalar. :-)

I am *not* proposing to deprecate the absence of a warning by warning that in the future an undef warning will be emitted. :-P

Or do you mean one for which defined($_) returns true\, but which is treated as undef elsewhere? Yes\, that would be possible.

Yes\, that is what I was thinking of.

Very specifically the case of `defined $trailing_field` is what would need to deprecation-warn\, even if nothing else does\, because that is unambiguously a case that would change behaviour after patching `split`.

Currently that expression returns true\, after the change it would return false – so any conditional that tests definedness would take the other branch in the future. This is a change in the behaviour of user code. If anything requires a deprecation warning\, that does.

Other uses of $trailing_field are murkier in whether they will cause any change in behaviour. I don’t know exactly how trigger-happy the warning should be. I am still trying to reason it out methodically\, which is not my strength.

But how much code is really relying on that single empty string?

$ ./perl -Ilib -e '($a\,$b\,$c\,$d\,$e\,$f) = split //\, "abc"; warn "["\,$_//"undef"\,"]" for $a\,$b\,$c\,$d\,$e\,$f' [a] at -e line 1. [b] at -e line 1. [c] at -e line 1. [] at -e line 1. [undef] at -e line 1. [undef] at -e line 1.

I don’t know. I’m not worried about the total amount in and of itself. It is more the sense that the amount of code relying on this might not be entirely trivial\, even if it is small\, along with the sense that most reliance on this behaviour is inadvertent. That is to say\, if we change this\, a small but not asymptotically zero amount of code would change behaviour in hard-to-pinpoint circumstances. That makes it important\, to my mind\, to give users the help and time they’ll need to spot it.

It can also construed a documented and advertised feature if you read the documentation hard enough. Therefore “fixing” it could be construed as breaking a documented interface.

But I suppose running a CPAN smoke using the transparent optimisation could tell us how much immediately obvious breakage results\, before we think further on this. My worry may be unwarranted?

PS.​: it is irritating how much verbiage it takes to talk clearly about   these in-and-of-themselves fairly small issues. I would have   written a shorter bikeshed\, but I did not have the knowledge how.

Regards\, -- Aristotle Pagaltzis // \<http​://plasmasturm.org/>

p5pRT commented 11 years ago

From agalama@mailfish.de

On Mon Jul 15 17​:58​:37 2013\, aristotle wrote​:

Does this do it for you?

diff --git i/pod/perlfunc.pod w/pod/perlfunc.pod index 18ecd40..c108733 100644 --- i/pod/perlfunc.pod +++ w/pod/perlfunc.pod @​@​ -6959\,8 +6959\,9 @​@​ produces the output 'a​:b​:c'\, but the following​: produces the output 'a​:b​:c​::​:'.

In time-critical applications\, it is worthwhile to avoid splitting -into more fields than necessary. Thus\, when assigning to a list\, -if LIMIT is omitted (or zero)\, then LIMIT is treated as though it +into more fields than necessary. Thus\, when the right-hand side of +a list assignment consists only of a C\\, and LIMIT is omitted +(or zero)\, then LIMIT is treated as though it were one larger than the number of variables in the list; for the following\, LIMIT is implicitly 3​:

Hmmm​:

($a\, $b) = split /\,/\, 'a\,'; # ('a'\, '') as expected ($a\, $b) = @​a = split /\,/\, 'a\,'; # also ('a'\, '') (!) although​: @​a = split /\,/\, 'a\,'; # just ('a')\, as expected

The second one would have to be changed if you change the docs your way.

Interestingly​:

($a\, $b) = @​a[0\,1] = split /\,/\, 'a\,'; # ('a'\, undef)

I am sure Schrödinger would have called his cat "Split". :-)

p5pRT commented 11 years ago

From @cpansprout

On Mon Jul 15 19​:59​:07 2013\, aristotle wrote​:

The user can omit the third argument (and trip up the optimisation) to make `split` *not* stop splitting *and* *not* return any trailing empty string field.

With the current interface there is no way for the user to ask for one of these (stop splitting at a certain field) *but not* the other (return a trailing empty string field).

I see what you are saying. That would require yet some other syntax. I don’t see much value in it.

What I understood you to be asking is whether

perl \-E '\($n\, $a\, $b\) = \(1\, split /\,/\, "a\,"\); say 1 if defined $b'
perl \-E '    \($a\, $b\) =    \(split /\,/\, "a\,"\); say 1 if defined $b'

should both be silent (where currently the latter isn’t) while at the same time the `split` on the second line should know to stop splitting at the third field\, unlike the one on the first line (as is currently true).

In other words\, should `split` be able to do something (stop splitting *but not* return a trailing empty field) that the user can’t currently explicitly ask it for?

If it is just an optimisation\, then the calling code can’t know whether perl finished splitting to the end of the string. (‘Write your code like this\, and you will make it easier for the compiler to optimise things and make them faster.’) So does it matter?

Now when you deparse `($a\,$b) = split//`\, it actually says this​:

\($a\, $b\) = split\(//\, $\_\, 3\);

If the optimisation were made transparent\, then deparsing the `split` like this would be a lie\, because if the user wrote that\, she would get a subtly different behaviour than if the optimiser makes it like that.

If we change the behaviour\, then we will obviously have to change the deparsed output. It is as simple as setting a flag on the split op and have B​::Deparse check that.

--

Father Chrysostomos