Peculiar access to lexical vars from code in regexes

p5pRT commented 20 years ago

Migrated from rt.perl.org#26909 (status was 'resolved')

Searchable as RT26909$

p5pRT commented 20 years ago

From @jlokier

Created by @jlokier

Perl allows regular expressions to contain code which is executed when that position in the pattern is reached during a match. See `man perlre' for details. For example:

/(?:\n(?{$line++})|[^\n])*/;
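As a minimal sketch of such a pattern in use, the snippet below counts newlines with the regex above. The helper name `count_newlines` and the choice of a package variable declared with `our` are illustrative assumptions, not anything Perl requires:

```perl
use strict;
use warnings;

# Package variable: there is one instance for the whole program, so the
# code block in the regex always refers to the same $line.
our $line = 0;

# Hypothetical helper: count the newline characters in a string.
sub count_newlines {
    my ($text) = @_;
    $line = 0;
    # The code block runs each time the \n branch matches.
    $text =~ /(?:\n(?{ $line++ })|[^\n])*/;
    return $line;
}

print count_newlines("one\ntwo\nthree\n"), "\n";   # prints 3
```

The counter is reset inside the helper so that repeated calls start from zero.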

If the code refers to lexical variables in the surrounding subroutine scope, the code accesses the first _instance_ of those lexicals that exists when the regex is first compiled. Subsequent calls reference that instance, even if the lexical has been destroyed and recreated by virtue of the scope having been exited and re-entered.

This is very peculiar behaviour.

For example, the following subroutine returns (1, 0, 0), not the more logical (1, 1, 1):

sub test1() { map { my $x = 0; /(?{$x++})/; $x; } (1..3) }

This subroutine returns (1, 2, 3) as expected:

sub test2() { my $x = 0; map { /(?{$x++})/; $x; } (1..3) }

However, when it is called a second time, it returns (0, 0, 0).
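To see what a particular perl does with these two subroutines, a small probe along these lines can be run. It only prints what it observes. The values quoted above are from the 5.8.0 build described at the end of this report, and other versions may well print something different. The empty prototypes are dropped here purely for convenience:

```perl
use strict;
use warnings;

# The two subroutines from the report, without the empty prototypes.
sub test1 { map { my $x = 0; /(?{$x++})/; $x; } (1..3) }
sub test2 { my $x = 0; map { /(?{$x++})/; $x; } (1..3) }

# Print the interpreter version, then whatever each call returns.
printf "perl %s\n", $];
print "test1:              @{[ test1() ]}\n";
print "test2, first call:  @{[ test2() ]}\n";
print "test2, second call: @{[ test2() ]}\n";
```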

These problems occur with string-interpolated regexes too, independent of whether or not the regex "o" flag is used (because Perl still tries to avoid recompiling a regex if it hasn't changed between calls).

To ensure the expected variables are referenced inside the regex code, the variables need to be at a large enough scope that they remain live between calls to the regex. For example, this function returns (1, 2, 3) every time it is called:

my $test3_x; sub test3() { $test3_x = 0; map { /(?{$test3_x++})/; $test3_x; } (1..3) }

In general, when code inside a regex references variables, you have to make sure those variables are globals (declared with "our") or local to the package (declared with "my" _outside_ any "sub" definitions). If you forget to do this, your program is likely harbouring an obscure bug which will be difficult to track down.

You can still use lexical scopes to confine the variable names to a small region of code. The variable simply has to be declared outside any scope which is exited and re-entered, which usually means outside a subroutine scope. For example, this subroutine also returns (1, 2, 3) every time it is called, and doesn't define any variables which are visible to any other code:

{ my $x; sub test4() { $x = 0; map { /(?{$x++})/; $x; } (1..3) } }
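Spelled out as a complete script, the same idea might look like this. The names `count_three` and `$seen` are made up for the illustration, and a `for` loop with `push` stands in for the `map`:

```perl
use strict;
use warnings;

{
    # $seen lives in the enclosing block, not in the sub, so there is only
    # ever one instance of it for the regex code block to refer to.
    my $seen;

    # Hypothetical helper: match three times, recording $seen after each.
    sub count_three {
        $seen = 0;
        my @out;
        for (1 .. 3) {
            /(?{ $seen++ })/;     # matches against $_, bumps $seen once
            push @out, $seen;     # push stores a copy of the current value
        }
        return @out;
    }
}

print join(",", count_three()), "\n";   # prints 1,2,3
print join(",", count_three()), "\n";   # prints 1,2,3 again
```

Nothing outside the bare block can see `$seen`, which is the point of wrapping the declaration and the sub together.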

When a regex object is defined using the "qr//" operator, and it is called unchanged from a match ("m//") or substitution ("s///"), code in the regex will access the first instances of lexical variables at the scope where the "qr//" appears, not where it is called.

However, if the regex object is interpolated into another pattern, code will access the first instances of lexical variables at the point of interpolation.

For example:

    my $x;  # Outer $x.
    my $re = qr/(?{$x++})/;
    {
        # Inner $x.
        my $x;
        /$re/;    # Code in the regex increments Outer $x.
        /$re()/;  # Code in the regex increments Inner $x.
    }

A consequence of this behaviour is that you can't write self-contained parsing subroutines that look like this, because they don't work:

    sub count_cats_and_dogs($) {
        my ($cats, $dogs) = (0, 0);
        $_[0] =~ /(?:.*?\b(?:cat\b(?{$cats++})|dog\b(?{$dogs++})))*/g;
        return ($cats, $dogs);
    }

Instead, you have to write in this awkward style:

    {
        my ($cats, $dogs);
        sub count_cats_and_dogs($) {
            ($cats, $dogs) = (0, 0);
            $_[0] =~ /(?:.*?\b(?:cat\b(?{$cats++})|dog\b(?{$dogs++})))*/g;
            return ($cats, $dogs);
        }
    }
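If the counting itself is all that is needed, one way to avoid the question entirely is to drop the `(?{...})` blocks and count with an ordinary capture in a loop. This is a different technique from the one above, sketched here only for comparison; `count_cats_and_dogs_plain` is a hypothetical name:

```perl
use strict;
use warnings;

# Hypothetical alternative: count whole-word "cat" and "dog" with an
# ordinary capture group and a loop. No (?{...}) blocks are involved,
# so the lexicals are plain subroutine-local variables.
sub count_cats_and_dogs_plain {
    my ($text) = @_;
    my %count = (cat => 0, dog => 0);
    # In scalar context, //g finds one match per iteration.
    $count{$1}++ while $text =~ /\b(cat|dog)\b/g;
    return @count{qw(cat dog)};
}

my ($cats, $dogs) = count_cats_and_dogs_plain("cat dog cat, hotdog");
print "$cats $dogs\n";   # prints "2 1"
```

The "dog" inside "hotdog" is not counted because there is no word boundary before it.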

The worst thing is that it's _very_ easy to miss bugs like that. The code compiles fine, and appears to work just fine until some corner case is matched, and then it starts giving odd results that don't make sense until you notice this odd semantic.

Clearly, the more intuitive behaviour is for code within a regex, which is referencing lexical variables defined within a sub but outside the regex, to access the instances of those lexicals that exist at the time the regex is called, each time it is called.

Perl must already be passing some kind of static-chain for access to lexicals from evals in regexes, because they appear to be properly thread-specific in interpreter threads - each thread does access its own instance of the variable. So I'd guess the static-chain is simply not correctly prepared.

If the intuitive behaviour is too difficult to implement, or if it doesn't make sense after all (for example, I'm not sure what behaviour makes sense in conjunction with qr// and lexicals defined in different sub() scopes to the one where the regex is called), then a warning would be a very desirable addition:

A warning whenever code in a regex references a lexical that is named inside a sub() scope would be _exceedingly_ useful. It is almost always a program bug. Lexicals named outside all sub() scopes should not induce the warning.

Also, I didn't see anything in the FAQ about this.

Thanks, -- Jamie

Perl Info

```
Flags:
    category=core
    severity=medium

Site configuration information for perl v5.8.0:

Configured by bhcompile' cf_email='bhcompile at Wed Aug 13 11:45:59 EDT 2003.

Summary of my rderl (revision 5.0 version 8 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.4.21-1.1931.2.382.entsmp, archname=i386-linux-thread-multi
    uname='linux str'
    config_args='-des -Doptimize=-O2 -g -pipe -march=i386 -mcpu=i686 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dinstallprefix=/usr -Dprefix=/usr -Darchname=i386-linux -Dvendorprefix=/usr -Dsiteprefix=/usr -Dotherlibdirs=/usr/lib/perl5/5.8.0 -Duseshrplib -Dusethreads -Duseithreads -Duselargefiles -Dd_dosuid -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef' useithreads=define usemultiplicity=
    useperlio= d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=un uselongdouble=
    usemymalloc=, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
    optimize='',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -I/usr/local/include -I/usr/include/gdbm'
    ccversion='', gccversion='3.2.2 20030222 (Red Hat Linux 3.2.2-5)', gccosandvers='' gccversion='3.2.2 200302'
    intsize=r, longsize=r, ptrsize=5, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long' k', ivsize=4' ivtype='l, nvtype='double' o_nonbl', nvsize=, Off_t='', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc' l', ldflags =' -L/u'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lgdbm -ldb -ldl -lm -lpthread -lc -lcrypt -lutil
    perllibs=
    libc=/lib/libc-2.3.2.so, so=so, useshrplib=true, libperl=libper
    gnulibc_version='2.3.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so', d_dlsymun=undef, ccdlflags='-rdynamic -Wl,-rpath,/usr/lib/perl5/5.8.0/i386-linux-thread-multi/CORE'
    cccdlflags='-fPIC' ccdlflags='-rdynamic -Wl,-rpath,/usr/lib/perl5', lddlflags='s Unicode/Normalize XS/A'

Locally applied patches:
    MAINT18379

@INC for perl v5.8.0:
    /usr/lib/perl5/5.8.0/i386-linux-thread-multi
    /usr/lib/perl5/5.8.0
    /usr/lib/perl5/site_perl/5.8.0/i386-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.0
    /usr/lib/perl5/site_perl
    /usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.0
    /usr/lib/perl5/vendor_perl
    /usr/lib/perl5/5.8.0/i386-linux-thread-multi
    /usr/lib/perl5/5.8.0
    .

Environment for perl v5.8.0:
    HOME=/home/jamie
    LANG=en_GB.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/home/jamie/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash dlflags='-share (unset)
```

p5pRT commented 20 years ago

From @rgs

Not a bug. My reply is here (why didn't it get through RT?):

http://groups.google.com/groups?threadm=20040221225833.6d1dc31c.rgarciasuarez%40free.fr

p5pRT commented 20 years ago

@rgs - Status changed from 'new' to 'resolved'

p5pRT commented 20 years ago

From @jlokier

Rafael Garcia-Suarez via RT wrote:

> According to our records, your request regarding "Peculiar access to lexical vars from code in regexes" has been resolved.

Has it? I didn't see a response.

-- Jamie
