Regexp syntax check when variables contained with \Q...\E

p5pRT commented 8 years ago

Migrated from rt.perl.org#129803 (status was 'open')

Searchable as RT129803$

p5pRT commented 8 years ago

From @epa

Created by @epa

When possible regular expressions are compiled and checked at program compile time. This program gives an error even though the sub f() is never called:

sub f { /(/ }

But when the regexp includes variables\, this early checking cannot be done. The following code will only give an error if and when f() is called:

$x = 'a'; sub f { /($x/ }

In general\, this has to be so. Perl can't know what the possible values of $x will be at run time. If $x contains ')' then the regexp is well-formed.

However\, when building a regexp you may choose to use the \Q...\E mechanism as a safer alternative to raw string interpolation. Whatever appears inside \Q...\E is effectively matched as a literal string\, with regexp metacharacters like ) not having their usual effect. (As an implementation detail\, this might work by \Q...\E carefully escaping all such characters with backslashes before parsing the regexp\, but from the user's point of view you can consider it a way to match literal text contained in a scalar.)

If the variable appears in the regexp protected by \Q...\E then the syntax checks can still happen as normal.

sub f { /(\Q$x\E/ }

No matter the value of $x at run time\, this will never be a valid regexp; it will always have an unbalanced ( at the start. Perl could warn for this regexp at compile time just as it warns for /(/.

One possible way to do this would be to try making a munged version of the regexp\, replacing all \Q...\E fragments of the regexp with the dummy construct (?:). If the resulting munged regexp does not contain any variables\, then it can be used as a compile-time check; the munged regexp will be syntactically valid if and only if the original regexp is syntactically valid for all possible uses.

There might be a better way to do this checking depending on implementation details of the regexp engine.

* Further discussion

Not strictly part of the bug report\, but I would like to mention a possible avenue to improve compile-time checking of composed regexps when \Q...\E is not used.

The other major way to build up a regexp from variables is to include a variable which itself holds a compiled regular expression (qr/.../). Here\, too\, you can detect some regexp syntax errors for certain\, no matter what other regexp the variable may hold.

my $re = qr/whatever/; /($re/;

This is always going to be an invalid regexp as long as $re holds a compiled regular expression object. The compiled regexp included can never manage to close the open ( since it must itself have balanced parentheses. But Perl doesn't have a way to know that $re will always hold a compiled regexp at run time.

Suppose there were a way for the programmer to specify his or her intention that the variable will always hold a compiled regexp. Let's say that \I...\J is used for this (this syntax is arbitrary and only for the sake of discussion). So you would write

my $re_1 = qr/hel+o/; my $re_2 = qr/there/; my $re = qr/ \I$re_1\J \s+ \I$re_2\J /x;

At the point when $re is compiled\, and the values of $re_1 and $re_2 are included in the larger regexp string\, Perl would check at run time that they really were compiled regexp objects and not arbitrary scalars.

In itself\, this might catch a few bugs but is perhaps not worth the extra clunkiness. However\, it would also enable compile-time checking of regexp syntax in the same way as \Q...\J would.

my $re = qr/whatever/; /(\I$re\J/;

Here Perl can know that since $re is always going to be a compiled regexp\, the resulting regexp of /($re/ will always have unbalanced parens. The checking may have more subtleties than for \Q...\E since an included compiled regexp can change the number of capturing groups\, for example\, affecting the validity of later backreferences. Perhaps only the most basic syntax checks (like making sure parens are balanced) could be done at program compile time\, with more 'semantic' ones (like a backreference to a nonexistent group) done when the string interpolation has been done and the final regexp is compiled.

Finally\, only for those who enjoy the bondage and discipline\, an optional 'strict mode' would forbid arbitrary string interpolation in regexps\, requiring all variable uses to be explicitly labelled as either \Q...\E (for matching a literal string) or \I...\J (for including a precompiled regexp fragment). Of course\, there are cases where you want to build up a regexp from fragments which are not themselves valid regexps\, so strict mode would not be appropriate all the time and certainly not on by default. But it would be a useful way to eliminate regexp-injection bugs. (Taint mode helps too\, of course\, but is a run time check only.)

Perl Info

``` Flags: category=core severity=wishlist Site configuration information for perl 5.22.2: Configured by Red Hat, Inc. at Wed Aug 3 14:06:20 UTC 2016. Summary of my perl5 (revision 5 version 22 subversion 2) configuration: Platform: osname=linux, osvers=4.6.3-300.fc24.x86_64, archname=x86_64-linux-thread-multi uname='linux buildvm-08.phx2.fedoraproject.org 4.6.3-300.fc24.x86_64 #1 smp fri jun 24 20:52:41 utc 2016 x86_64 x86_64 x86_64 gnulinux ' config_args='-des -Doptimize=none -Dccflags=-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -Dldflags=-Wl,-z,relro -Dccdlflags=-Wl,--enable-new-dtags -Wl,-z,relro -Dlddlflags=-shared -Wl,-z,relro -Dshrpdir=/usr/lib64 -DDEBUGGING=-g -Dversion=5.22.2 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dprefix=/usr -Dvendorprefix=/usr -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl5 -Dsitearch=/usr/local/lib64/perl5 -Dprivlib=/usr/share/perl5 -Dvendorlib=/usr/share/perl5/vendor_perl -Darchlib=/usr/lib64/perl5 -Dvendorarch=/usr/lib64/perl5/vendor_perl -Darchname=x86_64-linux-thread-multi -Dlibpth=/usr/local/lib64 /lib64 /usr/lib64 -Duseshrplib -Dusethreads -Duseithreads -Dusedtrace=/usr/bin/dtrace -Duselargefiles -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl=n -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr -Dd_gethostent_r_proto -Ud_endhostent_r_proto -Ud_sethostent_r_proto -Ud_endprotoent_r_proto -Ud_setprotoent_r_proto -Ud_endservent_r_proto -Ud_setservent_r_proto -Dscriptdir=/usr/bin -Dusesitecustomize' hint=recommended, useposix=true, d_sigaction=define useithreads=define, usemultiplicity=define use64bitint=define, use64bitall=define, uselongdouble=undef usemymalloc=n, bincompat5005=undef Compiler: cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -fwrapv -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64', optimize=' -g', cppflags='-D_REENTRANT -D_GNU_SOURCE -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -fwrapv -fno-strict-aliasing -I/usr/local/include' ccversion='', gccversion='5.3.1 20160406 (Red Hat 5.3.1-6)', gccosandvers='' intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678, doublekind=3 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16, longdblkind=3 ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=8, prototype=define Linker and Libraries: ld='gcc', ldflags ='-Wl,-z,relro -fstack-protector-strong -L/usr/local/lib' libpth=/usr/local/lib64 /lib64 /usr/lib64 /usr/local/lib /usr/lib /lib/../lib64 /usr/lib/../lib64 /lib libs=-lpthread -lresolv -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat perllibs=-lpthread -lresolv -lnsl -ldl -lm -lcrypt -lutil -lc libc=libc-2.22.so, so=so, useshrplib=true, libperl=libperl.so gnulibc_version='2.22' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,--enable-new-dtags -Wl,-z,relro ' cccdlflags='-fPIC', lddlflags='-shared -Wl,-z,relro -L/usr/local/lib -fstack-protector-strong' Locally applied patches: Fedora Patch1: Removes date check, Fedora/RHEL specific Fedora Patch3: support for libdir64 Fedora Patch4: use libresolv instead of libbind Fedora Patch5: USE_MM_LD_RUN_PATH Fedora Patch6: Skip hostname tests, due to builders not being network capable Fedora Patch7: Dont run one io test due to random builder failures Fedora Patch15: Define SONAME for libperl.so Fedora Patch16: Install libperl.so to -Dshrpdir value Fedora Patch22: Document Math::BigInt::CalcEmu requires Math::BigInt (CPAN RT#85015) Fedora Patch26: Make *DBM_File desctructors thread-safe (RT#61912) Fedora Patch27: Make PadlistNAMES() lvalue again (CPAN RT#101063) Fedora Patch28: Make magic vtable writable as a work-around for Coro (CPAN RT#101063) Fedora Patch29: Fix duplicating PerlIO::encoding when spawning threads (RT#31923) Fedora Patch30: Do not let XSLoader load relative paths (CVE-2016-6185) Fedora Patch31: Avoid loading optional modules from default . (CVE-2016-1238) Fedora Patch200: Link XS modules to libperl.so with EU::CBuilder on Linux Fedora Patch201: Link XS modules to libperl.so with EU::MM on Linux @INC for perl 5.22.2: /home/eda/lib64/perl5/ /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 Environment for perl 5.22.2: HOME=/home/eda LANG=en_GB.UTF-8 LANGUAGE (unset) LC_COLLATE=C LC_CTYPE=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_NUMERIC=en_GB.UTF-8 LC_TIME=en_GB.UTF-8 LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/home/eda/bin:/home/eda/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/sbin:/home/eda/.local/bin:/home/eda/bin:/sbin:/usr/sbin PERL5LIB=/home/eda/lib64/perl5/ PERL_BADLANG (unset) SHELL=/bin/bash ```

p5pRT commented 8 years ago

From @abigail

On Tue\, Oct 04\, 2016 at 07:29:05AM -0700\, Ed Avis wrote:

# New Ticket Created by "Ed Avis" # Please include the string: [perl #129803] # in the subject line of all future correspondence about this issue. # \<URL: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=129803 >

This is a bug report for perl from eda@waniasset.com\, generated with the help of perlbug 1.40 running under perl 5.22.2.

----------------------------------------------------------------- [Please describe your issue here]

When possible regular expressions are compiled and checked at program compile time. This program gives an error even though the sub f() is never called:
sub f \{ /\(/ \}
But when the regexp includes variables\, this early checking cannot be done. The following code will only give an error if and when f() is called:
$x = 'a';
sub f \{ /\($x/ \}
In general\, this has to be so. Perl can't know what the possible values of $x will be at run time. If $x contains ')' then the regexp is well-formed.

However\, when building a regexp you may choose to use the \Q...\E mechanism as a safer alternative to raw string interpolation. Whatever appears inside \Q...\E is effectively matched as a literal string\, with regexp metacharacters like ) not having their usual effect. (As an implementation detail\, this might work by \Q...\E carefully escaping all such characters with backslashes before parsing the regexp\, but from the user's point of view you can consider it a way to match literal text contained in a scalar.)

If the variable appears in the regexp protected by \Q...\E then the syntax checks can still happen as normal.
sub f \{ /\(\\Q$x\\E/ \}
No matter the value of $x at run time\, this will never be a valid regexp; it will always have an unbalanced ( at the start. Perl could warn for this regexp at compile time just as it warns for /(/.

One possible way to do this would be to try making a munged version of the regexp\, replacing all \Q...\E fragments of the regexp with the dummy construct (?:). If the resulting munged regexp does not contain any variables\, then it can be used as a compile-time check; the munged regexp will be syntactically valid if and only if the original regexp is syntactically valid for all possible uses.

There might be a better way to do this checking depending on implementation details of the regexp engine.

I don't see much of a benefit of this. You'd be adding an additional compilation of a regexp at compile time\, but then you have to throw away the compiled result\, as it's a different pattern than what would really be there. The only win is that if you write an regexp with a syntax error\, you get an error earlier. But once you have fixed the the error\, you keep paying the speed penalty each time you run the program. And if you don't make a mistake in the first place\, you pay the speed penalty.

Abigail

p5pRT commented 8 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 8 years ago

From @epa

Yes\, the advantage is that of compile-time checking of regular expressions. Speaking personally\, I find this one of the advantages of Perl over other scripting languages\, where a regexp syntax error is caught only at run time. There is a small overhead associated with compiling the regular expression even though it may not in the end be used.

Perl / perl5