Open p5pRT opened 8 years ago
When possible, regular expressions are compiled and checked at program compile time. This program gives an error even though the sub f() is never called:
sub f { /(/ }
But when the regexp includes variables, this early checking cannot be done. The following code will only give an error if and when f() is called:
$x = 'a'; sub f { /($x/ }
In general, this has to be so. Perl can't know what the possible values of $x will be at run time. If $x contains ')' then the regexp is well-formed.
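To make the difference concrete, here is a minimal sketch that uses string eval to stand in for program compile time. The sub names f_const and f_interp and the result variables are only illustrative.

```perl
use strict;
use warnings;

# A constant pattern is compiled together with the surrounding code,
# so the syntax error is raised even though the sub is never called.
my $constant_ok = eval q{ sub f_const { /(/ } 1 } ? 1 : 0;
print "constant pattern: ", ($constant_ok ? "compiled" : "compile-time error"), "\n";

# A pattern that interpolates a variable is only compiled when it is used,
# so defining the sub succeeds.
our $x = 'a';
my $interp_ok = eval q{ sub f_interp { /($x/ } 1 } ? 1 : 0;
print "interpolated pattern: ", ($interp_ok ? "compiled" : "compile-time error"), "\n";

# The error surfaces only when the sub actually runs.
my $call_ok = eval { local $_ = 'a'; f_interp(); 1 } ? 1 : 0;
print "calling f_interp with \$x = 'a': ", ($call_ok ? "ok" : "run-time error"), "\n";

# If $x supplies the closing paren, the very same sub works.
$x = 'a)';
my $call_ok_closed = eval { local $_ = 'a'; f_interp(); 1 } ? 1 : 0;
print "calling f_interp with \$x = 'a)': ", ($call_ok_closed ? "ok" : "run-time error"), "\n";
```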
However, when building a regexp you may choose to use the \Q...\E mechanism as a safer alternative to raw string interpolation. Whatever appears inside \Q...\E is effectively matched as a literal string, with regexp metacharacters like ) not having their usual effect. (As an implementation detail, this might work by \Q...\E carefully escaping all such characters with backslashes before parsing the regexp, but from the user's point of view you can consider it a way to match literal text contained in a scalar.)
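A small sketch of that behaviour, using the built-in quotemeta function to show the escaped form. The sample string and variable names are illustrative only.

```perl
use strict;
use warnings;

my $x = 'a)b';

# quotemeta backslash-escapes every ASCII non-word character in the string.
my $escaped = quotemeta($x);
print "$escaped\n";    # prints a\)b

# Inside \Q...\E the ) held in $x is matched literally and closes no group.
my $literal_match = ('xa)by' =~ /\Q$x\E/) ? 1 : 0;
print "literal match: $literal_match\n";

# Interpolated raw, the same text yields an invalid pattern at run time.
my $raw_ok = eval { my $matched = 'xa)by' =~ /$x/; 1 } ? 1 : 0;
print "raw interpolation: ", ($raw_ok ? "ok" : "run-time error"), "\n";
```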
If the variable appears in the regexp protected by \Q...\E then the syntax checks can still happen as normal.
sub f { /(\Q$x\E/ }
No matter the value of $x at run time, this will never be a valid regexp; it will always have an unbalanced ( at the start. Perl could report an error for this regexp at compile time just as it does for /(/.
One possible way to do this would be to try making a munged version of the regexp, replacing all \Q...\E fragments of the regexp with the dummy construct (?:). If the resulting munged regexp does not contain any variables, then it can be used as a compile-time check; the munged regexp will be syntactically valid if and only if the original regexp is syntactically valid for all possible uses.
There might be a better way to do this checking depending on implementation details of the regexp engine.
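The munging idea can be sketched in user-level Perl, working on the source text of a pattern rather than inside the regexp compiler. The helper check_pattern_source is hypothetical, and its test for "still contains a variable" is a crude heuristic, so treat this as an illustration of the idea and not as how perl itself would implement it.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Perl. Takes the source text of a pattern
# (before interpolation) and returns:
#   1     if the munged pattern compiles,
#   0     if it can never compile,
#   undef if variables remain outside \Q...\E, so nothing can be concluded.
sub check_pattern_source {
    my ($source) = @_;

    # Replace each \Q...\E fragment (or \Q... running to the end) with (?:).
    (my $munged = $source) =~ s/\\Q.*?(?:\\E|\z)/(?:)/gs;

    # Crude heuristic (an assumption of this sketch): an unescaped $ or @
    # followed by an identifier character or { is taken to be a variable.
    return undef if $munged =~ /(?<!\\)[\$\@][A-Za-z_{]/;

    # Compile the munged pattern and trap any syntax error.
    my $ok = eval { my $compiled = qr/$munged/; 1 };
    return $ok ? 1 : 0;
}

# Single quotes keep \Q, \E and $x as literal source text.
for my $source ('(\Q$x\E', '(\Q$x\E)', '($x') {
    my $result = check_pattern_source($source);
    printf "%-12s %s\n", $source,
        !defined $result ? "cannot check" : $result ? "valid" : "never valid";
}
```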
* Further discussion
Not strictly part of the bug report, but I would like to mention a possible avenue to improve compile-time checking of composed regexps when \Q...\E is not used.
The other major way to build up a regexp from variables is to include a variable which itself holds a compiled regular expression (qr/.../). Here, too, you can detect some regexp syntax errors for certain, no matter what other regexp the variable may hold.
my $re = qr/whatever/; /($re/;
This is always going to be an invalid regexp as long as $re holds a compiled regular expression object. The compiled regexp included can never manage to close the open ( since it must itself have balanced parentheses. But Perl doesn't have a way to know that $re will always hold a compiled regexp at run time.
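As things stand, the error in such a pattern only shows up when the match is executed. A minimal sketch follows; the variable names are illustrative.

```perl
use strict;
use warnings;

my $re = qr/whatever/;

# The enclosing pattern is compiled only when the match is executed,
# so the unbalanced ( is reported at run time.
my $ok = eval { my $matched = 'whatever' =~ /($re/; 1 } ? 1 : 0;
my $error = $ok ? '' : $@;
print $ok ? "matched without error\n" : "run-time error: $error";

# The stringified form of a compiled regexp is a self-contained group,
# so it cannot supply the missing closing paren.
print "$re\n";
```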
Suppose there were a way for the programmer to specify his or her intention that the variable will always hold a compiled regexp. Let's say that \I...\J is used for this (this syntax is arbitrary and only for the sake of discussion). So you would write
my $re_1 = qr/hel+o/; my $re_2 = qr/there/; my $re = qr/ \I$re_1\J \s+ \I$re_2\J /x;
At the point when $re is compiled, and the values of $re_1 and $re_2 are included in the larger regexp string, Perl would check at run time that they really were compiled regexp objects and not arbitrary scalars.
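Since \I...\J does not exist, the run-time half of this check can only be approximated today with an ordinary function. In the sketch below, must_be_regexp is a hypothetical helper built on re::is_regexp; it is a stand-in for the proposed syntax, not an existing feature.

```perl
use strict;
use warnings;
use re ();

# Hypothetical stand-in for the proposed \I...\J: pass the value through
# only if it is a compiled regexp object, otherwise die.
sub must_be_regexp {
    my ($value) = @_;
    die "not a compiled regexp\n" unless re::is_regexp($value);
    return $value;
}

my $re_1 = qr/hel+o/;
my $re_2 = qr/there/;

# Check each fragment before it is included in the larger pattern.
my $part_1 = must_be_regexp($re_1);
my $part_2 = must_be_regexp($re_2);
my $re     = qr/ $part_1 \s+ $part_2 /x;

print "matched\n" if 'helllo   there' =~ $re;

# A plain string is rejected before it can reach the pattern.
my $plain_ok = eval { must_be_regexp('('); 1 } ? 1 : 0;
print "plain string: ", ($plain_ok ? "accepted" : "rejected"), "\n";
```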
In itself, this might catch a few bugs but is perhaps not worth the extra clunkiness. However, it would also enable compile-time checking of regexp syntax in the same way as \Q...\E would.
my $re = qr/whatever/; /(\I$re\J/;
Here Perl can know that since $re is always going to be a compiled regexp, the resulting regexp of /($re/ will always have unbalanced parens. The checking may have more subtleties than for \Q...\E since an included compiled regexp can change the number of capturing groups, for example, affecting the validity of later backreferences. Perhaps only the most basic syntax checks (like making sure parens are balanced) could be done at program compile time, with more 'semantic' ones (like a backreference to a nonexistent group) done when the string interpolation has been done and the final regexp is compiled.
Finally, only for those who enjoy the bondage and discipline, an optional 'strict mode' would forbid arbitrary string interpolation in regexps, requiring all variable uses to be explicitly labelled as either \Q...\E (for matching a literal string) or \I...\J (for including a precompiled regexp fragment). Of course, there are cases where you want to build up a regexp from fragments which are not themselves valid regexps, so strict mode would not be appropriate all the time and certainly not on by default. But it would be a useful way to eliminate regexp-injection bugs. (Taint mode helps too, of course, but is a run time check only.)
On Tue, Oct 04, 2016 at 07:29:05AM -0700, Ed Avis wrote:
# New Ticket Created by "Ed Avis"
# Please include the string: [perl #129803]
# in the subject line of all future correspondence about this issue.
# <URL: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=129803 >
This is a bug report for perl from eda@waniasset.com, generated with the help of perlbug 1.40 running under perl 5.22.2.
I don't see much of a benefit of this. You'd be adding an additional compilation of a regexp at compile time, but then you have to throw away the compiled result, as it's a different pattern than what would really be there. The only win is that if you write a regexp with a syntax error, you get an error earlier. But once you have fixed the error, you keep paying the speed penalty each time you run the program. And if you don't make a mistake in the first place, you pay the speed penalty.
Abigail
The RT System itself - Status changed from 'new' to 'open'
Yes, the advantage is that of compile-time checking of regular expressions. Speaking personally, I find this one of the advantages of Perl over other scripting languages, where a regexp syntax error is caught only at run time. There is a small overhead associated with compiling the regular expression even though it may not in the end be used.
Migrated from rt.perl.org#129803 (status was 'open')
Searchable as RT129803$