9fans / plan9port

Plan 9 from User Space
https://9fans.github.io/plan9port/

plumber: match variables past the first non-matching subexpression are left unset #563

Closed: igorburago closed this issue 1 year ago

igorburago commented 2 years ago

When processing a matches directive of a plumbing rule, plumber incorrectly assumes that once a parenthesized subexpression of the matched pattern turns out not to have matched, none of the subsequent ones have matched either, and it leaves their corresponding variables unset.

This erroneous assumption does not hold for patterns with captured alternations, such as ((A)|"(B)")(C). Exactly one of (A) and (B) is always left unmatched, so if the expression matches at all, regardless of whether the A or B branch of the leading alternation matched, C (and all subsequent parenthesized subexpressions, if present) will not be captured by plumber.

The same applies to expressions with the * and ? repetition operators. For example, matching the pattern (X)?(Y) against the string Y will not set any subexpression variables aside from $0: $1 is correctly empty, but $2 is erroneously left empty as well, although it should be Y.
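The failure mode described above can be modeled outside of plumber. The following Python sketch is not plumber's code: it uses Python's re module as a stand-in for libregexp, and the helper names capture_until_first_unset and capture_all are hypothetical. It only contrasts a loop that stops at the first unmatched subexpression with one that visits every subexpression.

```python
import re

def capture_until_first_unset(m):
    """Model of the reported behavior: stop at the first unmatched group."""
    out = [m.group(0)]
    for i in range(1, m.re.groups + 1):
        if m.group(i) is None:   # this group did not take part in the match
            break                # assumed bug: all later groups are skipped
        out.append(m.group(i))
    # Pad so that every skipped variable reads as an empty string.
    out += [""] * (m.re.groups + 1 - len(out))
    return out

def capture_all(m):
    """Expected behavior: visit every group; unmatched ones become empty."""
    return [m.group(0)] + [m.group(i) or "" for i in range(1, m.re.groups + 1)]

m = re.fullmatch(r'(X)?(Y)', 'Y')
print(capture_until_first_unset(m))  # ['Y', '', '']
print(capture_all(m))                # ['Y', '', 'Y']
```

In this model, the only difference between the two helpers is whether an unmatched group ends the loop or is merely recorded as empty.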

To reproduce, add the following rule (as the very first one):

kind is text
data matches '<((A)|"(B)")(C)>'
plumb start rc -c 'printf ''«%s»'' $* >/tmp/x' $0 $1 $2 $3 $4

When plumb '<AC>' or plumb '<"B"C>' is run, the output in /tmp/x will be «<AC>»«A»«A»«»«» and «<"B"C>»«"B"»«»«»«», respectively, whereas the correct output would be «<AC>»«A»«A»«»«C» and «<"B"C>»«"B"»«»«B»«C».
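As a cross-check of the expected values, the same pattern can be run through an independent regular expression engine. The sketch below uses Python's re module rather than libregexp, on the assumption that the two treat this particular pattern alike. The function name expected_output is hypothetical, and the formatting imitates the printf call in the rule above.

```python
import re

# The pattern from the reproduction rule, copied verbatim.
PATTERN = re.compile(r'<((A)|"(B)")(C)>')

def expected_output(text):
    m = PATTERN.search(text)
    # Unmatched groups are None in Python; the rule would pass "" instead.
    fields = [m.group(0)] + [g or "" for g in m.groups()]
    return "".join("«%s»" % f for f in fields)

print(expected_output('<AC>'))    # «<AC>»«A»«A»«»«C»
print(expected_output('<"B"C>'))  # «<"B"C>»«"B"»«»«B»«C»
```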

(Other users of the libregexp library in the tree seem to handle subexpression match capturing in patterns like these correctly.)