9fans / plan9port

Plan 9 from User Space
https://9fans.github.io/plan9port/

plumber: quantitative regexp constraints of the 'matches' verb are too limiting #565

Closed igorburago closed 1 year ago

igorburago commented 2 years ago

Recently, as I’ve been extending my plumbing ruleset, I started bumping into two constants bounding plumber’s regexp matching capacity via the matches verb:

  1. Only the first 10 parenthesized subexpressions have their matches stored for later substitution within the rule; the set of corresponding built-in variables is limited to $1, …, $9.

  2. Only 16 character classes are allowed in a single regexp. This cap is imposed by libregexp which is used for regexp matching.

Out of the two limits, the second one is more pressing, as it outright rejects any regexp with more character classes, while the first one still matches the whole pattern, limiting only the number of match references.

The severity of the second limit is exacerbated in practice by the lack of counted repetition operators in Plan 9’s regexp syntax. To match, say, an id of exactly 12 digits followed or preceded by something else, one has to spend 12 character classes on the id itself, leaving only 4 for the surrounding syntax, which might not be enough. In some cases the budget does not suffice even for the id alone if matching by length is necessary: a full Git commit hash, for example, is 40 hexadecimal digits long.
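The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration: the helper repeat and the constant CLASS_LIMIT are hypothetical names, and the value 16 is the cap cited in this issue.

```python
CLASS_LIMIT = 16  # per-regexp cap on character classes, as cited in this issue

def repeat(atom, n):
    """Spell out atom{n} by hand, since Plan 9 regexps lack counted repetition."""
    return atom * n

id_re = repeat('[0-9]', 12)        # an id of exactly 12 digits
hash_re = repeat('[0-9a-f]', 40)   # a full Git commit hash

for name, regexp in (('id', id_re), ('hash', hash_re)):
    used = regexp.count('[')       # each '[' opens one class in these patterns
    print(name, 'uses', used, 'classes;', CLASS_LIMIT - used, 'left')
```

The 12-digit id leaves 4 classes for everything else, while the 40-digit hash overshoots the cap by 24 on its own.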


To provide a practical motivation for increasing both of the caps in question, here is one real-world example from my plumbing file that exceeds these limits:

# File paths starting with shell parameter expansion:
# • ('$' env_var '/'),
# • ('"$' env_var '"/'),
# • ('${' env_var [(':-' | ':=' | ':?' | ':+') ...] '}/'), or
# • ('"${' env_var [(':-' | ':=' | ':?' | ':+') ...] '}"/').
type is text
data matches '('$sh_par_exp_bare_re'|"'$sh_par_exp_bare_re'"|'$sh_par_exp_braced_re'|"'$sh_par_exp_braced_re'")/('$file_path_re'('$addr_suffix_re')?)?'
plumb start /usr/bin/env bash -c 'x="${!2:?}"; [[ $x == /* ]] || x="$1/$x"; x="$x/$3"; if [[ -d $x ]]; then plumb -d edit "$x"; elif [[ -f $x ]]; then plumb -a "addr=$4" "$x"; fi' -- $wdir $2$3$4$6 $9 $11

This rule makes it possible to plumb directory and file paths (optionally with a trailing address) that start with a POSIX shell-style environment variable expansion. For completeness, here are the variables used:

sh_par_exp_bare_re = '\$([A-Za-z_][A-Za-z_0-9]*)'
sh_par_exp_braced_re = '\${([A-Za-z_][A-Za-z_0-9]*)(:?[+\-=?][^}]*)?}'

ext_alpha_cc = 'A-Za-z¡-�'
ext_alnum_cc = $ext_alpha_cc'0-9'
ext_word_cc = $ext_alnum_cc'_\-'

file_name_last_char_no_space_cc = $ext_word_cc'+'
file_name_no_space_cc = $file_name_last_char_no_space_cc'@.,'
file_path_re = '(['$file_name_no_space_cc' /]*['$file_name_last_char_no_space_cc'/])'

addr_elem_regexp_word_re = '\^?[A-Za-z0-9_\-]+'
addr_elem_re = '(\.|\$|#?[0-9]+|/'$addr_elem_regexp_word_re'/)'
addr_elem_last_re = '(\.|\$|#?[0-9]+|/'$addr_elem_regexp_word_re'/?)'
addr_suffix_re = ':(('$addr_elem_re'[,;+\-])*'$addr_elem_last_re')'

In total, this particular rule requires 19 character classes and 14 parenthesized subexpressions, using match references up to $11.
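These totals can be checked mechanically. The Python sketch below rebuilds the pattern from the variables above and counts its character classes and parenthesized subexpressions. The helper count_limits is hypothetical (it is not part of plumber or libregexp) and assumes that a backslash escapes the next character both inside and outside brackets. The non-ASCII range of ext_alpha_cc is left out, since it does not affect the counts.

```python
def count_limits(pattern):
    """Return (character classes, parenthesized subexpressions) in pattern."""
    classes = groups = 0
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == '\\':            # escaped character: skip the pair
            i += 2
            continue
        if c == '[':             # character class: scan to its closing ']'
            classes += 1
            i += 1
            while i < len(pattern) and pattern[i] != ']':
                i += 2 if pattern[i] == '\\' else 1
        elif c == '(':
            groups += 1
        i += 1
    return classes, groups

# Variables from the plumbing file (ASCII part of ext_alpha_cc only).
bare = r'\$([A-Za-z_][A-Za-z_0-9]*)'
braced = r'\${([A-Za-z_][A-Za-z_0-9]*)(:?[+\-=?][^}]*)?}'
ext_word_cc = r'A-Za-z0-9_\-'
last_char_cc = ext_word_cc + '+'
no_space_cc = last_char_cc + '@.,'
file_path = '([' + no_space_cc + ' /]*[' + last_char_cc + '/])'
word = r'\^?[A-Za-z0-9_\-]+'
addr_elem = r'(\.|\$|#?[0-9]+|/' + word + '/)'
addr_elem_last = r'(\.|\$|#?[0-9]+|/' + word + '/?)'
addr_suffix = ':((' + addr_elem + r'[,;+\-])*' + addr_elem_last + ')'

# The pattern of the "data matches" line, concatenated as in the rule.
rule_re = ('(' + bare + '|"' + bare + '"|' + braced + '|"' + braced + '")/('
           + file_path + '(' + addr_suffix + ')?)?')

print(count_limits(rule_re))   # prints (19, 14)
```

Counting left parentheses in order also confirms the references used in the rule: $2, $3, $4 and $6 are the variable names of the four alternatives, $9 is the file path, and $11 is the address.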

For another example, suppose we would like to extend the subset of sam address prefixes to be used with file paths. For that, let us consider using the above addr_suffix_re definition, but with addr_elem_regexp_word_re exchanged for addr_elem_regexp_re matching the longest string that does not contain any backslash-escaped slashes:

addr_elem_regexp_re = '[^\\/]*(\\[^/][^\\/]*)*(\\/[^\\/]*(\\[^/][^\\/]*)*)*'
addr_elem_re = '(\.|\$|#?[0-9]+|/'$addr_elem_regexp_re'/)'
addr_elem_last_re = '(\.|\$|#?[0-9]+|/'$addr_elem_regexp_re'/?)'

This alone would set us back 15 character classes and 10 captured subexpressions out of our total budget of 16 and 10, respectively.
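A quick way to confirm these figures is to strip backslash-escaped pairs from the assembled pattern and count what remains. The Python sketch below does that with the standard re module; the helper classes_and_groups is a hypothetical name, and the approach assumes that every backslash escapes the character after it.

```python
import re

def classes_and_groups(pattern):
    """Count character classes and parenthesized subexpressions in pattern."""
    plain = re.sub(r'\\.', '', pattern)            # drop backslash-escaped pairs
    classes = re.findall(r'\[[^\]]*\]', plain)     # bracketed character classes
    groups = re.sub(r'\[[^\]]*\]', '', plain).count('(')
    return len(classes), groups

addr_elem_regexp_re = r'[^\\/]*(\\[^/][^\\/]*)*(\\/[^\\/]*(\\[^/][^\\/]*)*)*'
addr_elem_re = r'(\.|\$|#?[0-9]+|/' + addr_elem_regexp_re + '/)'
addr_elem_last_re = r'(\.|\$|#?[0-9]+|/' + addr_elem_regexp_re + '/?)'
addr_suffix_re = ':((' + addr_elem_re + r'[,;+\-])*' + addr_elem_last_re + ')'

print(classes_and_groups(addr_elem_regexp_re))   # prints (6, 3)
print(classes_and_groups(addr_suffix_re))        # prints (15, 10)
```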

Currently, rules like these are either impossible or have to be split into multiple, almost identical rules, which creates a good deal of inconvenient repetition.


This issue is meant to discuss the possibility of increasing these limits and the options for doing so.

  1. As to limit 1, with very little code added, plumber can be extended to support multi-digit subexpression match variables—see #566. This change alone would allow plumber to capture up to 31 subexpressions (as libregexp’s NSUBEXP is 32).

  2. Limit 1 can then either be lifted completely by switching to a dynamically reallocated Resublist in libregexp, or simply increased. The latter option seems more pragmatic to me, with 100 being a good candidate for the new limit (so that subexpression references of up to two digits are supported).

  3. Similarly, limit 2 can either be lifted via dynamic allocation (the way it is done in sam) or increased to, say, 128 (let me know if there is a better candidate for this limit).

Personally, I think a simple increase is a fine approach for points 2 and 3. I can submit pull requests for either approach, though, unless the whole proposal for increasing the limits is considered unworthy; please let me know.
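Point 1 hinges on how a reference such as $11 is read. The Python sketch below contrasts single-digit and multi-digit parsing of $N references. It is only an illustration under the assumption that the longest run of digits is taken as the index; it is not the code from #566, and the function expand and the sample data are hypothetical.

```python
import re

def expand(template, matches, multi_digit=True):
    """Replace $N in template with matches[N]; unknown indices expand to ''."""
    ref = r'\$([0-9]+)' if multi_digit else r'\$([0-9])'

    def lookup(m):
        n = int(m.group(1))
        return matches[n] if n < len(matches) else ''

    return re.sub(ref, lookup, template)

# Stand-ins for the text matched by subexpressions 0..14.
matches = ['<%d>' % i for i in range(15)]

print(expand('$9 $11', matches, multi_digit=False))   # prints: <9> <1>1
print(expand('$9 $11', matches))                      # prints: <9> <11>
```

With single-digit parsing, $11 silently turns into the first subexpression followed by a literal 1, which is why a rule using $11 cannot work under the old scheme.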

igorburago commented 2 years ago

As a side note: neither of the two limitations is currently mentioned in the manual, as far as I can tell.

The plumb(7) page is vague about the first limit, sweeping it under the rug of an “etc.”:

$0 The text that matched the entire regular expression in a previous data matches rule. $1, $2, etc. refer to text matching the first, second, etc. parenthesized subexpression.

The second limit is more of an implementation detail, so I did not expect plumb(7) to cover it. I did find it a bit surprising, however, that regexp(3), which plumb(7) refers the reader to via regexp(7), does not point out that both the number of captured subexpressions and the total number of character classes allowed in a regexp are capped.

igorburago commented 1 year ago

@dancrossnyc, since you have merged #566, thus accepting the first of the three points concluding my proposal above, I would be interested to know your opinion on the remaining two.

dancrossnyc commented 1 year ago

Huh; I don't recall closing it. Maybe I fat-fingered that. Anyway, #611 is merged.