Open ghnp5 opened 4 years ago
@ghnp5, indeed, it might be a part of #208.
I think the optimize method may accept a perTransformOptions map, and each individual transform will be able to work with its specific options.
In particular, the caller may look like:
regexpTree.optimize(/\[quote\]/, {
  perTransformOptions: {
    charEscapeUnescape: {
      // Avoid rewriting /\[quote\]/ as /\[quote]/
      excludedChars: /[\[\]]/,
    },
    charClassToMeta: {
      // Avoid rewriting [0-9a-z] as [\da-z]
      excludedClasses: [/[0-9]/],
    },
  },
});
Then the corresponding transforms have to be updated to accept and handle those options. A caveat: such granular checks and exclusions may slow the transforms down.
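One way a transform might consume such options is sketched below. It assumes the proposed perTransformOptions shape from the example above. The helper name withExcludedChars and the stand-in naiveUnescape handler are hypothetical and are not part of regexp-tree; the Char node shape (kind, value, escaped) is assumed to match what the regexp-tree parser produces.

```javascript
// Hypothetical helper (not a regexp-tree API): wraps a transform handler so
// that its Char visitor skips characters the caller asked to leave alone.
function withExcludedChars(baseHandler, options = {}) {
  const { excludedChars } = options;
  if (!excludedChars) {
    return baseHandler; // nothing excluded, keep the transform as is
  }
  // Drop the g/y flags so repeated .test() calls do not depend on lastIndex.
  const excluded = new RegExp(
    excludedChars.source,
    excludedChars.flags.replace(/[gy]/g, '')
  );
  return {
    ...baseHandler,
    Char(path) {
      const { node } = path;
      if (node.kind === 'simple' && excluded.test(node.value)) {
        return; // excluded by the caller: leave the escape untouched
      }
      return baseHandler.Char.call(this, path);
    },
  };
}

// Stand-in for the real charEscapeUnescape logic: it simply drops the escape.
const naiveUnescape = {
  Char(path) {
    if (path.node.escaped) {
      delete path.node.escaped;
    }
  },
};

const perTransformOptions = {
  charEscapeUnescape: { excludedChars: /[\[\]]/ },
};
const handler = withExcludedChars(
  naiveUnescape,
  perTransformOptions.charEscapeUnescape
);

// Mock paths standing in for the NodePath objects a traversal would pass.
const bracket = { node: { type: 'Char', kind: 'simple', value: ']', escaped: true } };
const dash = { node: { type: 'Char', kind: 'simple', value: '-', escaped: true } };
handler.Char(bracket);
handler.Char(dash);
console.log(bracket.node.escaped, dash.node.escaped); // true undefined
```

The exclusion check runs once per Char node, which is where the slowdown mentioned above would come from; returning the base handler unchanged when no options are given keeps the default path free of that cost.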
I may look into this myself, and would also appreciate a PR if you get to it before I do.
In addition, to unblock yourself faster, you can write your own extra transform which translates \d back to [0-9].
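A minimal sketch of such an extra transform follows. The handler name metaDigitToRange and the two node-building helpers are made up for illustration. The node shapes and the path.replace() call are assumed to follow the regexp-tree AST format and NodePath API, so treat it as a starting point rather than a tested plugin. It only handles \d, not \D.

```javascript
// Builds a simple Char node; the field names are assumed to match the
// regexp-tree AST format.
const digitChar = (value) => ({
  type: 'Char',
  kind: 'simple',
  value,
  symbol: value,
  codePoint: value.codePointAt(0),
});

// Builds the 0-9 range used inside a character class.
const zeroToNine = () => ({
  type: 'ClassRange',
  from: digitChar('0'),
  to: digitChar('9'),
});

// Hypothetical custom handler: rewrites the \d meta character back to 0-9.
const metaDigitToRange = {
  Char(path) {
    const { node, parent } = path;
    if (node.kind !== 'meta' || node.value !== '\\d') {
      return;
    }
    if (parent && parent.type === 'CharacterClass') {
      // Already inside a class: [\da-z] -> [0-9a-z]
      path.replace(zeroToNine());
    } else {
      // Standalone: \d+ -> [0-9]+
      path.replace({ type: 'CharacterClass', expressions: [zeroToNine()] });
    }
  },
};

// Intended usage with regexp-tree installed, applied after optimize():
//   const regexpTree = require('regexp-tree');
//   const optimized = regexpTree.optimize(/[0-9]+[0-9a-z]/).toString();
//   const restored = regexpTree.transform(optimized, metaDigitToRange).toString();
```

Running it as a separate pass after optimize() keeps the rest of the optimizations and only undoes the digit rewrite.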
Also, the transformation from [0-9] to [\d] is not safe, as they aren't equivalent.
@b-fett is it Perl-specific or universal? We might need to start introducing an --unsafe or --safe parameter which will take care of specific regexp rules.
There are many cases. GPT says the following:
In most programming languages, the regular expression \d is equivalent to [0-9] and matches any single digit from 0 to 9. However, the behavior can change when dealing with Unicode characters in some languages. Here's a brief overview:
1. JavaScript: \d always matches only ASCII digits and is equivalent to [0-9], with or without the Unicode flag (u). To match digits from other scripts, use \p{Nd} together with the u flag.
2. Python: In Python 3, \d in a str pattern matches any Unicode decimal digit by default. With the re.ASCII flag, or in a bytes pattern, \d is equivalent to [0-9].
3. Java: By default, \d is equivalent to [0-9]. It matches digits from other scripts only when the UNICODE_CHARACTER_CLASS flag is set.
4. Ruby: \d matches only ASCII digits by default. Digits from other scripts can be matched with \p{Nd} or [[:digit:]].
5. Perl: \d matches any Unicode decimal digit unless the /a modifier is in effect, in which case it is equivalent to [0-9].
6. PHP: PHP's preg functions are Unicode-aware. \d will match any Unicode digit from any script when the u modifier is used. Without the modifier, \d is equivalent to [0-9].
In general, if you're dealing with a programming language or regular expression engine that supports Unicode, and you want to match only ASCII digits, you should use [0-9]. If you want to match any digit character, including digit characters from other languages, you should use \d.
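Since regexp-tree works on JavaScript regexps, the JavaScript case can be checked directly in Node.js. The snippet below is only a quick probe using one non-ASCII digit (ARABIC-INDIC DIGIT THREE, U+0663). It relies on ECMAScript defining \d as the ASCII digits regardless of the u flag, with \p{Nd} as the way to match decimal digits from any script.

```javascript
// "٣" is ARABIC-INDIC DIGIT THREE, a decimal digit outside the ASCII range.
const arabicIndicThree = '\u0663';

const results = {
  metaNoFlag: /^\d$/.test(arabicIndicThree),
  metaUnicodeFlag: /^\d$/u.test(arabicIndicThree),
  asciiRange: /^[0-9]$/u.test(arabicIndicThree),
  unicodeProperty: /^\p{Nd}$/u.test(arabicIndicThree),
};

console.log(results);
// { metaNoFlag: false, metaUnicodeFlag: false, asciiRange: false, unicodeProperty: true }
```

If this holds for the target runtime, rewriting [0-9] as \d does not change what a JavaScript regexp matches; the difference matters when the same pattern is reused in another engine.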
@b-fett thanks, I think we can disable this specific transform if the /u flag is set. Feel free to submit a PR.
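A rough sketch of what such a guard could look like is below. The wrapper name skipUnderUnicodeFlag is hypothetical. It also assumes that the regexp-tree traversal honours an optional shouldRun(ast) hook on a handler and that the root node exposes its flags as a string; both should be checked against the library source before relying on them.

```javascript
// Hypothetical wrapper: disables a transform for regexps with the u flag.
// Assumes the traversal calls shouldRun(ast) before running a handler.
function skipUnderUnicodeFlag(handler) {
  return {
    ...handler,
    shouldRun(ast) {
      if (ast.flags.includes('u')) {
        return false;
      }
      // Defer to the wrapped handler's own shouldRun, if it has one.
      return typeof handler.shouldRun === 'function'
        ? handler.shouldRun.call(this, ast)
        : true;
    },
  };
}

// Mock handler and root nodes, standing in for a real transform and AST.
const noop = { Char() {} };
const guarded = skipUnderUnicodeFlag(noop);
console.log(guarded.shouldRun({ type: 'RegExp', flags: 'u' })); // false
console.log(guarded.shouldRun({ type: 'RegExp', flags: 'gi' })); // true
```

Keeping the check in shouldRun means the flag is inspected once per regexp rather than once per node.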
Hello,
Is there a way to disable sub-rules of transforms, without disabling the whole transform?
For example, I don't want to disable all "charEscapeUnescape", but only the sub-rule for "]".
For readability, I don't want to optimize to /\[quote]/. But for all other unnecessary escapes, I want optimizations.
Also, similar for "charClassToMeta". Is there a way I could disable only the conversion from [0-9a-z] to [\da-z]?
Thank you!
EDIT - this might be related to https://github.com/DmitrySoshnikov/regexp-tree/issues/208