DmitrySoshnikov / regexp-tree

Regular expressions processor in JavaScript
MIT License

Blacklist sub-rules of Transforms #211

Open ghnp5 opened 4 years ago

ghnp5 commented 4 years ago

Hello,

Is there a way to disable sub-rules of transforms, without disabling the whole transform?

For example, I don't want to disable all "charEscapeUnescape", but only the sub-rule for "]".

/\[quote\]/

For readability, I don't want this optimized to /\[quote]/. But for all other unnecessary escapes, I do want the optimization.

Similarly, for "charClassToMeta": is there a way I could disable only the conversion from [0-9a-z] to [\da-z]?

Thank you!

EDIT - this might be related to https://github.com/DmitrySoshnikov/regexp-tree/issues/208

DmitrySoshnikov commented 4 years ago

@ghnp5, indeed, it might be a part of #208.

I think the optimize method could accept a perTransformOptions map, so that each individual transform can work with its own specific options.

In particular, the caller may look like:

regexpTree.optimize(/\[quote\]/, {
  perTransformOptions: {
    charEscapeUnescape: {
      // Avoid rewriting /\[quote\]/ as /\[quote]/
      excludedChars: /[\[\]]/,
    },

    charClassToMeta: {
      // Avoid rewriting [0-9a-z] as [\da-z]
      excludedClasses: [/[0-9]/],
    },
  },
});

Then the corresponding transforms have to be updated to accept and handle those options. A caveat: such granular checks and exclusions may slow down the transforms.
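
To make the idea concrete, here is a minimal sketch of how a transform might consult such a map before rewriting a character. The helper name `isExcluded` and the option shape are hypothetical: they follow the proposal above and are not part of regexp-tree's current API.

```javascript
// Hypothetical helper, NOT an existing regexp-tree function.
// Returns true when the proposed `perTransformOptions` map asks the
// transform named `transformName` to leave `char` untouched.
function isExcluded(transformName, char, options = {}) {
  const perTransform = options.perTransformOptions || {};
  const transformOptions = perTransform[transformName] || {};
  const { excludedChars } = transformOptions;

  // Only RegExp values are honoured in this sketch. Avoid the `g` flag on
  // `excludedChars`, since `test` would then carry state between calls.
  return excludedChars instanceof RegExp && excludedChars.test(char);
}

// Option shape assumed from the proposal above.
const options = {
  perTransformOptions: {
    charEscapeUnescape: { excludedChars: /[\[\]]/ },
  },
};

console.log(isExcluded('charEscapeUnescape', ']', options)); // excluded
console.log(isExcluded('charEscapeUnescape', '-', options)); // not excluded
```

A transform would call such a check once per candidate node, which is where the slowdown mentioned above would come from.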

I may take a look into this, and would also appreciate a PR in case you get to it before I do.

In addition, to unblock yourself faster, you can just write your own extra transform which translates \d back to [0-9].
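
A rough sketch of what such an extra transform could look like, written as a handler object in the shape that regexp-tree's transform API takes. The node shapes used here (a Char node with kind 'meta' for \d, ClassRange, CharacterClass) and the path properties (`node`, `parent`, `replace`) are assumptions based on regexp-tree's documented AST; the helper names `makeChar` and `makeDigitRange` are made up for illustration.

```javascript
// Builds a simple Char node (shape assumed from regexp-tree's AST docs).
function makeChar(symbol) {
  return {
    type: 'Char',
    value: symbol,
    kind: 'simple',
    symbol,
    codePoint: symbol.codePointAt(0),
  };
}

// Builds the range 0-9 as a ClassRange node.
function makeDigitRange() {
  return { type: 'ClassRange', from: makeChar('0'), to: makeChar('9') };
}

// Handler that turns the \d meta character back into an explicit range.
const digitMetaToRange = {
  Char(path) {
    const { node, parent } = path;
    if (node.kind !== 'meta' || node.value !== '\\d') {
      return; // leave every other character alone
    }
    if (parent && parent.type === 'CharacterClass') {
      // Already inside a class: [\da-z] becomes [0-9a-z]
      path.replace(makeDigitRange());
    } else {
      // Standalone: \d becomes [0-9]
      path.replace({
        type: 'CharacterClass',
        expressions: [makeDigitRange()],
      });
    }
  },
};
```

With regexp-tree installed, the handler would presumably be run over the optimizer's output, something like `regexpTree.transform(optimized.toString(), digitMetaToRange)`; that call chain is not exercised here.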

b-fett commented 1 year ago

Also, the transformation from [0-9] to [\d] is not safe, as the two aren't equivalent.

DmitrySoshnikov commented 1 year ago

@b-fett is it Perl-specific or universal? We might need to introduce an --unsafe or --safe parameter which will take care of specific regexp rules.

b-fett commented 1 year ago

there are many cases

GPT says the following:

In most programming languages, the regular expression \d is equivalent to [0-9] and matches any single digit from 0 to 9. However, the behavior can change when dealing with Unicode characters in some languages. Here's a brief overview:

1. JavaScript: As mentioned earlier, when the Unicode flag (u) is used, \d can match any character that's considered a digit in the Unicode standard, which includes digit characters from other languages. [0-9] will only match ASCII digits.
2. Python: Python's re module has a UNICODE flag. When this flag is set, \d will match any Unicode digit from any script. Without the flag, \d is equivalent to [0-9].
3. Java: In Java, \d matches any digit from any script (not just ASCII), because Java regular expressions are Unicode-based by default. [0-9] will only match ASCII digits.
4. Ruby: Ruby's regular expressions are also Unicode-based by default, so \d will match any Unicode digit from any script. [0-9] will only match ASCII digits.
5. Perl: Perl's behavior is similar to Python's. \d will match any Unicode digit from any script when the use utf8; directive is in effect. Without it, \d is equivalent to [0-9].
6. PHP: PHP's preg functions are Unicode-aware. \d will match any Unicode digit from any script when the u modifier is used. Without the modifier, \d is equivalent to [0-9].

In general, if you're dealing with a programming language or regular expression engine that supports Unicode, and you want to match only ASCII digits, you should use [0-9]. If you want to match any digit character, including digit characters from other languages, you should use \d.

DmitrySoshnikov commented 1 year ago

@b-fett thanks, I think we can disable this specific transform if the /u flag is set. Feel free to submit a PR.
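
Before gating the transform on the flag, it may be worth checking what \d actually matches with and without u in the engine at hand. A minimal sketch, assuming an ECMAScript-compliant engine such as a recent Node.js, where to my knowledge the specification defines \d as exactly [0-9] regardless of the flag. The sample character and the `results` name are only for illustration.

```javascript
// A non-ASCII decimal digit: ARABIC-INDIC DIGIT THREE (U+0663).
const arabicIndicThree = '\u0663';

const results = {
  // \d without the u flag
  plain: /\d/.test(arabicIndicThree),
  // \d with the u flag
  unicodeFlag: /\d/u.test(arabicIndicThree),
  // Explicit Unicode property escape for decimal digits (needs the u flag)
  unicodeProperty: /\p{Nd}/u.test(arabicIndicThree),
  // Sanity check: an ASCII digit under the u flag
  asciiDigit: /\d/u.test('3'),
};

console.log(results);
```

If `plain` and `unicodeFlag` both come out false while `unicodeProperty` is true, then rewriting [0-9] as \d does not change matching in JavaScript, and the safety concern applies mainly to other regex flavors.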