Charcoal-SE / SmokeDetector

Headless chatbot that detects spam and posts links to it in chatrooms for quick deletion.
https://metasmoke.erwaysoftware.com
Apache License 2.0

Implement PCRE style in-regex comments; e.g. (?#comment) #2404

Closed makyen closed 6 years ago

makyen commented 6 years ago

The Python regex implementation we use does not appear to implement any method of in-regex comments that would work in the watchlist and blacklists.[1] It would be beneficial for us to be able to include comments in at least our watchlist and blacklist entries, and potentially in the other regexes that we use in findspam.py. PCRE implements in-regex comments with the syntax (?#comment).

It would be relatively easy for us to implement support for PCRE style regex comments. These could be implemented by just removing, from the strings we convert to regexes, any content which matches the following regex:[2] \(\?#(?<!(?:[^\\]|^)(?:\\\\)*\\\(\?#)[^)]*\)

This substitution could be performed at one of the following points (listed in order of increasing generality):

  1. For watchlist and blacklists only: when we read the watchlist and blacklist lines from the files
  2. All 'regex' detections: just prior to using regex.compile() on the text provided in all the 'regex' detections, or
  3. All regexes: as a wrapper to regex.compile().
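
A minimal sketch of such a substitution, assuming the standard-library re module only. It does not use the look-behind regex proposed above; instead it consumes backslash-escaped pairs left to right, so an escaped \( is never mistaken for the start of a comment. The helper name strip_regex_comments is hypothetical and is not part of SmokeDetector. Like the proposed regex, it does not treat (?# inside a character class specially.

```python
import re

# Hypothetical helper for illustration only; not SmokeDetector code.
# Alternative 1 captures any backslash-escaped character so it is kept as-is.
# Alternative 2 matches a PCRE-style comment, which ends at the first ")".
_ESCAPE_OR_COMMENT = re.compile(r"(\\.)|\(\?#[^)]*\)", re.DOTALL)


def strip_regex_comments(pattern: str) -> str:
    """Return pattern with every unescaped (?#...) comment removed."""
    # Keep escaped pairs (group 1); replace comments with the empty string.
    return _ESCAPE_OR_COMMENT.sub(lambda m: m.group(1) or "", pattern)


stripped = strip_regex_comments(r"foo(?#a comment)bar")
```

The same function could be called at any of the three points listed above, for example on each line as the watchlist and blacklist files are read.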

[1] There is the possibility of "Verbose" regexes using the X flag, which I assume is also available in the regex module we're using. However, using these would not address having comments in the watchlist and blacklists.
[2] That regex is untested, as it relies on variable length look-behinds for which I don't have a simulator/tester. The regex \(\?#(?<!^\\\(\?#)(?<![^\\]\\\(\?#)(?<!\\\\\\\(\?#)[^)]*\) is tested and correctly matches, or declines to match, with up to 3 \ escapes preceding the (?#comment).

quartata commented 6 years ago

it would be a lot easier just to use python-pcre

iBug commented 6 years ago

You can just blacklist/watch something that starts with a hash.

iBug commented 6 years ago

oops, accidentally tapped the wrong button, please ignore

makyen commented 6 years ago

@quartata Assuming that python-pcre does implement just PCRE, it's unlikely that it would be easier to use it. We currently use capabilities (e.g. variable length look-behind) which are not part of PCRE. If it was me, I'd much rather implement a single regex-replace (which is all that implementing this requires) than take on the known and unknown issues of moving to a different regex implementation.

makyen commented 6 years ago

Prior to writing this RFE, I had found non-official documentation which said these are not implemented. In addition, the official documentation I looked in didn't mention them, while mentioning other types of comments ("verbose" regular expressions).

However, having looked in the source code for regex, it does appear that this style of comment is already implemented as a standard part of both the re and regex implementations. After finding it in the source, I also found it in the re documentation.
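
A quick check of that native support, sketched with the standard-library re module (the sample patterns and strings are made up for illustration):

```python
import re

# The (?#...) group is ignored when compiling, so this behaves like r"spam\d+".
commented = re.compile(r"spam(?#literal word, then digits)\d+")
plain = re.compile(r"spam\d+")

matches_commented = commented.fullmatch("spam123") is not None
matches_plain = plain.fullmatch("spam123") is not None

# The comment text itself is not part of what the pattern matches.
matches_comment_text = commented.fullmatch("spamliteral123") is not None

# A fully escaped sequence is still matched literally, not treated as a comment.
literal = re.compile(r"a\(\?#x\)")
matches_literal = literal.fullmatch("a(?#x)") is not None
```

Note that the comment ends at the first ")", so the comment text itself cannot contain a closing parenthesis.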

So, there's no need for this RFE, as it's already natively supported. Sorry to waste everyone's time.