Closed makyen closed 6 years ago
it would be a lot easier just to use python-pcre
You can just blacklist/watch something that starts with a hash.
oops, accidentally tapped the wrong button, please ignore
@quartata Assuming that python-pcre does implement just PCRE, it's unlikely that it would be easier to use it. We currently use capabilities (e.g. variable length look-behind) which are not part of PCRE. If it was me, I'd much rather implement a single regex-replace (which is all that implementing this requires) than take on the known and unknown issues of moving to a different regex implementation.
Prior to writing this RFE, I had found non-official documentation which said these are not implemented. In addition, the official documentation I looked in didn't mention them, while mentioning other types of comments ("verbose" regular expressions).
However, having looked in the source code for `regex`, it does appear that this style of comment is already implemented as a standard part of both the `re` and `regex` implementations. After finding it in the source, I also found it in the `re` documentation.
So, there's no need for this RFE, as it's already natively supported. So, sorry to waste everyone's time.
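As a quick check of the native support described above, here is a minimal sketch using only the standard `re` module. The pattern and sample text are made up for illustration and are not taken from the actual watchlist or blacklists.

```python
import re

# (?#...) is an inline comment in Python's re syntax: the contents of the
# parentheses are ignored when the pattern is compiled.
# Both patterns below are hypothetical examples.
commented = re.compile(r"foo(?#why this entry exists)\s+bar")
plain = re.compile(r"foo\s+bar")

sample = "some foo   bar text"

# Both patterns should find exactly the same text.
commented_match = commented.search(sample)
plain_match = plain.search(sample)
print(commented_match.group(0) == plain_match.group(0))
```

Note that the comment text itself cannot contain a `)`, since the first closing parenthesis ends the comment.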
The Python regex implementation we use does not appear to implement any method of having in-regex-text comments which would work in the watchlist and blacklists.[1] It would be beneficial for us to be able to include comments in at least our watchlist and blacklist entries, and potentially the other regexes that we use in findspam.py. PCRE implements in-regex comments using comments like `(?#comment)`.

It would be relatively easy for us to implement support for PCRE-style regex comments. These could be implemented by just removing, from the strings we convert to regexes, any content which matches the regex `\(\?#(?<!(?:[^\\]|^)(?:\\\\)*\\\(\?#)[^)]*\)`.[2]

This substitution could be performed at one of the following points (listed in order of increasing generality):
- `'regex'` detections: just prior to using `regex.compile()` on the text provided in all the `'regex'` detections, or
- `regex.compile()`.

[1] Comments are available in "verbose" regular expressions via the `X` flag, which I assume is also available in the `regex` module we're using. However, using these would not address having comments in the watchlist and blacklists.

[2] `\(\?#(?<!^\\\(\?#)(?<![^\\]\\\(\?#)(?<!\\\\\\\(\?#)[^)]*\)` is tested and correctly matches, or not, for up to 3 `\` escapes prior to the `(?#comment)`.
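
The substitution proposed here can be sketched with the standard `re` module, using the fixed-width pattern from the footnote (the variable-length look-behind version in the main text would need the third-party `regex` module, since `re` only accepts fixed-width look-behinds). The function name `strip_regex_comments` is hypothetical and is not an existing helper in the codebase.

```python
import re

# Fixed-width pattern quoted in the footnote. Per the text, it is only
# expected to behave correctly for up to 3 backslashes before "(?#".
COMMENT_RE = re.compile(
    r"\(\?#(?<!^\\\(\?#)(?<![^\\]\\\(\?#)(?<!\\\\\\\(\?#)[^)]*\)"
)


def strip_regex_comments(pattern_text):
    """Remove unescaped (?#...) comments from a regex source string.

    Hypothetical helper: a sketch of the single regex-replace step,
    meant to run just before the text is compiled.
    """
    return COMMENT_RE.sub("", pattern_text)


# An unescaped comment is removed.
no_escape = strip_regex_comments(r"foo(?#note)bar")
# One backslash escapes the "(", so this is not a comment and is kept.
one_escape = strip_regex_comments(r"foo\(?#note)bar")
# Two backslashes are an escaped backslash, so the comment is removed.
two_escapes = strip_regex_comments(r"a\\(?#note)b")
# Three backslashes: escaped backslash plus escaped "(", so it is kept.
three_escapes = strip_regex_comments(r"a\\\(?#note)b")

# The cleaned text can then be compiled as usual.
compiled = re.compile(no_escape)
```

Doing the replacement on the raw text, before compilation, is what would let the same step apply to watchlist and blacklist entries as well as to patterns written directly in code.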