Charcoal-SE / SmokeDetector

Headless chatbot that detects spam and posts links to it in chatrooms for quick deletion.
https://metasmoke.erwaysoftware.com
Apache License 2.0

Should blacklisted_websites be anchored? #2796

Open tripleee opened 5 years ago

tripleee commented 5 years ago

I thought this was a regression, but it looks (from going back a couple of years in Git history) as though the regex for blacklisted websites was in fact never anchored.

We have a false positive where ello.co matches trello.com. Fixing that in isolation would of course be trivial; but do we really have any cases where the absence of anchoring is actually a feature, or is this a bug which should have been fixed long ago?
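The FP is easy to reproduce: an unanchored `re.search` treats the blacklist entry as a substring pattern. A minimal sketch (the `anchored` variant here is a hypothetical illustration of the fix, not SmokeDetector's actual anchor):

```python
import re

# The blacklist entry, compiled unanchored as the list stores it.
pattern = re.compile(r"ello\.co")

# Unanchored, it matches inside an unrelated domain -> false positive.
assert pattern.search("trello.com")

# A hypothetical anchored variant refuses the substring hit but still
# matches the intended domain on its own.
anchored = re.compile(r"(?<![\w.-])ello\.co(?![\w-])")
assert not anchored.search("trello.com")
assert anchored.search("ello.co")
```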

ArtOfCode- commented 5 years ago

Sounds buggish to me

makyen commented 5 years ago

I'd also looked into this, and found the same thing. My opinion is that these definitely should be anchored. Not anchoring them results in false positives and is contrary to most people's expectations when moving something from the watchlist to the website blacklist. If a particular regex is intended to be matched unanchored, then it should be specifically constructed to do so, rather than requiring every regex in the list to carry its own individual anchors.

My impression from previously looking at the regexes was that a few of them rely on the fact that the list is not anchored, but that the majority were written with the expectation that it is, and occasionally produce FPs as a result. Unfortunately, the only way I've seen to accurately determine whether each individual regex is intended to be anchored or unanchored would be to run each one through an MS search both with and without whichever anchor we choose, and compare the number of TPs and FPs produced in each case.

Note: the anchor needs to be more complex than just \b, due to the validity of - in domain names and the assumption that a website regex should, by default, match the full domain up to the TLD rather than only a subdomain (i.e. regexes intended to match only a subdomain should have to be specifically constructed to do so). The need for anchors different from those used for the watchlist/blacklist keywords argues that we should have a watchlist specifically for websites. Doing so would make it easier to accurately determine the effectiveness of a watched domain.
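To see why plain \b is too weak: "-" is a non-word character, so \b happily declares a boundary in the middle of a hyphenated domain, and a \b after the TLD still allows a match against a longer domain. A sketch of both failure modes, with a stricter bookend illustrating the idea (this is an assumption-laden simplification, not the exact anchor SmokeDetector would use):

```python
import re

# "-" is a non-word char, so \b puts a boundary after "my-":
assert re.search(r"\bsite\.com\b", "my-site.com")      # FP on a different domain

# \b also holds between "com" and ".", so this matches a mere prefix:
assert re.search(r"\bfoo\.com\b", "foo.com.example")   # FP on a longer domain

# Stricter bookends: reject a preceding label character or hyphen,
# and reject a trailing ".<label>" or "-<label>".
strict = re.compile(r"(?<![\w-])foo\.com(?![.-]\w)")
assert strict.search("foo.com")
assert not strict.search("my-foo.com")
assert not strict.search("foo.com.example")
```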

makyen commented 5 years ago

Currently, for those website blacklist items which I see detect something they aren't intended to detect, I'm using bookends of:

\b(?<![^\W_]-)foobar\.com(?![.-][^\W_])\b
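Reading the bookends apart: the lookbehind `(?<![^\W_]-)` rejects a preceding alphanumeric-plus-hyphen (so my-foobar.com is a different domain, not a match), the lookahead `(?![.-][^\W_])` rejects a trailing `.label` or `-label` (so foobar.com.br is out), and the final \b catches the case where the TLD merely prefixes a longer one. A quick demonstration with the placeholder domain from the pattern:

```python
import re

# makyen's bookends around the placeholder domain foobar.com:
pattern = re.compile(r"\b(?<![^\W_]-)foobar\.com(?![.-][^\W_])\b")

assert pattern.search("foobar.com")            # the domain itself
assert pattern.search("http://foobar.com/x")   # inside a URL
assert not pattern.search("my-foobar.com")     # lookbehind rejects the hyphenated label
assert not pattern.search("foobar.com.br")     # lookahead rejects the extra label
assert not pattern.search("foobar.community")  # trailing \b rejects the longer TLD
```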

stale[bot] commented 5 years ago

This issue has been closed because it has had no recent activity. If this is still important, please add another comment and find someone with write permissions to reopen the issue. Thank you for your contributions.