Charcoal-SE / SmokeDetector

Headless chatbot that detects spam and posts links to it in chatrooms for quick deletion.
https://metasmoke.erwaysoftware.com
Apache License 2.0

findspam.py: refactor and unify link extraction (and perhaps overall post handling) #2500

Open tripleee opened 5 years ago

tripleee commented 5 years ago

There are multiple overlapping and sometimes conflicting attempts to enumerate all the links in a post in findspam.py. See below for a sampling.

We should unify these, and ideally reduce the number of times we iterate over the message text looking for more or less the same information.

An object-oriented approach to the entire problem seems like a natural, if somewhat involved, solution. Instead of scanning the raw text of the post over and over, extract the links once, store them on the _Post object we already have, and then use that object's methods to retrieve the links you want.

This should also make it easier to share information about a post's features between the different methods in findspam.py which look for distinct but related features in a post.
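
To make the idea concrete, here is a rough sketch of what "extract once, query many times" could look like. Everything in it is hypothetical: the class name `LinkedPost`, its attributes, and the simplistic URL regex are illustration only, not the existing `_Post` API or any pattern currently used in findspam.py. It assumes Python 3.8+ for `functools.cached_property`.

```python
import re
from functools import cached_property
from urllib.parse import urlsplit

# Deliberately naive URL pattern, for illustration only (an assumption,
# not the regex findspam.py actually uses).
_URL_RE = re.compile(r"""https?://[^\s<>"']+""", re.IGNORECASE)


class LinkedPost:
    """Hypothetical wrapper that scans a post body for links only once."""

    def __init__(self, body):
        self.body = body
        self.scan_count = 0  # how many times the raw text was scanned

    @cached_property
    def links(self):
        # Runs on first access; the result is cached on the instance.
        self.scan_count += 1
        found = []
        for url in _URL_RE.findall(self.body):
            url = url.rstrip(".,;:!?)")  # drop trailing punctuation
            if url not in found:         # de-duplicate, keep first-seen order
                found.append(url)
        return tuple(found)

    @cached_property
    def hostnames(self):
        # Unique hostnames in first-seen order, derived from the cached links.
        hosts = (urlsplit(u).hostname or "" for u in self.links)
        return tuple(dict.fromkeys(hosts))

    def links_on(self, domain):
        # Links whose hostname is the given domain or one of its subdomains.
        domain = domain.lower()
        result = []
        for url in self.links:
            host = urlsplit(url).hostname or ""
            if host == domain or host.endswith("." + domain):
                result.append(url)
        return result


post = LinkedPost(
    '<p>Buy at <a href="http://spam.example.com/x">http://spam.example.com/x</a>'
    " or https://good.example.org/page.</p>"
)
print(post.links)
print(post.hostnames)
print(post.links_on("example.com"))
print(post.scan_count)
```

Individual checks would then ask the post object for `links`, `hostnames`, or `links_on(...)` rather than each running their own regex over the body, so the raw text is scanned a single time no matter how many checks look at links.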

Just to give you an idea of the scope of the problem, here are a few of the methods which attempt to analyze links.

stale[bot] commented 4 years ago

This issue has been closed because it has had no recent activity. If this is still important, please add another comment and find someone with write permissions to reopen the issue. Thank you for your contributions.