Charcoal-SE / SmokeDetector

Headless chatbot that detects spam and posts links to it in chatrooms for quick deletion.
https://metasmoke.erwaysoftware.com
Apache License 2.0
474 stars, 182 forks

Expand scope of whitelisting to all domain lookups #3006

Closed tripleee closed 4 years ago

tripleee commented 5 years ago

Is your feature request related to a problem? Please describe.

There are a number of domains which routinely trigger FPs (false positives) because some of the watches are very broad. We want to be able to exclude well-known good sites from these broad watches in order to improve precision and reduce noise.

Describe the solution you'd like

bertieb implemented whitelisting for ASN checks in https://github.com/Charcoal-SE/SmokeDetector/pull/2664 and I was thinking already at the time that this should be refactored to govern all domain name checks.

Describe alternatives you've considered

Perhaps this should be coupled with a broader review of FPs so we can disable entire reasons (e.g. individual ASNs which produce too many FPs?) but let's keep this focused on the technical implementation.

Additional context

This has been raised in chat repeatedly over the last couple of weeks. I don't think it should be hard to do.

tripleee commented 5 years ago

Can't assign to @bertiebaggio explicitly it seems, but he was volunteering to look into this. https://chat.stackexchange.com/transcript/message/50359451#50359451

tripleee commented 5 years ago

Tangentially related perhaps: https://github.com/Charcoal-SE/SmokeDetector/pull/1630

bertiebaggio commented 5 years ago

Thanks for the ping :smile:

Discussion of a more general whitelist came up when considering the ASN whitelist; @makyen's thoughts seem relevant here:

I agree that a full implementation of whitelisting would be beneficial, but then we're talking about affecting lots of different detection reasons. There are also times when we want different whitelists for different detections, and to not share the list, or at least not share some entries between detections. A full implementation gets complex.

Do we have a few representative examples of things we'd like to exclude? I've been away from the Smokey coalface due to a job application recently so have missed some of the chat around this.
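To make the quoted concern concrete, here is a minimal sketch of one way to keep some whitelist entries shared across detections and others private to a single detection. Every name in it (`SHARED_WHITELIST`, `PER_REASON_WHITELIST`, `whitelist_for`, the reason keys and the `.example` domains) is hypothetical and chosen for illustration; none of it is taken from SmokeDetector.

```python
# Hypothetical sketch: one shared whitelist plus per-detection additions.
# None of these names or domains come from the SmokeDetector codebase.

# Entries every detection reason should ignore.
SHARED_WHITELIST = frozenset({"shared.example"})

# Entries that only apply to one detection reason.
PER_REASON_WHITELIST = {
    "asn": frozenset({"asn-only.example"}),
}


def whitelist_for(reason):
    """Return the effective whitelist for a single detection reason."""
    return SHARED_WHITELIST | PER_REASON_WHITELIST.get(reason, frozenset())
```

With this layout, a domain listed only under `"asn"` is still checked by every other detection, which is the kind of non-sharing the quote asks for.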

tripleee commented 5 years ago

Mithrandir pointed out a few in chat last week; searching for where I mentioned "bertieb" should be a quick shortcut, or I can try to provide links tomorrow. Glorfindel mentioned one today, I think xda-develop.com or similar. A search in the FPs would probably be more methodologically sound, similar to what I did for reviewing ASNs today (I think #3007).

ArtOfCode- commented 5 years ago

@bertiebaggio check your inbox for an org invite - that should make it possible to actually assign you here.

angussidney commented 5 years ago

Related (possibly duplicate?): https://github.com/Charcoal-SE/SmokeDetector/issues/490

bertiebaggio commented 5 years ago

tripleee: Thanks, I'll have a look through chat history

Art: done, thanks!

tripleee commented 5 years ago

Pling, any progress?

tripleee commented 5 years ago

@machavity mentions pub.dev: https://chat.stackexchange.com/transcript/message/51136860#51136860

stale[bot] commented 4 years ago

This issue has been closed because it has had no recent activity. If this is still important, please add another comment and find someone with write permissions to reopen the issue. Thank you for your contributions.

ArtOfCode- commented 4 years ago

As of ce83f319abed51e6d93ce4405ce0be25164603ef, I've added an is_website_whitelisted helper method, and used it in a few checks in findspam.py (often through the is_whitelisted_website method that was already in there - though that only checked a small number of regexes).

The new helper method feeds from the metasmoke API: any domain that's tagged with whitelisted will be excluded from Smokey's domain checks.

We can also add the helper to more findspam checks if we think it's necessary.
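As a rough illustration of the idea (not the code in the commit above), a whitelist check of this kind could look like the sketch below. It assumes the whitelisted domains have already been fetched and are held in a plain Python set; the function names `normalize_domain`, `is_domain_whitelisted` and `filter_domains` are hypothetical, and matching parent domains is a design assumption rather than documented behaviour.

```python
# Hypothetical sketch of a domain whitelist check. The whitelist is passed in
# as a set; how it is fetched from metasmoke is deliberately left out.

def normalize_domain(host):
    """Lower-case a hostname and drop surrounding whitespace and a trailing dot."""
    return host.strip().lower().rstrip(".")


def is_domain_whitelisted(host, whitelist):
    """Return True if host, or any parent domain of it, is in the whitelist.

    Bare top-level labels such as "com" are never tested, so a stray entry
    like "com" cannot whitelist everything.
    """
    parts = normalize_domain(host).split(".")
    # a.b.example.com -> test a.b.example.com, b.example.com, example.com
    for i in range(len(parts) - 1):
        if ".".join(parts[i:]) in whitelist:
            return True
    return False


def filter_domains(hosts, whitelist):
    """Keep only the hosts that a domain check should still look at."""
    return [h for h in hosts if not is_domain_whitelisted(h, whitelist)]
```

A check would then call `filter_domains` on the hostnames extracted from a post before running its lookups, so whitelisted sites never reach the broad watches.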