Charcoal-SE / SmokeDetector

Headless chatbot that detects spam and posts links to it in chatrooms for quick deletion.
https://metasmoke.erwaysoftware.com
Apache License 2.0

New filter: website resembles username #450

Closed Glorfindel83 closed 7 years ago

Glorfindel83 commented 7 years ago

E.g. for these kinds of spam posts, which often go undetected: https://metasmoke.erwaysoftware.com/post/52946 https://metasmoke.erwaysoftware.com/post/52841 https://metasmoke.erwaysoftware.com/post/51936

Procedure: replace the spaces in the username with \W? and check whether the post contains a link that matches the resulting pattern. Some users with 3-character usernames could trigger the filter by accident, so maybe this should only apply to usernames above a certain length.
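A minimal sketch of the procedure described above, assuming the links have already been pulled out of the post. The function names and the length cutoff of 5 are hypothetical; the thread does not settle on a value.

```python
import re

# Hypothetical cutoff; chosen only for illustration.
MIN_USERNAME_LENGTH = 5


def username_pattern(username):
    """Build a case-insensitive regex where each run of spaces becomes \\W? (an optional non-word character)."""
    words = username.split()
    return re.compile(r"\W?".join(re.escape(word) for word in words), re.IGNORECASE)


def link_resembles_username(username, links):
    """Return True if any link contains the username, allowing one optional separator per space."""
    # Skip short usernames, which are more likely to match a link by accident.
    if len(username.replace(" ", "")) < MIN_USERNAME_LENGTH:
        return False
    pattern = username_pattern(username)
    return any(pattern.search(link) for link in links)
```

For example, "Price Buy" becomes the pattern Price\W?Buy, which matches the "price-buy" part of http://www.price-buy.com/.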

magisch commented 7 years ago

Sounds like a good idea

ArtOfCode- commented 7 years ago

Actually, having assigned this to myself, I've just realised this isn't currently possible. We only check one of username/title/body/summary at a time, so there's no point at which the check code has access to both the username and the body.

magisch commented 7 years ago

@ArtOfCode- Wouldn't it be possible to schedule the Username check before the body check and save the username temporarily so you can access it in the body check?

ArtOfCode- commented 7 years ago

@magisch Possibly. Would have to look at that.

Undo1 commented 7 years ago

Sounds messy, I'd probably be against that. It'd be better to just make a new reason-method type that takes all parts of the post at once.

ghost commented 7 years ago

@undo1 agree
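To illustrate the idea of a check that receives all parts of the post at once, here is a rough sketch. Everything in it is hypothetical: the Post container, the function names, and the length cutoff are assumptions made for illustration and are not taken from SmokeDetector's code. It uses the standard library's html.parser to collect links from the body.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Post:
    # Hypothetical container; the real post object may look different.
    username: str
    title: str
    body: str


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def links_in_body(body_html):
    collector = LinkCollector()
    collector.feed(body_html)
    collector.close()
    return collector.links


def username_in_link(post, min_length=5):
    """Whole-post check: sees the username and the body together."""
    # Compare on lowercase alphanumerics only, so "Price Buy" can match "price-buy".
    name = "".join(ch for ch in post.username.lower() if ch.isalnum())
    if len(name) < min_length:  # hypothetical cutoff for short usernames
        return False
    for link in links_in_body(post.body):
        squashed = "".join(ch for ch in link.lower() if ch.isalnum())
        if name in squashed:
            return True
    return False
```

Because the check takes the whole post, nothing has to be stored between a username check and a body check.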

AWegnerGitHub commented 7 years ago

Are there other test cases to run against? Right now I am using the following tests:

checks = [
    ("http://www.price-buy.com/", "Price Buy"),
    ("https://thebestparkourgear.com/backpack-for-parkour/", "TheBestParkourGear"),
    ("httl://bestonwardticket.com", "Best onward Ticket"),
    ("https://i.stack.imgur.com/eS6WQ.jpg", "Best onward Ticket"),
    ("www.stackoverflow.com", "Andy"),
    ("www.stackoverflow.notarealtld", "Andy"),
    ("stackoverflow.notarealtld", "Andy"),
    ("http://stackoverflow.notarealtld", "Andy"),
    ("httl://stackoverflow.notarealtld", "Andy"),
]

I get the following results:

SIMILAR: (1.0) => Name: Price Buy, Domain: http://www.price-buy.com/
SIMILAR: (1.0) => Name: TheBestParkourGear, Domain: https://thebestparkourgear.com/backpack-for-parkour/
SIMILAR: (1.0) => Name: Best onward Ticket, Domain: httl://bestonwardticket.com
NOT SIMILAR: (0.0952380952381) => Name: Best onward Ticket, Domain: https://i.stack.imgur.com/eS6WQ.jpg
NOT SIMILAR: (0.117647058824) => Name: Andy, Domain: www.stackoverflow.com
NOT SIMILAR: (0.117647058824) => Name: Andy, Domain: www.stackoverflow.notarealtld
NOT SIMILAR: (0.117647058824) => Name: Andy, Domain: stackoverflow.notarealtld
NOT SIMILAR: (0.0) => Name: Andy, Domain: http://stackoverflow.notarealtld
NOT SIMILAR: (0.0) => Name: Andy, Domain: httl://stackoverflow.notarealtld

It's a little messier than I thought it'd be, and it does require adding a library to Smokey (tld, https://pypi.python.org/pypi/tld), but it works. My tests have been pretty simple so far: I've only passed the domain, not the entire body of the text. Doing that will require an HTML parser (likely BeautifulSoup), so that would need to be included too.

What I need:

  • The OK to include at least 1 new library: tld https://pypi.python.org/pypi/tld. If we don't already include BeautifulSoup, we also need to include that for parsing the links out of the body.
  • More test cases so I can throw those into here and make sure I'm not missing any other cases.

Undo1 commented 7 years ago

That looks awesome. We already have beautifulsoup (4, I think?), and that TLD library is tiny. Have the code for this in a branch somewhere?
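The thread does not show the comparison code, so the following is only a sketch of one way such a score could be computed. It assumes Python's difflib.SequenceMatcher for the ratio; the 0.1176 reported for "Andy" against stackoverflow is consistent with that, since one shared character across 4 + 13 characters gives 2·1/17. Instead of the tld library it uses a naive "label before the last dot" rule, so it does not validate TLDs or handle suffixes such as .co.uk, and it will not reproduce the 0.0 scores reported above for the notarealtld URLs with a scheme. All function names are hypothetical.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlsplit


def domain_label(url):
    """Return the host label just before the last dot, e.g. 'price-buy'.

    Naive stand-in for a real TLD-aware parser: it ignores multi-part
    suffixes such as .co.uk and does not check that the TLD exists.
    """
    if "://" not in url:
        url = "//" + url  # let urlsplit treat a bare host as the network location
    host = urlsplit(url).hostname or ""
    labels = [part for part in host.split(".") if part]
    if len(labels) < 2:
        return ""
    return labels[-2]


def normalize(text):
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def similarity(username, url):
    """Score in [0, 1] for how closely the username resembles the domain label."""
    name = normalize(username)
    label = normalize(domain_label(url))
    if not name or not label:
        return 0.0
    return SequenceMatcher(None, name, label).ratio()
```

With these helpers, similarity("Price Buy", "http://www.price-buy.com/") gives 1.0, while "Andy" against www.stackoverflow.com scores about 0.12.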

Glorfindel83 commented 7 years ago

Good job! Here's another TP from today: https://metasmoke.erwaysoftware.com/post/58200. Also, one of your test cases uses httl://. I don't know that scheme.

ArtOfCode- commented 7 years ago

@Glorfindel83 HyperText Testing Language

AWegnerGitHub commented 7 years ago

No, no branch yet. I've been testing alternatives all morning, though, and am ready to implement. However, this brings up another point of discussion. I've opened another issue because it will impact more than just this change.

Related issue: #538

AWegnerGitHub commented 7 years ago

@Glorfindel83, yes it does. That httl is from https://metasmoke.erwaysoftware.com/post/51936

AWegnerGitHub commented 7 years ago

Closed with https://github.com/Charcoal-SE/SmokeDetector/commit/286008515030269cc52f0ce41d2f2da23b9a30a1