Closed Glorfindel83 closed 7 years ago
Sounds like a good idea
Actually, having assigned this to myself, I've just realised this isn't currently possible. We only check one of username/title/body/summary at a time, so there's no point at which the check code has access to both the username and the body.
@ArtOfCode- Wouldn't it be possible to schedule the Username check before the body check and save the username temporarily so you can access it in the body check?
@magisch Possibly. Would have to look at that.
Sounds messy, I'd probably be against that. It'd be better to just make a new reason-method type that takes all parts of the post at once.
@undo1 agree
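For illustration, here is a minimal sketch of what a "whole post" reason type could look like next to the existing one-field-at-a-time checks. Everything in it is an assumption made for the example: `Post`, `run_checks`, and `username_in_body` are hypothetical names, not SmokeDetector's actual classes or functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Post:
    # Hypothetical container; the field names are assumptions for this sketch.
    title: str
    body: str
    username: str


# A per-field check sees a single string; a whole-post check sees every part.
FieldCheck = Callable[[str], bool]
PostCheck = Callable[[Post], bool]


def run_checks(post: Post,
               field_checks: List[Tuple[str, str, FieldCheck]],
               post_checks: List[Tuple[str, PostCheck]]) -> List[str]:
    """Return the names of all reasons that matched the post."""
    reasons = []
    # Existing style: each check gets exactly one field of the post.
    for name, field, check in field_checks:
        if check(getattr(post, field)):
            reasons.append(name)
    # New style: each check gets the whole post at once.
    for name, check in post_checks:
        if check(post):
            reasons.append(name)
    return reasons


def username_in_body(post: Post) -> bool:
    # Toy whole-post rule: it needs the username and the body together.
    return bool(post.username) and post.username.lower() in post.body.lower()
```

With this shape, no state has to be carried over from a username check to a later body check; a rule that needs several parts of the post simply registers as a whole-post check.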
Are there other test cases to run against? Right now I am using the following tests:
checks = [
("http://www.price-buy.com/", "Price Buy"),
("https://thebestparkourgear.com/backpack-for-parkour/", "TheBestParkourGear"),
("httl://bestonwardticket.com", "Best onward Ticket"),
("https://i.stack.imgur.com/eS6WQ.jpg", "Best onward Ticket"),
("www.stackoverflow.com", "Andy"),
("www.stackoverflow.notarealtld", "Andy"),
("stackoverflow.notarealtld", "Andy"),
("http://stackoverflow.notarealtld", "Andy"),
("httl://stackoverflow.notarealtld", "Andy"),
]
I get the following results:
SIMILAR: (1.0) => Name: Price Buy, Domain: http://www.price-buy.com/
SIMILAR: (1.0) => Name: TheBestParkourGear, Domain: https://thebestparkourgear.com/backpack-for-parkour/
SIMILAR: (1.0) => Name: Best onward Ticket, Domain: httl://bestonwardticket.com
NOT SIMILAR: (0.0952380952381) => Name: Best onward Ticket, Domain: https://i.stack.imgur.com/eS6WQ.jpg
NOT SIMILAR: (0.117647058824) => Name: Andy, Domain: www.stackoverflow.com
NOT SIMILAR: (0.117647058824) => Name: Andy, Domain: www.stackoverflow.notarealtld
NOT SIMILAR: (0.117647058824) => Name: Andy, Domain: stackoverflow.notarealtld
NOT SIMILAR: (0.0) => Name: Andy, Domain: http://stackoverflow.notarealtld
NOT SIMILAR: (0.0) => Name: Andy, Domain: httl://stackoverflow.notarealtld
It's a little messier than I thought it'd be, and does require a library be added to Smokey, but it works. My tests have been pretty simple so far. I've only passed the domain, not the entire body of the text. Doing that will require an HTML parser (likely BeautifulSoup), so that'd need to be included too.
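For readers following along, a rough sketch of this kind of comparison is below. It is not the code that produced the results above: it assumes the score is a `difflib.SequenceMatcher` ratio between the normalised username and the domain label, and it swaps the tld library for a naive, standard-library-only `naive_domain_label` helper (a hypothetical name) that only handles one-label suffixes such as `.com`.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlparse


def naive_domain_label(url):
    """Return the hostname label just left of the last dot.

    Assumption: the suffix is a single label such as .com. A suffix-aware
    library is needed for hosts like example.co.uk.
    """
    if "://" not in url:
        url = "http://" + url  # urlparse needs a scheme to locate the host
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    return labels[-2] if len(labels) >= 2 else host


def normalize(text):
    # Lowercase and drop everything except ASCII letters and digits.
    return re.sub(r"[^a-z0-9]", "", text.lower())


def similarity(username, url):
    """Score in [0, 1] for how closely the username resembles the domain label."""
    a = normalize(username)
    b = normalize(naive_domain_label(url))
    if not a or not b:
        return 0.0  # SequenceMatcher would score two empty strings as 1.0
    return SequenceMatcher(None, a, b).ratio()
```

Calling `similarity("Price Buy", "http://www.price-buy.com/")` compares `pricebuy` with `pricebuy`, while `similarity("Andy", "www.stackoverflow.com")` compares `andy` with `stackoverflow`. Where to put the cutoff between "similar" and "not similar" is left open here.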
What I need:
- The OK to include at least 1 new library: tld (https://pypi.python.org/pypi/tld). If we don't already include BeautifulSoup, we also need to include that for parsing the links out of the body.
- More test cases so I can throw those into here and make sure I'm not missing any other cases.
That looks awesome. We already have beautifulsoup (4, I think?), and that TLD library is tiny. Have the code for this in a branch somewhere?
Good job! Here's another TP from today: https://metasmoke.erwaysoftware.com/post/58200
Also, one of your test cases has an httl:// scheme. I don't know that scheme.
@Glorfindel83 HyperText Testing Language
No, no branch yet. I've been testing alternatives all morning though and am ready to implement. However, this brings up another point of discussion. I've opened a separate issue because it will affect more than just this change.
Related issue: #538
@Glorfindel83, yes it does. That httl is from https://metasmoke.erwaysoftware.com/post/51936
E.g. for these kinds of spam posts, which go undetected quite often:
https://metasmoke.erwaysoftware.com/post/52946
https://metasmoke.erwaysoftware.com/post/52841
https://metasmoke.erwaysoftware.com/post/51936
Procedure: replace spaces in the username with \W? and check whether any link in the post matches the resulting pattern. Some users have 3-character usernames, which have a chance of accidentally triggering the filter. Maybe this should only work for usernames above a certain length.
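A minimal sketch of that procedure, assuming the links have already been pulled out of the post body. The function names and the `MIN_USERNAME_LENGTH` cutoff of 5 are placeholders chosen for the example; no specific length has been settled on here.

```python
import re

# Hypothetical cutoff; pick a real value after looking at false positives.
MIN_USERNAME_LENGTH = 5


def username_link_pattern(username):
    # Escape each word of the username, then join the words with \W? so that
    # "Price Buy" also matches "pricebuy", "price-buy" and "price_buy" is not
    # matched (underscore counts as a word character).
    parts = [re.escape(part) for part in username.split()]
    return re.compile(r"\W?".join(parts), re.IGNORECASE)


def links_matching_username(username, links):
    """Return the links that contain the username, ignoring case and spacing."""
    # Skip short usernames, which are likely to match by accident.
    if len(username.replace(" ", "")) < MIN_USERNAME_LENGTH:
        return []
    pattern = username_link_pattern(username)
    return [link for link in links if pattern.search(link)]
```

For example, `links_matching_username("Price Buy", ["http://www.price-buy.com/"])` builds the pattern `Price\W?Buy` and finds it in the link, while a four-character username such as "Andy" is skipped by the length check before any matching happens.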