Charcoal-SE / SmokeDetector

Headless chatbot that detects spam and posts links to it in chatrooms for quick deletion.
https://metasmoke.erwaysoftware.com
Apache License 2.0

Integrate DeepSmoke #1014

Closed tripleee closed 6 years ago

tripleee commented 7 years ago

@tanmayb123 has created an API for us to query. StackOverflow posts only, for the time being.

The API is basically

99.239.154.69/dsd/index.php?q=[body270urlencoded]

... where body270urlencoded is the first 270 bytes of the post body.
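As a minimal sketch of how a client might build that query: the `http://` scheme, the constant names and the `build_query_url` helper are assumptions for illustration, since the description above gives only the host, the path and the 270-byte limit.

```python
from urllib.parse import quote

# Assumption: plain HTTP. The issue only gives the host and path.
DEEPSMOKE_ENDPOINT = "http://99.239.154.69/dsd/index.php"
MAX_BODY_BYTES = 270


def build_query_url(body: str, endpoint: str = DEEPSMOKE_ENDPOINT) -> str:
    """Return the query URL for the first 270 bytes of a post body."""
    # Truncate on UTF-8 bytes rather than characters, because the
    # description says "bytes". This may split a multi-byte character,
    # but quote() percent-encodes the raw bytes safely either way.
    truncated = body.encode("utf-8")[:MAX_BODY_BYTES]
    return endpoint + "?q=" + quote(truncated, safe="")
```

The resulting URL could then be fetched with any HTTP client. What the server does with a partially cut multi-byte character is not specified in this thread.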

Detailed transcript starting around here: https://chat.stackexchange.com/transcript/message/39458944#39458944

... but details are a bit further down.

angussidney commented 7 years ago

Before we implement this, I just want to check on performance. Currently, Smokey scans 35-45 posts per period of time (minute? second? @ArtOfCode- / @Undo1). When I watched Tanmay's video, the application seemed to take 2-3 seconds to return a response. Can @tanmayb123 let us know whether this was mainly due to post processing, or simply network latency? If it's the second, it shouldn't be an issue, but if it actually takes that long to compute, we may need to change the implementation. Maybe fire off a request with the body summary upon seeing a post, so that the check runs while the post is waiting in our API queue?
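The "fire off a request upon seeing a post" idea could be sketched roughly as follows. The `DeepSmokePrefetcher` class and its method names are hypothetical and not part of Smokey, and `classify` stands for whatever function ends up performing the slow remote query.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class DeepSmokePrefetcher:
    """Start classification when a post is first seen; collect it later."""

    def __init__(self, classify: Callable[[str], bool], max_workers: int = 8):
        self._classify = classify
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._pending: Dict[int, Future] = {}

    def on_post_seen(self, post_id: int, body_summary: str) -> None:
        # Runs in a worker thread while the post waits in the API queue.
        self._pending[post_id] = self._pool.submit(self._classify, body_summary)

    def verdict(self, post_id: int, timeout: float = 10.0) -> bool:
        # Blocks only if the background query has not finished yet.
        return self._pending.pop(post_id).result(timeout=timeout)
```

With this shape, the 2-3 second query overlaps with the time the post already spends waiting, so it only adds delay when the queue wait is shorter than the query.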

tripleee commented 7 years ago

It does take a long time to scan a message. I'm running the test suite on the simple, straightforward PR I am preparing, and it takes much longer than it used to because it performs a 2-3s query for each test case. It also produces false positives for many test posts.

tripleee commented 7 years ago

https://github.com/tripleee/SmokeDetector/tree/deepsmoke contains a simple straightforward implementation, but I'm hesitant to create a pull request out of it at this point because several test cases are failing.

test/test_regexes.py:96: AssertionError
----------------------------- Captured stdout call -----------------------------
[11:45:54] Max limit on number of concurrent ajax request
[11:45:54] Result:  ['Body classified as spam by DeepSmoke']
============== 20 failed, 197 passed, 3 skipped in 468.70 seconds ==============

Notice also that the test suite takes almost 10 minutes with this change.

ArtOfCode- commented 7 years ago

@tripleee I'd create a PR, but mark it as WIP. We can hold off merging, but having the PR makes commenting on things easier.

tripleee commented 7 years ago

I asked @tanmayb123 in chat whether he could return a JSON response instead. That could also help reduce false positives if the response contained a confidence rating of some sort. From the demos I saw earlier, we could require at least 1% confidence (for example), but I can't tell what exactly the threshold should be. Anyway, the way it is now, you just get an opaque black-or-white verdict which is wrong some of the time (it basically seems to return Spam for everything in my limited testing).
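To illustrate how a confidence rating could be used on the client side, here is a minimal sketch. No JSON format has been agreed on in this thread, so the `verdict` and `confidence` field names are purely hypothetical, and the threshold is left as a parameter because the right value is unknown.

```python
import json


def is_spam(response_text: str, threshold: float) -> bool:
    """Accept a Spam verdict only if its confidence meets the threshold."""
    # Hypothetical payload shape: {"verdict": "Spam", "confidence": 0.97}
    payload = json.loads(response_text)
    return (payload["verdict"] == "Spam"
            and float(payload["confidence"]) >= threshold)
```

For example, `is_spam('{"verdict": "Spam", "confidence": 0.4}', 0.9)` would reject a low-confidence verdict instead of reporting it.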

tripleee commented 7 years ago

For the record, I'm attaching a transcript of the tests from the test suite. The empty string and some other common test cases get a Spam verdict, but in fairness, there are also some strings which get Not Spam. I'm also inlining the result here as a code block so you don't have to download it just to look at it.

Spam ''
Spam '<p>bbbbbbbbbbbbbbbbbbbbbb</p>'
Spam 'Yay titles!'
Spam ''
Spam 'bbbbbbbbbbbabcdefghijklmnop'
Spam 'kkkkkkkkkkkkkkkkkkkkkkkkkkkk'
Spam ''
Spam 'bbbbbbbbbbbbbbbbbbbbbbbbbbbbb'
Spam '99999999999'
Spam ''
Spam ''
Spam 'Spam spam spam'
Spam 'babylisscurl'
Spam ''
Spam 'Question'
Spam ''
Spam '111111111111111111111111111111111111'
Spam 'Question'
Spam ''
Not Spam 'I have this number: 111111111111111'
Not Spam 'Gmail Tech Support (1-844-202-5571) Gmail tech support number[Toll Free Number]?'
Spam ''
Spam ''
Not Spam '<>1 - 866-978-6819<>gmail password reset//gmail contact number//gmail customer service//gmail help number'
Spam ''
Spam ''
Not Spam 'Hotmail technical support1 - 844-780-67 62 telephone number Hotmail support helpline number'
Spam ''
Spam ''
Spam 'Valid title'
Spam ''
Not Spam 'Hotmail technical support1 - 844-780-67 62 telephone number Hotmail support helpline number'
Not Spam '[[[[[1-844-202-5571]]]]]Gmail Tech support[*]Gmail tech support number'
Spam ''
Spam ''
Not Spam '@@<>1 -866-978-6819 FREE<><><::::::@Gmail password recovery telephone number'
Spam ''
Spam ''
Not Spam '1 - 844-780-6762 outlook password recovery number-outlook password recovery contact number-outlook password recovery helpline number'
Spam ''
Spam ''
Spam 'hotmail customer <*<*<*[*[ 1 - 844-780-6762 *** support toll free number Hotmail Phone Number hotmail account recovery phone number'
Spam ''
Spam ''
Not Spam '1 - 844-780-6762 outlook phone number-outlook telephone number-outlook customer care helpline number'
Spam ''
Spam ''
Spam 'Repeating word word word word word word word word word'
Spam ''
Spam ''
Not Spam 'Visit this website: optimalstackfacts.net'
Spam ''
Spam ''
Spam 'asdf asdf asdf asdf asdf asdf asdf asdf'
Spam ''
Spam ''
Spam 'A title'
Spam ''
Spam '>>>>  http://'
Spam ''
Spam ''
Spam '<p>Test <a href="http://example.com/" rel="nofollow">some text</a> moo moo moo.</p><p>Another paragraph. Make it long enough to bring this comfortably over the 300-character limit. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt'
Spam 'spam'
Spam ''
Spam '>>>> http://'
Spam 'Another title'
Spam ''
Spam '<code>>>>>http://</code>'
Not Spam 'This asdf should asdf not asdf be asdf matched asdf because asdf the asdf words do not asdf follow on each asdf other.'
Spam ''
Spam ''
Not Spam 'This is a title.'
Spam ''
Not Spam 'This is a body.<pre>bbbbbbbbbbbbbb</pre>'
Not Spam 'This is another title.'
Spam ''
Not Spam 'This is another body. <code>bbbbbbbbbbbb</code>'
Not Spam 'Yet another title.'
Spam ''
Not Spam 'many whitespace             .'
Not Spam 'Perfectly valid title.'
Spam ''
Spam 'bbbbbbbbbbbbbbbbbbbbbb'
Spam 'Yay titles!'
Spam ''
Spam 'bbbbbbbbbbbabcdefghijklmnopqrstuvwxyz123456789a1b2c3d4e5'
Spam 'Long double'
Spam ''
Not Spam 'I have this value: 9999999999999999'
Not Spam 'Another valid title.'
Spam ''
Spam 'asdf asdf asdf asdf asdf asdf asdf asdf asdf'
Spam 'Array question'
Spam ''
Not Spam 'I have an array with these values: 10 10 10 10 10 10 10 10 10 10 10 10'
Spam 'Array question'
Spam ''
Not Spam 'I have an array with these values: 0 0 0 0 0 0 0 0 0 0 0 0'
Not Spam 'his email address is (SOMEONE@GMAIL.COM)'
Spam ''
Spam ''
Spam 'something'
Spam ''
Not Spam 'his email address is (SOMEONE@GMAIL.COM)'
Spam 'Title here'
Spam ''
Not Spam '<img src="http://example.com/11111111111.jpg" alt="my image">'
Spam 'Title here'
Spam ''
Not Spam '<img src="http://example.com/11111111111111.jpg" alt="my image" />'
Spam 'Title here'
Spam ''
Spam '<a href="http://example.com/11111111111111.html">page</a>'
Spam 'Error: 2147467259'
Spam ''
Spam ''
Not Spam 'Max limit on number of concurrent ajax request'
Spam 'Price Buy'
Not Spam '<p>Php java script boring yaaarrr <a href="http://www.price-buy.com/" rel="nofollow noreferrer">Price-Buy.com</a> </p>'
Not Spam 'Max limit on number of concurrent ajax request'
Not Spam 'Totally Unrelated Username'
Not Spam '<p>Php java script boring yaaarrr <a href="http://www.price-buy.com/" rel="nofollow noreferrer">Price-Buy.com</a> </p>'
Spam 'kkkkkkkkkkkkkkkkkkkkkkkkkkkk'
Spam ''
Spam '<p>bbbbbbbbbbbbbbbbbbbbbb</p>'
Spam ''
Spam "('1', 'stackoverflow.com')"
Spam ''

deepsmoke.txt

tanmayb123 commented 7 years ago

@tripleee Sure, I can return JSON. I'll notify when completed.

tanmayb123 commented 7 years ago

@angussidney There's a catch: the architecture looks like this: PHP Interface -> Python -> Tensorflow Initialization -> Keras Initialization -> Prediction -> PHP Interface. The Tensorflow and Keras initialization takes most of the time. The prediction takes practically none. So, if you can skip the init overhead by accessing the source directly, the performance goes up drastically.

angussidney commented 7 years ago

@tanmayb123 so how could we skip that? Is there a way in which we could run the predictions locally on Smokey, or for your Python script to stay running and simply take input from the PHP script without reinitialising every time?

tanmayb123 commented 7 years ago

So theoretically, if I host the HTTP Server from within Python and allow it to access an already-initialized Tensorflow environment, I can have the performance go up. I'll try that out tomorrow.
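A rough sketch of that approach, using only the standard library's WSGI support: `load_model` is a stand-in for the slow Tensorflow/Keras start-up, the keyword check inside it is a placeholder rather than the real model, and the `verdict` field and port 8080 are assumptions.

```python
import json
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server


def load_model():
    """Stand-in for the slow model initialization (hypothetical)."""
    # A real service would load the trained model here, exactly once.
    return lambda text: "Spam" if "spam" in text.lower() else "Not Spam"


MODEL = load_model()  # cost is paid at process start, not per request


def app(environ, start_response):
    """WSGI app: classify the `q` query parameter with the preloaded model."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    text = query.get("q", [""])[0]
    body = json.dumps({"verdict": MODEL(text)}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]


def serve(port: int = 8080) -> None:
    """Run the app on the built-in development server (blocks forever)."""
    make_server("", port, app).serve_forever()
```

Calling `serve()` starts the server. Because `app` is a plain WSGI callable, it could equally be mounted under any other WSGI-capable server.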

tanmayb123 commented 7 years ago

Ah you posted just before me. I think either implementing it natively into Smokey or the already initialized HTTP server is a good idea.

angussidney commented 7 years ago

If you're fine to continue hosting, I think that would be a better idea, as it would be a pain to get all of the Smokey instances updated to have all of the dependencies installed and initialised.

tanmayb123 commented 7 years ago

I can keep hosting. I'll move it over to a DigitalOcean server once I get the faster API ready, though, as it's currently running on my local machine.

tripleee commented 7 years ago

An alternative integration point could be Halflife. My plan is to have it report results to chat once it's done querying etc., so this could be just another query it performs. That would remove the requirement to immediately integrate anywhere in the Smoke stack proper, and just supply the chat room with a bit more background for each recent post (I suppose regardless of whether it was already flagged or not).

tripleee commented 7 years ago

If you plan to move it to a different IP address, that would be another reason to set up a host name so that we don't have to make code changes when the address changes.

Undo1 commented 7 years ago

Looks awesome. API support is probably the way to go here (my Pi likely wouldn't be happy running this), as it allows us to transition hosts easily. For now, an API is fine - but it has to stay marked experimental until we figure out a long term plan.

teward commented 7 years ago

@Undo space is always available on one of my two systems to run this in a container on one of my beefier VDSes from RamNode - 8GB of RAM on each box. If we need to move this to one of those systems we can and we can launch two instances if we need backup failover power. This is why Solar Flare hasn't died yet, and Lunar Eclipse is coming up in another couple days.

AWegnerGitHub commented 7 years ago

Questions:

tripleee commented 7 years ago

Good questions, but I'd try to get it up and running experimentally first, and then if it proves valuable, think about making it solid enough for production.

ArtOfCode- commented 7 years ago

@tanmayb123 as far as implementation on that goes, you can use Flask or Django to serve HTTP requests from Python. I'd probably recommend Flask, being more lightweight.

tripleee commented 7 years ago

(Flask is arguably overkill for wrapping an existing function. Anything which offers (u)wsgi/fastcgi functionality should be fine. I like Bottle as a simpler alternative to Flask but you could probably get away with something even simpler.)

AWegnerGitHub commented 7 years ago

@tripleee Assuming it can be tuned a bit to catch the edge cases mentioned above, I believe it will be valuable. However, we need to know about the rate limits up front, especially if we are going to be throwing all posts that Smokey sees against this thing.

I watched the keynote presentation and it was misleadingly stated that this could handle 30 spam posts a day. Sure, it can handle 30 posts, but those 30 posts are in a sea of ~10,000 posts. It needs to be able to handle that volume to pick out the small percentage of bad posts.

tanmayb123 commented 7 years ago

@AWegnerGitHub there are no rate limits. You can use this as often and as much as you’d like. Also, how many spam posts does Stack Overflow get in 1 day?

tanmayb123 commented 7 years ago

@tripleee getting the API set up with Bottle.

tripleee commented 7 years ago

Andy's point was that it needs to scale to the number of posts, not just the number of spam posts. The 30/day seems roughly correct (actually slightly inflated) but we need to query it on the order of 10,000 times per day.

AWegnerGitHub commented 7 years ago

@tanmayb123 The numbers provided by @tripleee are correct. In the last few weeks, it looks like August 3rd was the high point of spam on Stack Overflow. Smoke Detector saw 25 spam posts. At the same time, there were a minimum of 20524 posts created that day (remember, SEDE doesn't show deleted posts, so that number could be higher).

To fully utilize this, we'd need to be able to scan 20,524 posts and detect those 25 spam posts.
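To put those figures in perspective, a quick back-of-the-envelope calculation, assuming posts arrive evenly over the day (they do not, so peaks will be busier) and taking the upper end of the 2-3 second scan time reported earlier in this thread:

```python
SPAM_POSTS = 25          # spam seen by Smoke Detector on August 3rd
TOTAL_POSTS = 20524      # minimum number of posts created that day
SECONDS_PER_DAY = 24 * 60 * 60
SCAN_SECONDS = 3.0       # upper end of the reported 2-3 s per query

base_rate = SPAM_POSTS / TOTAL_POSTS                  # share of posts that are spam
mean_gap = SECONDS_PER_DAY / TOTAL_POSTS              # average seconds between posts
serial_hours = TOTAL_POSTS * SCAN_SECONDS / 3600      # one-at-a-time scan time

print(f"spam share of posts: {base_rate:.2%}")
print(f"mean gap between posts: {mean_gap:.1f} s")
print(f"serial scan time per day: {serial_hours:.1f} h")
```

This works out to roughly 0.12% spam, one post every 4.2 seconds on average, and about 17.1 hours of strictly sequential scanning per day. So the average load fits within a day even without parallel requests, but with little headroom for bursts.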

tanmayb123 commented 7 years ago

Sorry, I meant to say that the API can be used however many times you'd like and as often you'd like for all types of posts, not just spam posts.

AWegnerGitHub commented 7 years ago

Assuming the numbers above hold steady and we only use this for SO - so 20,000ish posts a day - what is the pricing for the services this utilizes after Bluemix's 30-day trial period is up?

AWegnerGitHub commented 7 years ago

Apparently I missed something in the pricing guides. Doesn't bluemix/watson have a cost associated with it?

tanmayb123 commented 7 years ago

@AWegnerGitHub Don't worry about the pricing - leave that to me.

AWegnerGitHub commented 7 years ago

We can't take advantage of your generosity nor can we expect you to cover costs, especially if they are significant, indefinitely. I'd prefer to know costs up front, even if you or someone else opts to cover it for now.

tripleee commented 7 years ago

I had to make a couple of small adaptations to the WIP because of changes in the values I get back from the PHP server. The test suite passes now, with no false positives. It still takes almost 500 seconds.

I am also running a quick PoC with the ws.py fetcher from the old sdml project and it seems to be keeping up nicely even with the current overhead (scanning *.stackoverflow.com posts only). The Stack Exchange API throttling etc seems to keep things going at a pace where 2-3 seconds scan time is not a major blocker. (180 posts scanned so far; no false positives there either as far as I can tell.)

tripleee commented 7 years ago

I have now scanned 713 messages and counting, with no hiccups and no false positives.

As long as the scanning takes place in a separate thread, adding a bit of latency to the scan task does not seem to create any technical problems (as long as we have fewer ongoing tasks than the system can have running threads!) though of course it would be nice to get the Smokey result for a post as quickly as possible.

tripleee commented 7 years ago

Terminated my experiment after 1248 messages. Disappointingly, nothing flagged (though from monitoring Charcoal and SOCVR I guess there wasn't anything it should have flagged, either).

AWegnerGitHub commented 7 years ago

@tripleee Any chance you could run this during high volume spam hours?

Undo1 commented 7 years ago

If it'd help, I can spin up an Azure machine for a while and give you SSH creds. Let me know if that'd be useful, the credit is just sitting there.

tripleee commented 7 years ago

@Undo1 if that was meant for me, sure, I could give it a spin, maybe tomorrow (bedtime soon here).

tripleee commented 7 years ago

I have sent you a public key to the email address I found in many of your Github commits.

The email comes from charcoal@ at a Fastmail-hosted domain (Received: headers will indicate messagingengine.com).

Seeing as Azure doesn't seem to offer Debian images, Ubuntu would be the most familiar out of the ones they do provide.

I'm not sure how best to decide which region to create the instance in. For what it's worth, I'm in northern Europe.

Ref https://chat.stackexchange.com/transcript/message/39559505#39559505

tripleee commented 7 years ago

I created a separate chat room for coordinating this. I'll try to keep pertinent notes here in the ticket but routine reports about further experiments etc should probably be kept in the chat room. https://chat.stackexchange.com/rooms/64277/deepsmoke-a-charcoal-subsidiary

tripleee commented 7 years ago

The one I ran yesterday PM on a server was refactored to run standalone without the various things I have installed on my laptop and I accidentally dropped the SO-only code snippet in the process. Thus, it fetched around 6,500 messages from all sites in the network during the roughly 5 hours it was running, but it did not detect a single spam incident. (It eventually terminated because of a network problem of some sort around 18 UTC.)

I now have the same thing running on Azure with credentials I received from Undo earlier today. But it seems that the PHP side somehow lost its detection capabilities last Thursday, so this experiment is probably pointless until that is fixed.

tripleee commented 7 years ago

The PHP instance died some time ago. Tanmay has been active in the chat room recently and making noises which I interpret to mean we should be able to revive this once he gets a new spiffy instance up with a new spiffy JSON API but there is no firm timetable.

tripleee commented 7 years ago

Actually, it turns out that Watson is not a runtime requirement at all. You can run it with just Python and a few common add-on libraries, once you have the model. (Even training doesn't require Watson, I guess.) The source distribution doesn't include a model, but Tanmay has been sharing one, which, however, seems to be too aggressive.

tripleee commented 7 years ago

I have revived the demo in Undo's Azure instance. Results are being exposed on a web socket at ws://smokey-deepsmoke2903.cloudapp.net:8888/

More details at https://chat.stackexchange.com/transcript/message/40147025#40147025

So far, 300 posts scanned, no spam detected by Deepsmoke, at least one missed.

The stderr output is in some ways more informative because it includes the number of posts it has detected, but I'll keep the JSON output on the web socket at least for the time being. Perhaps we can come up with something a little more sophisticated.

tripleee commented 7 years ago

I fixed a stupid bug in my integration, so we now have a test running since a few days back. There have been on the order of two true positives and hundreds of false positives, but we could probably tweak some parameters to the point where there is more signal than noise. And anyway, @tanmayb123 is allegedly working on a revamp of the underlying models which could change the game significantly.

@Fortunate-MAN contributed a bot called PulseMonitor which is now posting notices in yet another chat room (to keep the test results separate from the deepsmoke discussion room) called Charcoal Test, here: https://chat.stackexchange.com/rooms/65945/charcoal-test

The ad hoc infrastructure for this is sprinkled across the Deepsmoke chat room. Here's a quick summary:

tripleee commented 7 years ago

Deepsmoke is now running with the 20,000 API quota but it still ran out of requests. If we want to continue this experiment for long, I should upgrade deepsmoke-ws.py to use the batch operations from the main Smoke Detector code base, but for the time being, it's just doing an unsophisticated listening operation on the Stack Exchange API socket. It will thus require manual restarts, although I'll think about setting up a script of some sort if we want to keep it running.
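For reference, the idea behind batching is that the Stack Exchange API accepts up to 100 semicolon-delimited ids in a single request, so fetching posts in groups uses far fewer requests against the quota. A minimal sketch of the grouping step follows. This is not the batch code from the main Smoke Detector code base, and `batch_ids` is a hypothetical helper.

```python
from typing import Iterable, Iterator, List

MAX_IDS_PER_REQUEST = 100  # Stack Exchange API limit on ids per call


def batch_ids(post_ids: Iterable[int],
              size: int = MAX_IDS_PER_REQUEST) -> Iterator[str]:
    """Yield semicolon-joined id strings, each covering at most `size` posts."""
    batch: List[int] = []
    for post_id in post_ids:
        batch.append(post_id)
        if len(batch) == size:
            yield ";".join(map(str, batch))
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        yield ";".join(map(str, batch))
```

With this grouping, 250 post ids would cost three API requests instead of 250.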