Closed: quartata closed this issue 7 years ago.
What value does this add or what problem does it solve? There's significant work involved in this, and while I'm not opposed in the slightest to doing it, I'd kinda like there to be some point to it.
@ArtOfCode Although (as far as I'm aware) each `check_if_spam` runs in its own thread, `FindSpam` is totally sequential, applying each rule in order. One of my nightmare fears is that some complicated rule on more complicated posts, combined with CPython slowness, will slow everything down without us noticing. Aside from being generally useful data for tweaking our rules, this could be an early alarm for a rule going bad.
Regarding "significant work involved": just how much? I know how to do it on Smokey's side after reading the source, but I don't know what it would look like on metasmoke. If it's too much trouble to put on MS, it could be a chat command instead, like `why`.
It's definitely possible to do on the metasmoke side. Probably the easiest way is to make the posts-reasons many-to-many relationship explicit via a join model, and add an `execution_time` attribute to that model. Possible, but certainly not a trivial refactor.
I suspect the same is true of Smokey, actually - regardless of what the source looks like, there's always more complexity than you think there is with Smokey. It may not be a ton of work, but probably more than you think.
On the other hand, your point about it being a valuable service health metric is valid. I'm happy to do the metasmoke side, if someone else wants to tackle Smokey.
Furthermore, considering just how many posts are posted on SE, Smokey strikes me as being pseudo-soft real-time. I think keeping time/performance in mind is a good thing as our stack of regexes slowly engulfs the world's skyscrapers.
@ArtOfCode- Cool. I'll take a stab at the Smokey side anyway. For now, I'll assume that the data will go into a dictionary `execution_times` (mapping reason names to time spent) in the JSON sent to `/posts.json`.
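A sketch of what that payload might look like. The surrounding fields and the numbers are made up for illustration; only the `execution_times` shape follows quartata's description (reason names mapped to time spent).

```python
import json

# Hypothetical report payload for /posts.json; execution_times is the
# proposed addition, mapping each reason name to milliseconds spent.
report = {
    "title": "Example spam post",
    "reasons": ["bad keyword in body"],
    "execution_times": {
        "bad keyword in body": 22.4,
        "repeating characters": 0.8,
    },
}
payload = json.dumps(report)
```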
@quartata Might be best to hold off until I can stab metasmoke; it may be far easier to have the data submitted slightly differently.
This seems like a solution in search of a problem.
> Smokey strikes me as being pseudo-soft real-time. I think keeping time/performance in mind is a good thing as our stack of regexes slowly engulfs the world's skyscrapers.
We aren't real-time. We have built-in delays to fill API queues, and we have manual validation of reports. What is measuring a few milliseconds going to get us, other than a few numbers to look at, more processing time, and a potential hang-up point?
Additionally, Smokey instances run on different hardware, so timings will be different on each. Will that be accounted for?
@AWegnerGitHub Sure, the report validation is manual, but it's still important to get the post in front of a human's face in a reasonable amount of time.
Regarding the accuracy of timings, I'm not really concerned with small millisecond differences. This is really about seeing order-of-magnitude differences, and neither noise nor small hardware differences will affect that (especially since most small C-level timing noise will most likely be drowned out by CPython function-call overhead). Plus, if we're assessing a rule's performance overall, we'd probably average the times across Smokey instances anyway.
I'd also like to note that the milliseconds I quoted in the example were just random numbers. A single rule test is more likely to take ~20-30 ms, not a few.
Agree that this'd be interesting data to have, but it'd probably be best reported as min|q1|med|q3|max for each reason, maybe in the 10-min stats ping cycle. Metasmoke only has post data for things we report, which are by no means representative of the posts we see. Storing data for every post we see isn't gonna happen - it'd be better to just throw links to the post in chat if we see extremely long processing times.
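The min|q1|med|q3|max summary suggested above could be computed with the standard library alone. A sketch with made-up sample data (the sample values and the `five_number_summary` helper are illustrative, not anything Smokey currently has):

```python
from statistics import quantiles

# Hypothetical per-reason timing samples (ms) collected over one stats cycle.
samples = {"bad keyword in body": [18.0, 22.0, 25.0, 31.0, 90.0]}

def five_number_summary(times):
    """Return (min, q1, median, q3, max) for one reason's timings."""
    q1, med, q3 = quantiles(times, n=4)  # default 'exclusive' method
    return min(times), q1, med, q3, max(times)

for reason, times in samples.items():
    lo, q1, med, q3, hi = five_number_summary(times)
```

Reporting the spread rather than a bare average would make a single pathological post (the 90 ms outlier above) visible without storing data for every post.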
Optimizing for ~20-30 ms per rule seems very excessive. A quick count shows we have about 65 rules, which averages out to less than 2 seconds per post in `FindSpam`; the built-in queue delays are going to be a bigger delay than that. If you want performance improvements, focus on speeding up the rules themselves. I imagine we can have much more efficient regular expressions or custom methods.
Additionally, we already have some data on this. https://metasmoke.erwaysoftware.com/smoke_detector/2/statistics, for example, shows that Art's instance is able to process ~40-80 posts/second (per thread). That's already down to an average of ~25ms for the whole suite.
My Pi, probably the slowest hardware we run, is at ~6/sec - which is still only ~170ms for the whole suite.
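The conversion in the two comments above is just the reciprocal of throughput; a quick sanity check of the quoted figures:

```python
# Back-of-envelope conversion from posts/second to average per-post latency,
# using the figures quoted from the metasmoke statistics page.
def ms_per_post(posts_per_second):
    return 1000.0 / posts_per_second

fast = ms_per_post(40)  # low end of Art's instance: 25 ms per post
pi = ms_per_post(6)     # the Raspberry Pi instance: ~167 ms per post
```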
@AWegnerGithub "focus on speeding up the rules themselves" That's the point. That's why I suggested the breakdown for each rule.
@Undo1 That's faster than I thought, although I suspect it varies depending on the post type.
Possibly, but that's an average. Some are faster, some are slower.
Discussion seems to be pretty dead here.
We could probably change this in the new Smokey anyway.
@angussidney No, this requires Metasmoke changes.
@quartata We could track stats like this locally in the new Smokey, since it's not really essential to publish them in real time.
I figure a picture is worth a thousand words, so here's two thousand words (mind the lack of CSS): [mockup screenshot] and [mockup screenshot]
As our spam rules get more and more complicated, I think it would be worthwhile to include some basic benchmarking data on MS with each post. We could show a grand total for how long it took for SmokeDetector to run every applicable rule over the post beneath the time reported, along with a spoiler showing the breakdown of how long each rule took.
This would require some basic timing instrumentation on SmokeDetector to be sent to MS with the report, but I suspect there will be more work on the MS end than the Smokey end, which is why I've submitted it here.