Charcoal-SE / metasmoke

Web dashboard for SmokeDetector.
https://metasmoke.erwaysoftware.com
Creative Commons Zero v1.0 Universal

MS places max 3 autoflags on posts with > (about) 1000 weight #599

Open makyen opened 5 years ago

makyen commented 5 years ago

Several times, I've noticed that MS doesn't place more than 3 autoflags on SD reports where the weight is > (about) 1,000. In this screenshot, which is of a search of SD reports in CHQ (sorted by decreasing weight, as reported by AIM), the only SD reports with 4 autoflags and > 1,000 weight – AIM reports the current weight from the MS API – are ones where the SD reports indicate the posts had < 1,000 weight at the time they were reported.

While the above search was not comprehensive, it does indicate there's an issue. It would be quite reasonable for some of the > 1,000 weight posts not to get 4 autoflags, as there can be various reasons why more than N autoflags aren't possible on a particular post (e.g. the post already has other spam flags and is deleted with fewer autoflags from MS). However, I would expect most of them to have 4 autoflags raised, not none of them.

ArtOfCode- commented 5 years ago

Maybe [status-bydesign], it's done by % certainty not by weight. Need to look into it in more detail, might have some time this week.

Undo1 commented 5 years ago

It's possible that auto flag accuracies overall have changed in a way that caps most reasons to below the 4-flag threshold. I think we (I, if I find time [hah!]) should look at the effects of changing the window for reason_weight in some way - possibly only look at the last n months, or weight more recent posts differently.
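To illustrate the windowing idea, here is a minimal Python sketch that computes accuracy only over posts reported in the last n months. It is not metasmoke's code: the function name `windowed_accuracy`, the `(reported_at, is_tp)` tuple shape, and the 30-day month approximation are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def windowed_accuracy(posts, now, months=6):
    """Fraction of true positives among posts reported within the window.

    `posts` is a list of (reported_at, is_tp) tuples (a hypothetical shape).
    A month is approximated as 30 days to keep the sketch stdlib-only.
    """
    cutoff = now - timedelta(days=30 * months)
    recent = [is_tp for reported_at, is_tp in posts if reported_at >= cutoff]
    if not recent:
        return None  # no data inside the window
    return sum(recent) / len(recent)


now = datetime(2020, 1, 1)
posts = [
    (datetime(2019, 12, 1), True),
    (datetime(2019, 11, 1), True),
    (datetime(2018, 1, 1), False),  # old false positive, outside a 6-month window
]
print(windowed_accuracy(posts, now, months=6))
print(windowed_accuracy(posts, now, months=36))
```

With a short window the old false positive no longer drags the accuracy down, which is the effect the comment above is speculating about.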

makyen commented 5 years ago

@Undo1 There have been a significant number of posts that received 4-flags (e.g. 2 in the last hour). For some reason, it just doesn't happen for our top weighted MS posts. I know that it's supposed to be based completely on the historical proportion of posts which match or exceed the criteria that are TP, but something is causing it to cap-out.

One possibility is that there just aren't that many posts with that high a weight. Does the formula require that there are enough total posts to meet some criterion (e.g. for setting a flag condition there must be 1000 matching posts)? Is the calculation constrained to a certain number of significant figures, and/or is it having a problem with floating point rounding (although 0 FP should give some simple numbers)?
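The questions above can be made concrete with a small Python sketch of "historical proportion of posts at or above a weight that are TP", with an optional minimum sample size. This is an assumption-laden illustration, not the formula MS actually uses: `accuracy_at_or_above`, `min_samples`, and the `(weight, is_tp)` tuples are hypothetical.

```python
def accuracy_at_or_above(posts, min_weight, min_samples=1):
    """Percent of TP among posts with weight >= min_weight.

    `posts` is a list of (weight, is_tp) tuples (hypothetical shape).
    Returns None when fewer than `min_samples` posts match, which is one
    way a sample-size requirement could cap the flag count.
    """
    matching = [is_tp for weight, is_tp in posts if weight >= min_weight]
    if len(matching) < min_samples:
        return None
    return 100.0 * sum(matching) / len(matching)


posts = [(1200, True), (1100, True), (900, True), (300, False)]
print(accuracy_at_or_above(posts, 1000))                    # 0 FP among matches
print(accuracy_at_or_above(posts, 1000, min_samples=1000))  # too few samples
```

With zero false positives the result is exactly 100.0, so floating point rounding alone is an unlikely culprit in this simplified model; a sample-size requirement, if one exists, would behave like the second call.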

Overall accuracy of determining the break-points for autoflag/not autoflag:

The real solution to the problem of weight/detection information not necessarily being accurate is to periodically have SD run through all of the posts on MS and generate new detections (or no detections) for every post. The detections in SD are a moving target. Without doing this, we will always be significantly off when using the historical list of detections to determine how accurate the detection reasons are, and what weights actually indicate wrt. the likelihood that something is spam.

Having another, currently inactive, SD instance just run through the MS posts (fetched either through the MS API, or against the most recent database dump) is something that should happen (ideally automatically) on about a weekly basis, or at least monthly. The resulting current detections and why should then be updated in MS, while retaining the original detections and why for historical analysis.

Having the capability of SD running through the old posts would also allow us to test the effects of changes to detections based on the data we already have (e.g. instead of updating MS, we could generate a report of the differences in the detections from most recent run to change being tested). Right now, we make changes and just hope that it does a better job. Obviously, some changes can only really be reflected in new data (due to the existing selection bias in the MS post data), but other changes would certainly benefit. Being able to do this would allow much better feedback when trying to tune detections to more accurately detect TP and reject FP.
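A report of the differences between two detection runs could look roughly like the following Python sketch. The data shape (a dict mapping post ID to a set of reason names), the function name `detection_diff`, and the sample reason strings are assumptions for illustration only; neither SD nor MS is known to expose the data this way.

```python
def detection_diff(baseline, candidate):
    """Compare two detection runs.

    Both arguments map post_id -> set of reason names (hypothetical shape).
    Returns {post_id: (lost_reasons, gained_reasons)} for posts that changed.
    """
    report = {}
    for post_id in sorted(baseline.keys() | candidate.keys()):
        before = baseline.get(post_id, set())
        after = candidate.get(post_id, set())
        if before != after:
            report[post_id] = (before - after, after - before)
    return report


baseline = {1: {"bad keyword"}, 2: {"blacklisted website"}}
candidate = {1: {"bad keyword"}, 2: set(), 3: {"pattern-matching website"}}

for post_id, (lost, gained) in detection_diff(baseline, candidate).items():
    print(post_id, "lost:", sorted(lost), "gained:", sorted(gained))
```

Combining such a report with the existing TP/FP feedback on each post would show whether a change gains true positives or loses them.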

Hmmm... It's debatable if auxiliary data which SD determines to check for detections should be stored with the original detections (e.g. what ASN was found, or the IP to which a domain resolved, perspective score, etc.), as those are not in the SE data, but if they change, the result of the detection could change & we'd end up trying to track the difference down.

Having SD periodically run through the old data using current detections is the only way to have data that reflects the current state of SD. Without that accurate data, we will always have a harder time fitting an automatic is spam/is not spam decision matrix.

Undo1 commented 5 years ago

That's a really interesting idea. Selection bias in MS data would skew the statistical value of that kind of system, but it would probably still be practically useful.

ArtOfCode- commented 5 years ago

This is the scatter plot of weight/flag count. There certainly are instances of 4-flag posts in the 1000+ range, but they do seem less common than I'd expect.

ArtOfCode- commented 5 years ago

and this is the list of instances of 1000+ weight posts without 4 autoflags on them - 297 instances.

ArtOfCode- commented 5 years ago

Bug confirmed - I ran the numbers, and every single one of those posts was 100.00% certainty - i.e. should've been eligible for 4 flags.

[[121116, 100.0], [121196, 100.0], [121199, 100.0], [121212, 100.0], [121295, 100.0], [121423, 100.0], [121556, 100.0], [121625, 100.0], [121791, 100.0], [122103, 100.0], [122186, 100.0], [122208, 100.0], [122469, 100.0], [122614, 100.0], [122615, 100.0], [122848, 100.0], [122899, 100.0], [122907, 100.0], [122961, 100.0], [122978, 100.0], [122985, 100.0], [123478, 100.0], [123491, 100.0], [123767, 100.0], [124042, 100.0], [124062, 100.0], [124071, 100.0], [124088, 100.0], [124186, 100.0], [124199, 100.0], [124236, 100.0], [124313, 100.0], [124412, 100.0], [124425, 100.0], [124434, 100.0], [124436, 100.0], [124467, 100.0], [124518, 100.0], [124522, 100.0], [124533, 100.0], [124536, 100.0], [124598, 100.0], [124683, 100.0], [124928, 100.0], [125890, 100.0], [126013, 100.0], [127908, 100.0], [128455, 100.0], [130010, 100.0], [130477, 100.0], [130809, 100.0], [130899, 100.0], [131013, 100.0], [131289, 100.0], [131801, 100.0], [131942, 100.0], [132167, 100.0], [132358, 100.0], [132989, 100.0], [134194, 100.0], [134219, 100.0], [134311, 100.0], [134374, 100.0], [134472, 100.0], [134492, 100.0], [134639, 100.0], [134729, 100.0], [135397, 100.0], [136331, 100.0], [136645, 100.0], [137012, 100.0], [137138, 100.0], [137587, 100.0], [137951, 100.0], [138016, 100.0], [138293, 100.0], [138599, 100.0], [139002, 100.0], [139095, 100.0], [139431, 100.0], [139591, 100.0], [139632, 100.0], [139637, 100.0], [139781, 100.0], [139939, 100.0], [140064, 100.0], [140261, 100.0], [141001, 100.0], [141126, 100.0], [141205, 100.0], [141210, 100.0], [141407, 100.0], [141510, 100.0], [141518, 100.0], [141537, 100.0], [141547, 100.0], [141980, 100.0], [142085, 100.0], [142417, 100.0], [142469, 100.0], [142667, 100.0], [142680, 100.0], [142693, 100.0], [142822, 100.0], [142978, 100.0], [143107, 100.0], [143865, 100.0], [144101, 100.0], [144115, 100.0], [144339, 100.0], [144689, 100.0], [144694, 100.0], [144727, 100.0], [144768, 100.0], [144903, 100.0], [144951, 100.0], [145567, 100.0], [145579, 
100.0], [145595, 100.0], [145822, 100.0], [145884, 100.0], [145989, 100.0], [146009, 100.0], [146029, 100.0], [146297, 100.0], [146392, 100.0], [146666, 100.0], [146727, 100.0], [146877, 100.0], [147309, 100.0], [147328, 100.0], [147609, 100.0], [148064, 100.0], [148108, 100.0], [148254, 100.0], [148299, 100.0], [148492, 100.0], [148727, 100.0], [148854, 100.0], [149319, 100.0], [149360, 100.0], [149483, 100.0], [149694, 100.0], [149698, 100.0], [149712, 100.0], [149718, 100.0], [149751, 100.0], [149893, 100.0], [150101, 100.0], [150133, 100.0], [150236, 100.0], [150237, 100.0], [150240, 100.0], [150258, 100.0], [150263, 100.0], [150277, 100.0], [150296, 100.0], [150352, 100.0], [150370, 100.0], [150371, 100.0], [150489, 100.0], [150662, 100.0], [150794, 100.0], [150950, 100.0], [150977, 100.0], [150989, 100.0], [151022, 100.0], [151128, 100.0], [151135, 100.0], [151149, 100.0], [151290, 100.0], [151291, 100.0], [151353, 100.0], [151421, 100.0], [151423, 100.0], [151582, 100.0], [151628, 100.0], [151828, 100.0], [151882, 100.0], [152185, 100.0], [152208, 100.0], [152475, 100.0], [152888, 100.0], [152894, 100.0], [153075, 100.0], [153116, 100.0], [153119, 100.0], [153138, 100.0], [153153, 100.0], [153464, 100.0], [153494, 100.0], [153542, 100.0], [153698, 100.0], [153935, 100.0], [153937, 100.0], [154184, 100.0], [154416, 100.0], [154442, 100.0], [154560, 100.0], [155117, 100.0], [155142, 100.0], [155498, 100.0], [156091, 100.0], [156244, 100.0], [156498, 100.0], [156519, 100.0], [156781, 100.0], [156981, 100.0], [157187, 100.0], [157195, 100.0], [157422, 100.0], [157702, 100.0], [157806, 100.0], [157851, 100.0], [158074, 100.0], [158084, 100.0], [158096, 100.0], [158214, 100.0], [158249, 100.0], [158429, 100.0], [158875, 100.0], [159018, 100.0], [159026, 100.0], [159230, 100.0], [159250, 100.0], [159376, 100.0], [159638, 100.0], [159979, 100.0], [160032, 100.0], [160267, 100.0], [160297, 100.0], [160458, 100.0], [160493, 100.0], [160715, 100.0], [160869, 100.0], 
[160908, 100.0], [161134, 100.0], [161245, 100.0], [161254, 100.0], [161296, 100.0], [161317, 100.0], [161396, 100.0], [161410, 100.0], [161418, 100.0], [161506, 100.0], [161511, 100.0], [161518, 100.0], [161565, 100.0], [161729, 100.0], [161972, 100.0], [162221, 100.0], [162623, 100.0], [162865, 100.0], [162872, 100.0], [162908, 100.0], [162969, 100.0], [163150, 100.0], [163351, 100.0], [163485, 100.0], [163502, 100.0], [163564, 100.0], [164160, 100.0], [164545, 100.0], [164555, 100.0], [164581, 100.0], [164583, 100.0], [164717, 100.0], [164812, 100.0], [165097, 100.0], [165103, 100.0], [165120, 100.0], [165122, 100.0], [165132, 100.0]]
Undo1 commented 5 years ago

I can't look closely right now, but just saw the notif on my phone... sounds suspiciously like string sorting on "100.0" vs. "99.9"
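For readers unfamiliar with that failure mode, a short Python sketch shows how comparing certainties as strings puts "100.0" below "99.9". It only demonstrates the general pitfall being suspected; it does not show metasmoke's code, and the variable names are made up for the example.

```python
certainties = ["99.9", "100.0", "99.5"]

# Compared as strings, "100.0" sorts below "99.9" because '1' < '9'.
highest_text = max(certainties)

# Compared as numbers, 100.0 is the highest, as intended.
highest_number = max(float(c) for c in certainties)

print(highest_text, highest_number)
```

If a threshold check anywhere compared certainty as text, a post at 100.00% could fail a condition that a post at 99.9% passes.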

makyen commented 5 years ago

@ArtOfCode- I'm seeing somewhat different numbers. Based on the posts and # flags listed in the scatter plot you linked, there are 1,115 posts with weight > 1,000. Of those, 23 have 4 flags.

To be complete:

# Flags   # Posts
6         0
5         6
4         23
3         1,019
2         23
1         44
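As a quick check of the breakdown above, this Python sketch re-adds the counts and computes the share of posts at each flag count. The numbers are copied from the table; the variable names are illustrative.

```python
# Flag count -> number of posts with weight > 1,000 (from the table above).
flag_counts = {6: 0, 5: 6, 4: 23, 3: 1019, 2: 23, 1: 44}

total = sum(flag_counts.values())  # 1115, matching the stated total

for flags, posts in flag_counts.items():
    share = 100 * posts / total
    print(f"{flags} flags: {posts:>5} posts ({share:.1f}%)")
```

The counts sum to the 1,115 posts mentioned, and the 3-flag row accounts for the overwhelming majority of them.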

One situation that's known to result in 4 flags on a post with a current weight > 1k is when the weight at the time the post was reported was < 1k. There were a few examples of that in the screenshot linked in the original issue comment (based on the weight shown in the SD report and current weight obtained by AIM). In the screenshot, the most common difference is about 200 weight between when reported and current.

Given that there are also six posts with 5 flags, some of the 4-flag posts may date from the 5-flag experiment.

It would be interesting to filter out posts where those two reasons were the cause of the 4-flags to see if it accounts for all 23 4-flag posts. However, that's mostly just an academic interest. It doesn't really matter, unless it helps track down the reason for the bug, which I doubt it will.

tripleee commented 5 years ago

Related? https://github.com/Charcoal-SE/metasmoke/issues/625