Closed: LisDocument closed this issue 2 years ago.
@pvlugter mentioned something like this in his akka version. I’m not sure if it was the same accuracy observation.
I think this should be a benign change but we should run a few traces just for a quick sanity check. Thanks for pointing out this goof.
Thank you very much for your reply. I see that this is indeed mentioned in that issue. Consistent with my question, it can lead to delayed reset() execution and an untrustworthy size variable. Thanks again for your reply.
@LisDocument, @pvlugter: Can you provide a code snippet with your suggested fix?
The intent is to age the sketch by dividing all of the counters by two. This would mean that odd counters are treated the same as even ones (3/2 == 2/2 => 1). Therefore the number of odd counters is summed for a correction. As an item hashes into four counters, the odd counts are divided by four. The sampled size is then halved, minus the odd count skew.
I believe @pvlugter's suggestion is that the odd counts should be divided by eight, not four, in order to also halve them. This would make sense since we are aging all of the counters, but are currently overemphasizing the odd counts. Therefore the change would be,
```java
// Current
size = (size >>> 1) - (count >>> 2);

// Proposal
size = ((size - (count >>> 2)) >>> 1);
```
Is that correct?
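If it helps, here is a toy, self-contained comparison of the two adjustments (purely illustrative: a plain int[][] table with made-up sizes and a random stand-in for hashing, not the real sketch, which packs 4-bit counters into longs):

```java
import java.util.Random;

// Toy stand-in for the sketch: a depth-4 table of plain int counters, where
// each recorded addition increments one counter per row and bumps `size`.
final class ResetAdjustmentDemo {
  public static void main(String[] args) {
    final int depth = 4;
    final int width = 16;
    int[][] table = new int[depth][width];
    int size = 0;

    // Record 40 additions; picking a random slot per row stands in for hashing.
    Random random = new Random(42);
    for (int i = 0; i < 40; i++) {
      for (int[] row : table) {
        row[random.nextInt(width)]++;
      }
      size++;
    }

    // Aging step: count the odd counters, then halve everything.
    int count = 0;
    int halvedSum = 0;
    for (int[] row : table) {
      for (int i = 0; i < width; i++) {
        count += (row[i] & 1);
        row[i] >>>= 1;
        halvedSum += row[i];
      }
    }

    int current = (size >>> 1) - (count >>> 2);   // existing adjustment
    int proposed = (size - (count >>> 2)) >>> 1;  // halve the odd-count skew too
    int implied = halvedSum / depth;              // items implied by the aged counters

    System.out.printf("current=%d proposed=%d implied=%d%n", current, proposed, implied);
  }
}
```

The proposed form stays within rounding of the implied count, since (size - count/4) / 2 and (4*size - count) / 8 are the same quantity, while the current form subtracts roughly count/8 too much.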
/cc @gilga1983 who suggested this error correction trick when explaining his TinyLFU algorithm to me. I can see why the proposed changes would make sense, it was likely his intent, and my mistake.
The optimization is aimed at keeping the tracked number of items in the sketch consistent with the counters. Since odd numbers are rounded down, for every odd number we reduce the number of items by one. I understand the problem now (as I said, it is a new idea): the problem is that we adjust the total count for every individual counter, rather than for all of an item's counters together.
So that technique does not work, because every item touches 4 counters and so we end up counting more odd counters than items. It may be interesting to track the sum of all counters (which should be 4x the number of items); in that case subtracting 1 for every odd counter (pre-reset) makes sense. We can also ignore that optimization and just run with a slightly larger window (if it raises the hit rate, it means that we benefit from larger windows).
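To sketch that bookkeeping in code (purely illustrative, not existing code): in counter-sum space the odd-counter correction is exact, because an odd counter c halved and rounded down becomes (c - 1) / 2.

```java
// Illustrative only: track the sum of all counters (~4x the number of items in
// a depth-4 sketch) and subtract 1 per odd counter before halving.
final class CounterSumDemo {
  public static void main(String[] args) {
    int[] row = {2, 3, 5, 7, 8};     // arbitrary toy counters
    int total = 0, odd = 0, agedSum = 0;
    for (int c : row) {
      total += c;                    // sum of the counters
      odd += (c & 1);                // counters that lose an extra 0.5 when halved
      agedSum += (c >>> 1);          // what each counter becomes after aging
    }
    // Subtracting 1 per odd counter before halving reproduces the aged sum exactly.
    System.out.println(((total - odd) / 2) + " == " + agedSum);  // prints "11 == 11"
  }
}
```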
Thank you for your reply. I'm sorry that I raised a question without giving a corresponding solution. In fact, my idea is the same as @pvlugter's: divide by 8 instead of 4 when accounting for the odd counters. The solution you gave seems to me to fix the problem.
Yes, that change looks good for getting a more accurate size. And it's because every odd counter, when divided by two and rounded down, is off by 0.5 rather than 1 in the new size.
Simple example, and ignoring the depth and averaging over rows: if before reset the counters were `1 2 3 4 5 6 7 8 9` with size of `45`, then after reset the halved counters will be `0 1 1 2 2 3 3 4 4` and the new size should be `20`. The number of odd counters before reset is `5`, so the adjusted size would currently be `45 / 2 - 5 = 17`, while also halving the number of odd counters gives the correct size with `(45 - 5) / 2 = 20`.
As it currently over-adjusts, it just delays the reset operation by a little, which would be similar to having a slightly larger sample size parameter.
In version 3.0.5, FrequencySketch calls reset() when size == sampleSize. The size adjustment in reset() does not account for the fact that adding two odd numbers and then dividing the sum by 2 differs from dividing each odd number by 2 and then adding them.
```
size = (size >>> 1) - (count >>> 2)
(7 + 7)/2 - 2 = 5  // now
7/2 + 7/2     = 6  // correct
```