dialtr / libcount

C/C++ Implementation of the HyperLogLog++ cardinality estimation algorithm.
Apache License 2.0
26 stars 10 forks source link

linear interpolation leads to some crazy estimates #14

Open astolman opened 8 years ago

astolman commented 8 years ago

The linear interpolation you wrote for estimating the biases seems like a reasonable approach, but I found a case where it leads to absurd estimates. I had a counter of precision 5 with a raw estimate of 128 point 6 something and the two closest raw estimates libcount found in the table were like 128.3464 and 128.3462, and their respective biases are about 1 apart. The linear interpolation said that it's bias should be in the thousands since the straight line connecting those two closest points is so steep. As a result, I got back a negative cardinality estimation, which is definitely not right. The spec in the paper just says "interpolate" without specifying how, so I would instead of just looking for the two nearest neighbors, look for the interval which contains the raw estimate and find where the estimate lies on that line. That seems like another reasonable interpolation and avoids the bias estimate for a given value to be wildly larger than that of its nearest neighbors.

dialtr commented 8 years ago

That sounds very reasonable. I'll work on a fix.

Sent from my iPhone

On Aug 16, 2016, at 5:28 PM, astolman notifications@github.com wrote:

The linear interpolation you wrote for estimating the biases seems like a reasonable approach, but I found a case where it leads to absurd estimates. I had a counter of precision 5 with a raw estimate of 128 point 6 something and the two closest raw estimates libcount found in the table were like 128.3464 and 128.3462, and their respective biases are about 1 apart. The linear interpolation said that it's bias should be in the thousands since the straight line connecting those two closest points is so steep. As a result, I got back a negative cardinality estimation, which is definitely not right. The spec in the paper just says "interpolate" without specifying how, so I would instead of just looking for the two nearest neighbors, look for the interval which contains the raw estimate and find where the estimate lies on that line. That seems like another reasonable interpolation and avoids the bias estimate for a given value to be wildly larger than that of its nearest neighbors.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

dialtr commented 7 years ago

Hey Andrew, I finally got around to a fix for this, would love to get your review on it. The branch is:

tdial-interpolation-fix

I am working on certification tests next.

dialtr commented 7 years ago

Actually, hold off. There's an issue with the data in the tables.

astolman commented 7 years ago

I wrote a fix for it in my adaptation of libcount. I think I had to sort the data that you used because it wasn’t sorted in a few cases. You can see what I did in my project https://github.com/astolman/HyperHeadTail https://github.com/astolman/HyperHeadTail in the shll folder there. I needed a HyperLogLog implementation that supported a sliding window so I heavily rewrote the hll.cc files, but the rest of it is changed as little as possible. It’s been a while, but I think the stuff in empirical_data.cc contains the now appropriately sorted values.

On Mar 6, 2017, at 1:29 AM, Tom Dial notifications@github.com wrote:

Actually, hold off. There's an issue with the data in the tables.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dialtr/libcount/issues/14#issuecomment-284344960, or mute the thread https://github.com/notifications/unsubscribe-auth/AS0tCs4PoHovc7kwdHUFUWj7I9ZUOSWWks5ri9IBgaJpZM4Jl3qJ.

astolman commented 7 years ago

I think these cases work out so that your cases are log of mine. I think I did that as part of fixing the problem with the first bug report I submitted.

As far as my use for libcount, I did a summer project at Sandia National Labs where I used it for a streaming algorithm. I’m still (slowly) writing a paper for it that should be up on ArXiv soon.

On Mar 6, 2017, at 2:13 PM, Tom Dial notifications@github.com wrote:

P.S.

Unless I am mistaken, in your version of EMP_alpha(), I think you have the first three switch cases(16, 32, 64) wrong.

Or am I missing something?

TD

On Mon, Mar 6, 2017 at 4:26 PM, astolman notifications@github.com wrote:

I wrote a fix for it in my adaptation of libcount. I think I had to sort the data that you used because it wasn’t sorted in a few cases. You can see what I did in my project https://github.com/astolman/HyperHeadTail < https://github.com/astolman/HyperHeadTail> in the shll folder there. I needed a HyperLogLog implementation that supported a sliding window so I heavily rewrote the hll.cc files, but the rest of it is changed as little as possible. It’s been a while, but I think the stuff in empirical_data.cc contains the now appropriately sorted values.

On Mar 6, 2017, at 1:29 AM, Tom Dial notifications@github.com wrote:

Actually, hold off. There's an issue with the data in the tables.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub < https://github.com/dialtr/libcount/issues/14#issuecomment-284344960>, or mute the thread https://github.com/notifications/unsubscribe-auth/ AS0tCs4PoHovc7kwdHUFUWj7I9ZUOSWWks5ri9IBgaJpZM4Jl3qJ.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/dialtr/libcount/issues/14#issuecomment-284538065, or mute the thread https://github.com/notifications/unsubscribe-auth/AH5tLKp-pvOtxKkR98tcuw1Sf6y5186Mks5rjHoUgaJpZM4Jl3qJ .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dialtr/libcount/issues/14#issuecomment-284550857, or mute the thread https://github.com/notifications/unsubscribe-auth/AS0tCpHeAL8_G6MIzeJcj6u7_vGPc3iRks5rjITtgaJpZM4Jl3qJ.

dialtr commented 7 years ago

Linear interpolation fix completed. Note that the empirical data is now sorted properly. Also added a simple certification test. Just run:

make certify