Open astolman opened 8 years ago
That sounds very reasonable. I'll work on a fix.
Sent from my iPhone
On Aug 16, 2016, at 5:28 PM, astolman notifications@github.com wrote:
The linear interpolation you wrote for estimating the biases seems like a reasonable approach, but I found a case where it leads to absurd estimates. I had a counter of precision 5 with a raw estimate of 128 point 6 something and the two closest raw estimates libcount found in the table were like 128.3464 and 128.3462, and their respective biases are about 1 apart. The linear interpolation said that it's bias should be in the thousands since the straight line connecting those two closest points is so steep. As a result, I got back a negative cardinality estimation, which is definitely not right. The spec in the paper just says "interpolate" without specifying how, so I would instead of just looking for the two nearest neighbors, look for the interval which contains the raw estimate and find where the estimate lies on that line. That seems like another reasonable interpolation and avoids the bias estimate for a given value to be wildly larger than that of its nearest neighbors.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.
Hey Andrew, I finally got around to a fix for this, would love to get your review on it. The branch is:
tdial-interpolation-fix
I am working on certification tests next.
Actually, hold off. There's an issue with the data in the tables.
I wrote a fix for it in my adaptation of libcount. I think I had to sort the data that you used because it wasn’t sorted in a few cases. You can see what I did in my project https://github.com/astolman/HyperHeadTail https://github.com/astolman/HyperHeadTail in the shll folder there. I needed a HyperLogLog implementation that supported a sliding window so I heavily rewrote the hll.cc files, but the rest of it is changed as little as possible. It’s been a while, but I think the stuff in empirical_data.cc contains the now appropriately sorted values.
On Mar 6, 2017, at 1:29 AM, Tom Dial notifications@github.com wrote:
Actually, hold off. There's an issue with the data in the tables.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dialtr/libcount/issues/14#issuecomment-284344960, or mute the thread https://github.com/notifications/unsubscribe-auth/AS0tCs4PoHovc7kwdHUFUWj7I9ZUOSWWks5ri9IBgaJpZM4Jl3qJ.
I think these cases work out so that your cases are log of mine. I think I did that as part of fixing the problem with the first bug report I submitted.
As far as my use for libcount, I did a summer project at Sandia National Labs where I used it for a streaming algorithm. I’m still (slowly) writing a paper for it that should be up on ArXiv soon.
On Mar 6, 2017, at 2:13 PM, Tom Dial notifications@github.com wrote:
P.S.
Unless I am mistaken, in your version of EMP_alpha(), I think you have the first three switch cases(16, 32, 64) wrong.
Or am I missing something?
TD
On Mon, Mar 6, 2017 at 4:26 PM, astolman notifications@github.com wrote:
I wrote a fix for it in my adaptation of libcount. I think I had to sort the data that you used because it wasn’t sorted in a few cases. You can see what I did in my project https://github.com/astolman/HyperHeadTail < https://github.com/astolman/HyperHeadTail> in the shll folder there. I needed a HyperLogLog implementation that supported a sliding window so I heavily rewrote the hll.cc files, but the rest of it is changed as little as possible. It’s been a while, but I think the stuff in empirical_data.cc contains the now appropriately sorted values.
On Mar 6, 2017, at 1:29 AM, Tom Dial notifications@github.com wrote:
Actually, hold off. There's an issue with the data in the tables.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub < https://github.com/dialtr/libcount/issues/14#issuecomment-284344960>, or mute the thread https://github.com/notifications/unsubscribe-auth/ AS0tCs4PoHovc7kwdHUFUWj7I9ZUOSWWks5ri9IBgaJpZM4Jl3qJ.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/dialtr/libcount/issues/14#issuecomment-284538065, or mute the thread https://github.com/notifications/unsubscribe-auth/AH5tLKp-pvOtxKkR98tcuw1Sf6y5186Mks5rjHoUgaJpZM4Jl3qJ .
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dialtr/libcount/issues/14#issuecomment-284550857, or mute the thread https://github.com/notifications/unsubscribe-auth/AS0tCpHeAL8_G6MIzeJcj6u7_vGPc3iRks5rjITtgaJpZM4Jl3qJ.
Linear interpolation fix completed. Note that the empirical data is now sorted properly. Also added a simple certification test. Just run:
make certify
The linear interpolation you wrote for estimating the biases seems like a reasonable approach, but I found a case where it leads to absurd estimates. I had a counter of precision 5 with a raw estimate of 128 point 6 something and the two closest raw estimates libcount found in the table were like 128.3464 and 128.3462, and their respective biases are about 1 apart. The linear interpolation said that it's bias should be in the thousands since the straight line connecting those two closest points is so steep. As a result, I got back a negative cardinality estimation, which is definitely not right. The spec in the paper just says "interpolate" without specifying how, so I would instead of just looking for the two nearest neighbors, look for the interval which contains the raw estimate and find where the estimate lies on that line. That seems like another reasonable interpolation and avoids the bias estimate for a given value to be wildly larger than that of its nearest neighbors.