addthis / stream-lib

Stream summarizer and cardinality estimator.
Apache License 2.0
2.26k stars 559 forks source link

Hyperbitbit #131

Closed positiveblue closed 7 years ago

positiveblue commented 7 years ago

This is the first implementation of HyperBitBit. It is totally functional and has the algorithm implemented following the Robert Sedgewick presentation at AofA'16.

There are some tests for .cardinality, and the algorithm is documented in its own file.

oertl commented 7 years ago

I do not know if the HyperBitBit algorithm is a serious alternative to the HyperLogLog algorithm. The HyperLogLog algorithm has following important properties:

1) inserting the same element multiple times does not further increase the cardinality estimate 2) the insertion order does not play a role 3) the data structure can be merged

If you give up some of these properties you can get more space-efficient algorithms. For example the self-learning bitmap or the historic inverse probability (HIP) HyperLogLog estimator give up properties 2 and 3.

The HyperBitBit algorithm even gives up property 1. Therefore, I believe the algorithm only works, if you implicitly make assumptions about the incoming data stream. The included unit test only checks the most simple case, counting the unique elements of a data stream which consists only of unique elements. For this use case I would use a single integer which is obviously even more space efficient. I think between the two extreme cases, HyperLogLog counter which makes no assumptions on the incoming data and the integer counter assuming all incoming elements are distinct, there are likely many algorithms with space needs between both. I believe HyperBitBit is one of them. However, the assumptions it makes on incoming data is not really studied yet.

abramsm commented 7 years ago

@oertl agree that it is not clear that HyperBitBit is a serious alternative for production use. That said I think that it is OK and in fact good to have additional algorithms in the library for comparison and academic purposes. We probably need a best practices FAQ in the project so that users of the library use the best algorithm for their needs. Another example is we still have the older versions of HyperLogLog in the library and there are very few cases where using those would be better than using HyperLogLogPlus.

oertl commented 7 years ago

@abramsm The HyperBitBit algorithm is just an idea presented on a single slide without any deeper theoretical analysis. (This is in contrast to other older cardinality estimation algorithms.) The limitations of the HyperBitBit algorithm are simply not known. I would add some comments that the algorithm should be used with caution. Otherwise, I fear people will just use it without knowing they could get wrong results. As there is no theoretical justification, I also recommend to add a couple more unit tests that are more sophisticated than the current one. The current unit test would even pass for a simple integral counter that is incremented each time an element is offered.

positiveblue commented 7 years ago

@oertl I highly agree with all the points that you exposed above. HyperBitBit is not a mature algorithm and should not be used in production expecting the same guarantees as HyperLogLog, KMV or Recordinality.

This is more a "seed" implementation of the algorithm, and I hope more people working on it to improve and to refine the idea.

Robert Sedgewick says in his presentation that the magic number (5.4) which appears in the formula to estimate the cardinality has been obtained empirically using real world data, but I could not find any evidence of it (which data?). The two points I am more worried about are:

Even so, I think would be nice to have an implementation of the algorithm in stream-lib given the nature of the algorithm and library.

I will add more tests to the algorithm, but they will take into account that this implementation doesn't aim right now to improve the results that can be accomplished using other algorithms of the library.

abramsm commented 7 years ago

Good discussion. I suggest we make a 'research' or 'experimental' package for implementations like this. Hopefully it will make it obvious to the casual observer that the implementation is not meant/ready for production usage.

Matt

On Mon, Mar 27, 2017 at 2:31 PM, Jordi Montes Sanabria < notifications@github.com> wrote:

@oertl https://github.com/oertl I highly agree with all the points that you exposed above. HyperBitBit is not a mature algorithm and should not be used in production expecting the same guarantees as HyperLogLog, KMV or Recordinality.

This is more a "seed" implementation of the algorithm, and I hope more people working on it to improve and to refine the idea.

Robert Sedgewick says in his presentation that the magic number (5.4) which appears in the formula to estimate the cardinality has been obtained empirically using real world data, but I could not find any evidence of it (which data?). The two points I am more worried about are:

  • The structure can not be merged (and seems really hard to find a way to mix it)
  • Inserting the same element multiple times does not further increase the cardinality estimate (in some real world cases this can be a problem)

Even so, I think would be nice to have an implementation of the algorithm in stream-lib given the nature of the algorithm and library.

I will add more tests to the algorithm, but they will take into account that this implementation doesn't aim right now to improve the results that can be accomplished using other algorithms of the library.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/addthis/stream-lib/pull/131#issuecomment-289576524, or mute the thread https://github.com/notifications/unsubscribe-auth/AABfa-3Wj2nwPBOV8R6SfhXJ-SBZHdheks5rqByIgaJpZM4MpLXE .

positiveblue commented 7 years ago

I guess that would solve the problem with "casual" users.

How would you organize the new folder? Would you put it under com.clearspring.analytics or you would create a new com.clearsping.experimental (I am not a java developer, so I am not sure which one makes more sense)

positiveblue commented 7 years ago

I already created the com.clearsping.experimental packager. I included HyperBitBit with its tests (by now with the @ignore annotation) on the package so it should be "safe" to merge the pull request without having to worry about people using it by error.

As I commented on the mailing list, I implemented HyperBitBit after a Robert Sedgewick's talk. It was quite easy and it did not require a big effort. After checking I only found and implementation in Go so I thought it could be useful for the future. However, I am doing some research on other cardinality algorithms (Recordinality/AdKMV) which are might be competitive with HyperLogLog. I would like to use stream-lib in my research so if the experimental package succeed we could move them to the analytics package easily having a production ready implementation.

Let me know if you need something else.

positiveblue commented 7 years ago

I am not sure if Travis is complaining about some changes that I introduce in the branch or it was something else. The error only appears in the VM with openJDK 7 installed.

The output is:

*** buffer overflow detected ***: /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java terminated

That day many other pull request failed. I saw that they fixed it removing the openJDK from the travis build.

I would like to finish this and move on with the AdaKMV implementation because it is the algorithm that really brings some value to the library.

Which should be the next step?

Thanks.

abramsm commented 7 years ago

approving this PR for experimental branch. we should probably add a README that things in this branch should be used for research purposes only.