apache / datasketches-java

A software library of stochastic streaming algorithms, a.k.a. sketches.
https://datasketches.apache.org
Apache License 2.0

Python Bindings for HLL Sketch #161

Closed anirudhacharya closed 6 years ago

anirudhacharya commented 6 years ago

As per the discussion in this thread in the google group - https://groups.google.com/d/msg/sketches-user/8TaAXaT_6qo/A2JJkIuZBQAJ

there is traction for having Python bindings for the different sketch families in sketches-core, similar to the adaptors the library already has for Pig and Hive. I was thinking we could get started on the Python adaptors with a wrapper library for the HyperLogLog sketches. Would that be a good place to start?

For Pig and Hive, the bindings were defined as UDFs that Pig and Hive scripts can use. How will we define the wrapper classes in Python? Will it be something along the lines of Jython - http://www.jython.org/jythonbook/en/1.0/JythonAndJavaIntegration.html

jmalkin commented 6 years ago

If it would add a package dependency to the pom, the Python binding will need to live in a different repository.

AlexanderSaydakov commented 6 years ago

I wouldn't call this "traction". No third party has expressed any interest yet.

leerho commented 6 years ago

Having the library adapted for different languages makes a lot of sense, and Python is a good place to start. As Jon points out, we may need to set up a separate repo for language adaptors or merge it into one of the other repos. Sketches-misc might be a candidate.

anirudhacharya commented 6 years ago

Python does not use Maven for build and dependency management. It has a very different tool called distutils (http://docs.activestate.com/activepython/3.2/diveintopython3/html/packaging.html).

Also, I still have to explore the performance overhead that a python wrapper might impose on the existing sketches-core library. Rather than an adaptor, it might make sense to write a python library from scratch, rather than a wrapper for an existing java library.

I see that Edo Liberty (https://edoliberty.github.io/), who was until recently at Yahoo, already has Python repos for two of the sketching algorithms: Frequent Directions (https://github.com/edoliberty/frequent-directions) and Streaming Quantiles (https://github.com/edoliberty/streaming-quantiles).

Also, there is an existing Python library published for the CountMin sketch. But I don't think there are any implementations of HLL, theta sketch, tuple sketch, or sampling sketches in Python.

edoliberty commented 6 years ago

My Python repos are meant for research purposes. They are unrelated to the sketches-core library.

As a whole, I’m a big Python fan. I also see that Python bindings were, for many open source projects, the inflection point that started wide user base adoption.

I’m all for creating Python bindings if we could do it in a thoughtful way that doesn’t compromise performance.


leerho commented 6 years ago

I would strongly advise against attempting to write a python DataSketches library from scratch:

  • The implementations of the various sketches in the core library have been highly optimized for performance, and as a result, the implementations are quite complex and leverage a lot of subtle techniques. Any single sketch, such as the HllSketch, is not one algorithm, but a collection of algorithms and techniques. Redesigning even a single sketch from scratch in Python is a huge task and without the knowledge of the design internals you would be at a big disadvantage.
  • A parallel implementation would not benefit from continuous updates and performance improvements of the core java library. This would double the effort in maintaining and supporting the python code base. Our small development team could not possibly take this on or support it.

The public APIs of the core library, on the other hand, are quite stable, and "relatively" consistent across the different sketch families.

There are Python/Java tools out there, e.g., py4j (https://www.py4j.org/index.html), and several others. I would suggest that if you are passionate about Python, a great place to start would be to investigate various Python/Java interfaces and evaluate them for performance, stability, and ease of use. This, by itself, would be a major contribution that other Python developers could then leverage.

edoliberty commented 6 years ago

Lee, we all agree with you. The suggestion was to find a way to use the java library from within python without compromising performance.

leerho commented 6 years ago

My comment was in response to the suggestion:

"Rather than an adaptor, it might make sense to write a python library from scratch, rather than a wrapper for an existing java library."

edoliberty commented 6 years ago

Ahh, I missed that.


anirudhacharya commented 6 years ago

I can work on this on weekends. I will invest time in this direction.

There are Python/Java tools out there, e.g., py4j, and several others. I would suggest that if you are passionate about Python, a great place to start would be to investigate various Python/Java interfaces and evaluate them for performance, stability, and ease of use. This, by itself, would be a major contribution that other Python developers could then leverage.
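As a starting point for that kind of evaluation, here is a minimal sketch of a timing harness that measures the average cost of one `update` call through whatever bridge is being tested. The names `time_updates`, `SetBaseline`, and `run_baseline` are made up for this illustration and are not part of sketches-core or py4j. The baseline is an exact Python `set`, included only so the harness runs without a JVM. The py4j lines are left as comments because they are an untested assumption: they presume py4j is installed, a Java GatewayServer is already running with sketches-core on its classpath, and the sketch class is reachable as `com.yahoo.sketches.hll.HllSketch`.

```python
import time
from typing import Callable, Iterable, Tuple


def time_updates(update: Callable[[int], None], items: Iterable[int]) -> float:
    """Return the average number of seconds per update call over the items."""
    count = 0
    start = time.perf_counter()
    for item in items:
        update(item)
        count += 1
    elapsed = time.perf_counter() - start
    return elapsed / count if count else 0.0


class SetBaseline:
    """Exact distinct counter; a stand-in so the harness runs without a JVM."""

    def __init__(self) -> None:
        self._seen = set()

    def update(self, item: int) -> None:
        self._seen.add(item)

    def get_estimate(self) -> float:
        return float(len(self._seen))


def run_baseline(n: int = 100_000) -> Tuple[float, float]:
    """Time n updates against the pure-Python baseline."""
    baseline = SetBaseline()
    per_call = time_updates(baseline.update, range(n))
    return per_call, baseline.get_estimate()


# Assumption (untested here): a py4j-backed candidate could be timed the same
# way once a Java GatewayServer with sketches-core on its classpath is running.
#
#   from py4j.java_gateway import JavaGateway
#   gateway = JavaGateway()
#   sketch = gateway.jvm.com.yahoo.sketches.hll.HllSketch(12)  # lgConfigK = 12
#   per_call = time_updates(sketch.update, range(100_000))
#   print(per_call, sketch.getEstimate())

if __name__ == "__main__":
    per_call, estimate = run_baseline()
    print(f"baseline: {per_call * 1e9:.0f} ns/update, estimate={estimate:.0f}")
```

Running the same `time_updates` call against each candidate bridge would give comparable per-call numbers. Stability and ease of use would still need to be judged separately.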

edoliberty commented 6 years ago

That's great, Anirudh. I think you are doing the right thing. If one could pip install datasketches, it would be a major driver of adoption. A note about pure Python: I (and others) have tried several times to get it to perform. It just doesn't. The Java library is way (way!) faster.

Edo


leerho commented 6 years ago

There has been no activity on this issue for several weeks, so I am closing this issue for now. We can always reopen this issue in the future.