apache / datasketches-python

Apache datasketches
https://datasketches.apache.org
Apache License 2.0

Vectorize update() #21

Open jmalkin opened 11 months ago

jmalkin commented 11 months ago

Looping in Python is slow. For all sketches, we should enable update() to accept multiple inputs so that C++ handles the iteration.

For sketches that take primitive types this is simple and can be done by overloading update(). For item containers it may be less straightforward: a list is itself an object, so an overloaded update() may treat the list as a single item to ingest rather than as a batch of items.
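A minimal pure-Python sketch of that ambiguity. The class names here are hypothetical stand-ins, not part of the datasketches API; they only show why a list-unpacking update() on an object sketch leaves no way to ingest a list as one item.

```python
class ToyItemsSketch:
    """Hypothetical stand-in for a sketch that accepts arbitrary Python objects."""

    def __init__(self):
        self.items = []

    def update(self, item):
        # Any object is a valid item, so a list is recorded as ONE item.
        self.items.append(item)


class ToyUnpackingSketch(ToyItemsSketch):
    """Hypothetical variant whose update() unpacks lists into separate items."""

    def update(self, item):
        if isinstance(item, list):
            # The list is treated as a batch; it can no longer be a single item.
            for element in item:
                super().update(element)
        else:
            super().update(item)


single = ToyItemsSketch()
single.update(["a", "b", "c"])      # stored as one item

unpacked = ToyUnpackingSketch()
unpacked.update(["a", "b", "c"])    # stored as three items

print(len(single.items), len(unpacked.items))
```

Both readings are defensible for an object sketch, which is why a single overloaded name cannot serve both.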

IIRC, the wrappers (both pybind11 and nanobind) try the candidate overloads in the order in which they were declared in the wrapper definition. Relying on that is not good API design: the behavior is not transparent to users, and internal rearranging of code can cause side effects. So we probably need a different method name. Then we run into the question of whether we should use an overload where practical and a different name only where necessary, or use the new name everywhere for consistency.
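To illustrate the concern, here is a toy first-match dispatcher in plain Python. It is only a model of "try overloads in declaration order", under the assumption stated above; it is not how pybind11 or nanobind are implemented, and every name in it is hypothetical.

```python
def make_dispatcher(overloads):
    """Build a function that calls the first overload whose predicate accepts the argument.

    `overloads` is a list of (accepts, handler) pairs, tried in list order.
    """
    def dispatch(arg):
        for accepts, handler in overloads:
            if accepts(arg):
                return handler(arg)
        raise TypeError("no matching overload")
    return dispatch


def accepts_object(arg):
    return True  # a generic-object overload matches anything


def accepts_list(arg):
    return isinstance(arg, list)


def ingest_single(arg):
    return ("single", 1)


def ingest_batch(arg):
    return ("batch", len(arg))


# Same two overloads, declared in a different order.
object_first = make_dispatcher([(accepts_object, ingest_single),
                                (accepts_list, ingest_batch)])
list_first = make_dispatcher([(accepts_list, ingest_batch),
                              (accepts_object, ingest_single)])

print(object_first([1, 2, 3]))  # the list is taken as one object
print(list_first([1, 2, 3]))    # the list is taken as a batch of three
```

Swapping two declarations silently changes what update([1, 2, 3]) means, which is the kind of side effect a separate method name would avoid.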

jmalkin commented 10 months ago

Proposing that we allow update() to operate on vectors for primitives. Because that won't work for generic objects, those sketches will use update_batch() instead. update_batch() will also exist and work for primitives.
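A rough pure-Python mock of how the proposed naming could look from the user's side. The classes are hypothetical and are not the actual wrapper code; the Python loops stand in for iteration that would live in C++.

```python
from numbers import Real


class ToyPrimitiveSketch:
    """Hypothetical sketch over numbers: update() takes a scalar or a sequence."""

    def __init__(self):
        self.n = 0
        self.total = 0.0

    def _update_one(self, value):
        self.n += 1
        self.total += value

    def update_batch(self, values):
        # Stand-in for a loop that the real wrapper would run in C++.
        for value in values:
            self._update_one(value)

    def update(self, value):
        # A number can never be confused with a sequence, so overloading is safe.
        if isinstance(value, Real):
            self._update_one(value)
        else:
            self.update_batch(value)


class ToyObjectSketch:
    """Hypothetical sketch over arbitrary objects: update() is always one item."""

    def __init__(self):
        self.n = 0

    def update(self, item):
        self.n += 1  # a list passed here counts as a single item

    def update_batch(self, items):
        for item in items:
            self.update(item)


prim = ToyPrimitiveSketch()
prim.update(1.5)
prim.update([2.0, 3.0])
prim.update_batch([4.0])

objs = ToyObjectSketch()
objs.update(["a", "b"])          # one item
objs.update_batch(["a", "b"])    # two items

print(prim.n, objs.n)
```

With this split, update_batch() means the same thing on every sketch, while update() on an object sketch never has to guess.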

c-dickens commented 10 months ago

I second this change and am fine with the naming. Having two related method names should not be too unusual for Python users (for example, scikit-learn has both fit() and partial_fit() methods).