apache / datasketches-postgresql

PostgreSQL extension providing approximate algorithms based on apache/datasketches-cpp
https://datasketches.apache.org
Apache License 2.0

Question: how to add an item to a theta sketch #48

Closed cjrh closed 2 years ago

cjrh commented 2 years ago

After creating the extension in Postgres, the routine theta_sketch_add_item exists, but it isn't usable from client SQL (the signature says internal). What is the correct way to add an item to an existing theta sketch? Currently I'm doing this:

        UPDATE table SET
            sketch = theta_sketch_union(sketch, (SELECT theta_sketch_build($1)))
        WHERE
            id = $2;

AlexanderSaydakov commented 2 years ago

Yes, the theta_sketch_add_item function is an internal function. It is a state transition function that is needed to define an aggregate function. Usually sketches are built from raw data for particular periods of time and particular combinations of dimensions. Those datasets become the base table (hypercube). The base table is then queried for a particular reporting period with a subset of dimensions, and that query is where the union of sketches is applied.
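
To make the base-table idea concrete, here is a minimal sketch of the query side. The table hourly_users and its columns are hypothetical, and the date range is arbitrary; the only extension features relied on are the theta_sketch type, the theta_sketch_union aggregate, and theta_sketch_get_estimate. The rows are assumed to have been produced earlier by the theta_sketch_build aggregate, one sketch per hour and dimension combination.

```sql
-- Hypothetical base table ("hypercube"): names are illustrative only.
-- Each row holds one sketch for one hour and one dimension combination.
CREATE TABLE hourly_users (
    hour    timestamptz  NOT NULL,
    country text         NOT NULL,
    device  text         NOT NULL,
    sketch  theta_sketch NOT NULL,
    PRIMARY KEY (hour, country, device)
);

-- Reporting query: one week, grouped by a subset of the dimensions (country).
-- The theta_sketch_union aggregate merges every matching sketch, and
-- theta_sketch_get_estimate turns the merged sketch into a distinct count.
SELECT country,
       theta_sketch_get_estimate(theta_sketch_union(sketch)) AS approx_users
FROM hourly_users
WHERE hour >= '2024-01-01' AND hour < '2024-01-08'
GROUP BY country;
```

Because the union happens at query time, the stored sketches never need to be updated in place once their hour is closed.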

cjrh commented 2 years ago

Thank you. I'll rethink the way I'm using it. My interface is an HTTP API that receives events, and I'm adding those events directly to existing sketches with no intermediate storage. Perhaps I can introduce a buffer layer that first accumulates events into intermediate sketches and then merges those into the existing ones.
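
One possible shape for such a buffer layer, kept entirely inside PostgreSQL, is sketched below. The tables sketches and event_buffer and their columns are assumptions made for illustration (sketches mirrors the id and sketch columns from the original query); only theta_sketch_build and the two-argument theta_sketch_union come from the extension. The idea is that the HTTP handler does a cheap INSERT into event_buffer, and a periodic job runs the flush statement.

```sql
-- Hypothetical tables: names are illustrative only.
CREATE TABLE sketches (
    id     bigint       PRIMARY KEY,
    sketch theta_sketch NOT NULL
);
CREATE TABLE event_buffer (
    sketch_id bigint NOT NULL,
    item      text   NOT NULL
);

-- Periodic flush, as a single statement:
-- 1. drain the buffer,
-- 2. build one intermediate sketch per target id,
-- 3. merge each intermediate sketch into the stored one (or insert it
--    if no sketch exists yet for that id).
WITH drained AS (
    DELETE FROM event_buffer
    RETURNING sketch_id, item
), batch AS (
    SELECT sketch_id, theta_sketch_build(item) AS sketch
    FROM drained
    GROUP BY sketch_id
)
INSERT INTO sketches (id, sketch)
SELECT sketch_id, sketch
FROM batch
ON CONFLICT (id) DO UPDATE
SET sketch = theta_sketch_union(sketches.sketch, EXCLUDED.sketch);
```

With this arrangement each stored sketch is deserialized, merged, and serialized once per flush rather than once per event.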

AlexanderSaydakov commented 2 years ago

Sketches are meant to help process big data. Building a sketch for one record, deserializing an existing sketch, performing a union, and serializing the result all add a lot of overhead, which seems counterproductive. I would suggest collecting raw data for some period of time (say, an hour). When the hour closes, produce an aggregated segment.
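
As a rough illustration of that suggestion, the following sketch rolls closed hours of raw data into one segment per hour. The tables raw_events and hourly_segments and their columns are hypothetical; theta_sketch_build is the extension's aggregate. It assumes the job runs once shortly after each hour closes and that no late events arrive for an hour that already has a segment (a late event would hit the primary key and would need a merge step instead).

```sql
-- Hypothetical tables: names are illustrative only.
CREATE TABLE raw_events (
    event_time timestamptz NOT NULL,
    item       text        NOT NULL
);
CREATE TABLE hourly_segments (
    hour   timestamptz  PRIMARY KEY,
    sketch theta_sketch NOT NULL
);

-- Close-of-hour job: move every event from fully elapsed hours out of the
-- raw table and store one aggregated sketch per hour. Deleting and
-- aggregating in the same statement means no event is dropped unaggregated.
WITH closed AS (
    DELETE FROM raw_events
    WHERE event_time < date_trunc('hour', now())
    RETURNING event_time, item
)
INSERT INTO hourly_segments (hour, sketch)
SELECT date_trunc('hour', event_time), theta_sketch_build(item)
FROM closed
GROUP BY 1;
```

If the raw events are still needed for other purposes, the DELETE can be replaced by a plain SELECT over the closed hour, at the cost of tracking which hours have already been aggregated.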