apache / datasketches-postgresql

PostgreSQL extension providing approximate algorithms based on apache/datasketches-cpp
https://datasketches.apache.org
Apache License 2.0
85 stars 11 forks source link

Question: How to safely INSERT kll sketch #71

Closed mcgrawia closed 1 month ago

mcgrawia commented 6 months ago

Hi team,

Thanks for the great extension. I'm attempting to use datasketches-python to pre-aggregate data and then insert the sketches into Postgres. I was curious, is there a recommended way to insert the serialized sketches? I noticed I can insert them as base64 encoded strings, but it doesn't appear that there is any validation on the contents of the string on INSERT, so if I accidentally insert a bad string, say "", I end up with bad data in the table that breaks queries.

Here's an example of inserting a bad sketch string. Is there function or better way to insert that would prevent the bad data from entering the table? Thanks

create table kll_sketch_test(sketch kll_float_sketch);
insert into kll_sketch_test(sketch) VALUES ('');
select kll_float_sketch_get_n(sketch) from kll_sketch_test;  -- this query breaks because of the bad sketch value

I noticed there is a kll_float_sketch_in() function, but it doesn't seem to help:

insert into kll_sketch_test(sketch) VALUES (kll_float_sketch_in(''));

Thanks for any help or pointers you can provide.

AlexanderSaydakov commented 6 months ago

I don't quite understand the concern. Just don't insert bad data.

mcgrawia commented 6 months ago

Hi @AlexanderSaydakov, the reason is that the Python sketches are being computed and uploaded to Postgres over a REST API. As part of API best practices, we are attempting to validate the data that clients are sending. I was expecting Postgres to validate data formats on INSERT, similar to other types like integer/float/varchar, which return an error if you attempt to insert invalid data for those formats. Is there no such validation for the kll_sketch types in this extension?

AlexanderSaydakov commented 6 months ago

There is no validation. Sketches are binary blobs (bytea type). In and out functions handle Base64 decoding and encoding. Perhaps you could set up a trigger before insert to check? What type of checking would you do? Do you have a malicious intent in mind or corruption in transit? The trigger could do some basic check that the binary can be deserialized (say, call get_n() function). However, there is no checksum or anything like that. If corruption is possible, it can lead to incorrect results. We assumed that checking integrity of transport in outside of the scope of the library.

mcgrawia commented 6 months ago

Thanks @AlexanderSaydakov for the tips. We are not worried about corruption in transit, but potentially users uploading incorrect sketch representations to the API. We have client side functionality for computing the sketch, but users may try to upload their own, so we want to prevent users from accidentally uploading incorrect string representations of a sketch that could lead to breaking queries. Since users have direct access to the API, we need to validate any data they send. We can't be sure our SDK produced the sketch string representation, or that it is a valid representation. For now, we're parsing the sketch in our application before inserting into the database, but I think the trigger could work as well.