Closed mcgrawia closed 1 month ago
I don't quite understand the concern. Just don't insert bad data.
Hi @AlexanderSaydakov, the reason is that the Python sketches are being computed and uploaded to Postgres over a REST API. As part of API best practices, we are attempting to validate the data that clients are sending. I was expecting Postgres to validate data formats on INSERT, similar to other types like integer/float/varchar, which return an error if you attempt to insert invalid data for those formats. Is there no such validation for the kll_sketch
types in this extension?
There is no validation. Sketches are binary blobs (bytea type). In and out functions handle Base64 decoding and encoding. Perhaps you could set up a trigger before insert to check? What type of checking would you do? Do you have a malicious intent in mind or corruption in transit? The trigger could do some basic check that the binary can be deserialized (say, call get_n() function). However, there is no checksum or anything like that. If corruption is possible, it can lead to incorrect results. We assumed that checking integrity of transport in outside of the scope of the library.
Thanks @AlexanderSaydakov for the tips. We are not worried about corruption in transit, but potentially users uploading incorrect sketch representations to the API. We have client side functionality for computing the sketch, but users may try to upload their own, so we want to prevent users from accidentally uploading incorrect string representations of a sketch that could lead to breaking queries. Since users have direct access to the API, we need to validate any data they send. We can't be sure our SDK produced the sketch string representation, or that it is a valid representation. For now, we're parsing the sketch in our application before inserting into the database, but I think the trigger could work as well.
Hi team,
Thanks for the great extension. I'm attempting to use datasketches-python to pre-aggregate data and then insert the sketches into Postgres. I was curious, is there a recommended way to insert the serialized sketches? I noticed I can insert them as base64 encoded strings, but it doesn't appear that there is any validation on the contents of the string on INSERT, so if I accidentally insert a bad string, say "", I end up with bad data in the table that breaks queries.
Here's an example of inserting a bad sketch string. Is there function or better way to insert that would prevent the bad data from entering the table? Thanks
I noticed there is a
kll_float_sketch_in()
function, but it doesn't seem to help:Thanks for any help or pointers you can provide.