Closed xuefeng-xu closed 10 months ago
In this case, if you want the object types to be preserved, you'll need to write your own serde that is able to recognize and encode/decode the specific type. It obviously won't be as compact as having the same type everywhere: You'll need some sort of indicator of what the type is for each item.
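To make the size cost concrete, here is a minimal sketch using only the standard `struct` module. The one-byte tags (`q` for int, `d` for float) are an assumption chosen for illustration, not something the library defines.

```python
import struct

# Homogeneous items: every value is a 64-bit int, so no per-item tag is needed.
ints = [1, 2, 3]
untagged = struct.pack('<3q', *ints)

# Mixed items: each value carries a 1-byte type tag in front of its 8-byte payload.
# The tag letters are an illustrative assumption ('q' = int, 'd' = float).
mixed = [1, 2.5, 3]
tagged = b''.join(
    (b'q' + struct.pack('<q', v)) if type(v) is int else (b'd' + struct.pack('<d', v))
    for v in mixed
)

print(len(untagged))  # 24 bytes: 3 items * 8 bytes
print(len(tagged))    # 27 bytes: 3 items * (1 tag byte + 8 bytes)
```

The overhead is one byte per item here, and it grows if the tag also has to carry a length, as it does for strings.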
Thanks! One more question, just to make sure - do I need to edit this file below? https://github.com/apache/datasketches-python/blob/main/datasketches/PySerDe.py
You shouldn't need to edit anything. Just define your own serde object that implements the methods and pass it in when you call serialize().
From the comment in that file: Each implementation must extend the PyObjectSerDe class and define three methods: get_size(item), to_bytes(item), and from_bytes(data, offset).
And there's an example of using the serde in the tests for the fi sketch:
fi_bytes = fi.serialize(PyIntsSerDe())
self.assertEqual(len(fi_bytes), fi.get_serialized_size_bytes(PyIntsSerDe()))
new_fi = frequent_items_sketch.deserialize(fi_bytes, PyIntsSerDe())
You can define something like this and use it as the serde instead of a built-in one:
import struct
from datasketches import PyObjectSerDe

class MixedItemSerDe(PyObjectSerDe):
    def get_size(self, item):
        if isinstance(item, str):
            # type tag (1 byte), length (4 bytes), UTF-8 encoded string
            return 5 + len(item.encode('utf-8'))
        else:
            # type tag (1 byte), value (8 bytes, whether int or float)
            return 9

    def to_bytes(self, item):
        b = bytearray()
        item_type = type(item).__name__
        match item_type:
            case 'str':
                # encode first so the stored length is in bytes, not characters
                encoded = item.encode('utf-8')
                b.extend(b's')
                b.extend(len(encoded).to_bytes(4, 'little'))
                b.extend(encoded)
            case 'int':
                b.extend(b'q')
                b.extend(struct.pack('<q', item))
            case 'float':
                b.extend(b'd')
                b.extend(struct.pack('<d', item))
            case _:
                raise TypeError(f'Only str, int, and float are supported. Found {item_type}')
        return bytes(b)

    def from_bytes(self, data: bytes, offset: int):
        # returns a tuple of (item, number of bytes read)
        item_type = chr(data[offset])
        match item_type:
            case 's':
                # the length field is the 4 bytes right after the type tag
                num_bytes = int.from_bytes(data[offset+1:offset+5], 'little')
                if offset + 5 + num_bytes > len(data):
                    raise IndexError(f'String length read must fit within the buffer. Found {num_bytes}')
                val = data[offset+5:offset+5+num_bytes].decode('utf-8')
                return (val, 5+num_bytes)
            case 'q':
                val = struct.unpack_from('<q', data, offset+1)[0]
                return (val, 9)
            case 'd':
                val = struct.unpack_from('<d', data, offset+1)[0]
                return (val, 9)
            case _:
                raise ValueError('Unknown item type found')
Note that I did NOT do extensive testing on this. It's also not particularly space-efficient if you know more about the lengths of things (stores string length as an int, stores all numbers with 8 bytes, etc.). But it's an example of how you can easily define a custom serde to solve this scenario.
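Since the serde above was not extensively tested, a quick round-trip check is one way to gain some confidence in the byte layout. The sketch below is standalone and does not import datasketches: `encode_item` and `decode_item` are hypothetical helpers that mirror the same tag + payload format, with strings stored as UTF-8 bytes (an assumption made here so that non-ASCII text survives the round trip).

```python
import struct

def encode_item(item):
    """Hypothetical helper: 1-byte type tag followed by the payload."""
    if type(item) is str:
        raw = item.encode('utf-8')
        return b's' + len(raw).to_bytes(4, 'little') + raw
    if type(item) is int:
        return b'q' + struct.pack('<q', item)
    if type(item) is float:
        return b'd' + struct.pack('<d', item)
    raise TypeError(f'Only str, int, and float are supported. Found {type(item).__name__}')

def decode_item(data, offset):
    """Hypothetical helper: return (value, bytes_consumed) for the item at offset."""
    tag = data[offset:offset + 1]
    if tag == b's':
        n = int.from_bytes(data[offset + 1:offset + 5], 'little')
        if offset + 5 + n > len(data):
            raise IndexError('string length exceeds the buffer')
        return data[offset + 5:offset + 5 + n].decode('utf-8'), 5 + n
    if tag == b'q':
        return struct.unpack_from('<q', data, offset + 1)[0], 9
    if tag == b'd':
        return struct.unpack_from('<d', data, offset + 1)[0], 9
    raise ValueError('Unknown item type found')

# Encode a mixed list back to back, then walk the buffer item by item.
items = [42, -7, 3.14, 'hello', 'héllo', '']
buf = b''.join(encode_item(i) for i in items)

decoded, offset = [], 0
while offset < len(buf):
    value, consumed = decode_item(buf, offset)
    decoded.append(value)
    offset += consumed

assert decoded == items
```

Including an empty string and a non-ASCII string in the test items is deliberate: those are the cases where a character count and a byte count disagree, or where an off-by-one in the length field tends to show up.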
Thanks, it really helps!
For example, I have a list that contains int, float, and string data. I can get the frequent items after updating the sketch, but how do I serialize it?