apache / datasketches-cpp

Core C++ Sketch Library
https://datasketches.apache.org
Apache License 2.0

How to serialize `frequent_items_sketch` with mixed data types? #405

Closed xuefeng-xu closed 10 months ago

xuefeng-xu commented 11 months ago

For example, I have a list that contains int, float, and string data. After updating the sketch I can get the frequent items, but how do I serialize it?

from datasketches import (
    frequent_items_sketch,
    frequent_items_error_type,
    PyFloatsSerDe,
    PyIntsSerDe,
    PyStringsSerDe,
    PyLongsSerDe,
    PyDoublesSerDe,
)

fi = frequent_items_sketch(3)
X = [1, 3.4, 'a']
W = [20, 30, 25]
for x, w in zip(X, W):
    fi.update(x, w)
print(fi.get_frequent_items(frequent_items_error_type.NO_FALSE_POSITIVES))
# [(3.4, 30, 30, 30), ('a', 25, 25, 25), (1, 20, 20, 20)]

fi_bytes = fi.serialize()
# how to serialize the sketch?

jmalkin commented 11 months ago

In this case, if you want the object types to be preserved, you'll need to write your own serde that is able to recognize and encode/decode the specific type. It obviously won't be as compact as having the same type everywhere: You'll need some sort of indicator of what the type is for each item.
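To make the size cost concrete, here is a minimal sketch using only the standard library (it is not part of datasketches). It compares a bare 8-byte integer with the same value prefixed by a one-byte type indicator. The tag bytes `q` and `d` are an arbitrary choice for illustration; any unambiguous marker per type would do.

```python
import struct

# Hypothetical one-byte type tags, chosen here only for illustration.
TAG_INT = b'q'
TAG_FLOAT = b'd'

# Homogeneous case: every item is an int, so only the value is stored.
plain = struct.pack('<q', 20)

# Mixed case: each item carries a tag so the reader knows how to decode it.
tagged_int = TAG_INT + struct.pack('<q', 20)
tagged_float = TAG_FLOAT + struct.pack('<d', 3.4)

print(len(plain), len(tagged_int), len(tagged_float))
```

With fixed 8-byte values the tag adds one byte per item; variable-length items such as strings also need a length field.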

xuefeng-xu commented 11 months ago

Thanks! One more question, just to make sure - do I need to edit this file below? https://github.com/apache/datasketches-python/blob/main/datasketches/PySerDe.py

jmalkin commented 11 months ago

You shouldn't need to edit anything. Just define your own serde object that implements the methods and pass it in when you call serialize().

From the comment in that file: Each implementation must extend the PyObjectSerDe class and define three methods: get_size(item), to_bytes(item), and from_bytes(data, offset).

And there's an example of using the serde in the tests for the fi sketch:

fi_bytes = fi.serialize(PyIntsSerDe())
self.assertEqual(len(fi_bytes), fi.get_serialized_size_bytes(PyIntsSerDe()))
new_fi = frequent_items_sketch.deserialize(fi_bytes, PyIntsSerDe())

jmalkin commented 11 months ago

You can define something like this and use it as the serde instead of a built-in one:

import struct
from datasketches import PyObjectSerDe

class MixedItemSerDe(PyObjectSerDe):
  def get_size(self, item):
    if (type(item).__name__ == 'str'):
      # type (1 char), length (4 bytes), UTF-8 encoded string
      return int(5 + len(item.encode('utf-8')))
    else:
      # type (1 char), value (8 bytes, whether int or float)
      return int(9)

  def to_bytes(self, item):
    b = bytearray()
    item_type = type(item).__name__
    match item_type:
      case 'str':
        # encode once so the stored length counts bytes, not characters
        encoded = item.encode('utf-8')
        b.extend(b's')
        b.extend(len(encoded).to_bytes(4, 'little'))
        b.extend(encoded)
      case 'int':
        b.extend(b'q')
        b.extend(struct.pack('<q', item))
      case 'float':
        b.extend(b'd')
        b.extend(struct.pack('<d', item))
      case _:
        raise Exception(f'Only str, int, and float are supported. Found {item_type}')
    return bytes(b)

  def from_bytes(self, data: bytes, offset: int):
    item_type = chr(data[offset])
    match item_type:
      case 's':
        # the length field occupies the 4 bytes after the type tag
        num_chars = int.from_bytes(data[offset+1:offset+5], 'little')
        if (num_chars < 0 or offset + 5 + num_chars > len(data)):
          raise IndexError(f'num_chars read must be non-negative and not larger than the buffer. Found {num_chars}')
        val = data[offset+5:offset+5+num_chars].decode()
        return (val, 5+num_chars)
      case 'q':
        val = struct.unpack_from('<q', data, offset+1)[0]
        return (val, 9)
      case 'd':
        val = struct.unpack_from('<d', data, offset+1)[0]
        return (val, 9)
      case _:
        raise Exception('Unknown item type found')

Note that I did NOT do extensive testing on this. It's also not particularly space-efficient if you know more about the lengths of things (stores string length as an int, stores all numbers with 8 bytes, etc.). But it's an example of how you can easily define a custom serde to solve this scenario.
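Since the serde above was not extensively tested, a quick round-trip check can help before using it with a sketch. The following is a rough sketch that runs without datasketches installed: `TaggedSerDeStandIn` and `round_trip` are hypothetical names, and the stand-in class only mirrors the three methods rather than extending PyObjectSerDe. It assumes items are laid out back to back in one buffer and read by offset. How the library drives these methods internally is not shown in this thread.

```python
import struct

class TaggedSerDeStandIn:
    """Stand-in with the same three methods; a real serde would extend PyObjectSerDe."""

    def get_size(self, item):
        if isinstance(item, str):
            return 5 + len(item.encode('utf-8'))  # tag + 4-byte length + payload
        return 9                                  # tag + 8-byte value

    def to_bytes(self, item):
        if isinstance(item, str):
            raw = item.encode('utf-8')
            return b's' + len(raw).to_bytes(4, 'little') + raw
        if isinstance(item, int):
            return b'q' + struct.pack('<q', item)
        if isinstance(item, float):
            return b'd' + struct.pack('<d', item)
        raise TypeError(f'unsupported type: {type(item).__name__}')

    def from_bytes(self, data, offset):
        tag = data[offset:offset + 1]
        if tag == b's':
            n = int.from_bytes(data[offset + 1:offset + 5], 'little')
            if offset + 5 + n > len(data):
                raise IndexError('string length runs past the end of the buffer')
            return data[offset + 5:offset + 5 + n].decode('utf-8'), 5 + n
        if tag == b'q':
            return struct.unpack_from('<q', data, offset + 1)[0], 9
        if tag == b'd':
            return struct.unpack_from('<d', data, offset + 1)[0], 9
        raise ValueError('unknown item type tag')

def round_trip(serde, items):
    """Serialize items back to back, then decode them again by walking offsets."""
    buf = b''.join(serde.to_bytes(x) for x in items)
    # get_size must agree with what to_bytes actually produced
    assert len(buf) == sum(serde.get_size(x) for x in items)
    out, offset = [], 0
    while offset < len(buf):
        item, used = serde.from_bytes(buf, offset)
        out.append(item)
        offset += used
    return out

items = [1, 3.4, 'a', 'héllo']
restored = round_trip(TaggedSerDeStandIn(), items)
print(restored == items)
```

Including a non-ASCII string such as 'héllo' is deliberate: it catches any mismatch between character counts and encoded byte counts.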

xuefeng-xu commented 10 months ago

Thanks, it really helps!