danielgtaylor / python-betterproto

Clean, modern, Python 3.6+ code generator & library for Protobuf 3 and async gRPC
MIT License
1.5k stars 211 forks

Improve (de-)serialization performance for scalar arrays #515

Closed 124C41p closed 9 months ago

124C41p commented 1 year ago

One of my personal use cases for betterproto is calling SciPy functions from other languages, which do not have such nice math libraries. That is, I have a Python gRPC service which essentially receives large float arrays, does math with them, and sends large result arrays in return. Unfortunately, serializing and deserializing (numpy) arrays with betterproto does not seem to be as efficient as it could be.

When a float array is serialized to the protobuf wire format, the packed payload happens to be in exactly the right byte layout to be interpreted as a numpy array, as the following example shows:

import betterproto
import numpy as np
from dataclasses import dataclass
from typing import List

@dataclass
class Array(betterproto.Message):
    values: List[float] = betterproto.double_field(1)

proto_array = Array(values=[1.23, 2.34, 3.45, 4.56])
serialized_array = bytes(proto_array)
# Skip the 2-byte field header (tag + payload length); the remainder is
# raw little-endian float64 data.
np_array = np.frombuffer(serialized_array[2:])

print(np_array)

However, when deserializing the protobuf message with betterproto, the array is converted into a Python list right away. If I need a numpy array, I have no choice but to convert it back and forth (same for serialization) which is computationally expensive.
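For context, the two bytes stripped in the example above are the field tag and the payload length, as defined by the protobuf encoding spec. A struct-only sketch of mine (no betterproto involved) that reconstructs the same layout by hand:

```python
import struct

# Protobuf wire format: field number 1 with wire type 2 (length-delimited)
# gives the tag byte (1 << 3) | 2 == 0x0A, followed by the payload length
# as a varint, then the raw little-endian float64 values.
values = [1.23, 2.34, 3.45, 4.56]
payload = struct.pack(f"<{len(values)}d", *values)
# The length (32 bytes) fits in a single varint byte here.
message = bytes([0x0A, len(payload)]) + payload

# The 2-byte header is exactly what `serialized_array[2:]` strips.
assert message[:2] == b"\x0a\x20"  # tag 0x0A, length 32 (4 doubles x 8 bytes)
assert message[2:] == payload
```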

I have two ideas for solving that issue:

Idea 1: Instead of storing protobuf scalar arrays as Python lists, you could store their byte representation inside a slim wrapper which behaves like a list, but which can also be converted into a numpy array without effort:

import struct

class Float64Array:
    __data: bytes

    def __init__(self, data: bytes):
        self.__data = data

    def __len__(self):
        return len(self.__data) // 8  # 8 bytes per float64

    def __getitem__(self, i):
        return struct.unpack_from("<d", self.__data, i * 8)[0]

    def to_numpy_array(self):
        import numpy as np
        return np.frombuffer(self.__data)

Idea 2: You could introduce an optional protoc compiler flag for letting the caller decide whether scalar arrays should be stored as Python lists or as numpy arrays.

Gobot1234 commented 1 year ago

Seems related to #309

124C41p commented 1 year ago

For benchmarking, I serialized and deserialized a random numpy array of length one million using this snippet:

import betterproto
import numpy as np
from dataclasses import dataclass
from typing import List
from time import time

@dataclass
class Array(betterproto.Message):
    values: List[float] = betterproto.double_field(1)

def serialize(ar):
    return bytes(Array(values=ar.tolist()))

def deserialize(bs):
    return np.array(Array().parse(bs).values)

def benchmark(fun, n, *args):
    # Average wall-clock time over n runs, reported in milliseconds.
    t = time()
    for _ in range(n):
        fun(*args)
    t = (time() - t) * 1000 / n
    print(f'{t:.0f}ms')

def bench_ser(n, l):
    ar = np.random.random(l)
    benchmark(serialize, n, ar)

def bench_des(n, l):
    bs = serialize(np.random.random(l))
    benchmark(deserialize, n, bs)

bench_ser(10, 1_000_000)
bench_des(10, 1_000_000)

On my notebook, serializing takes 500ms on average, and deserializing takes 1000ms. On my Raspberry Pi 4, serializing takes 3000ms, deserializing takes 7500ms.

I think these computation times can be almost completely avoided, since the packed wire representation of a protobuf repeated double field is byte-for-byte identical to the in-memory buffer of a numpy float64 array.

124C41p commented 1 year ago

I just realized that conversion between Python lists and numpy arrays isn't actually the bottleneck: after dropping everything numpy-related from the benchmark snippet above, the program still reports roughly the same times. The expensive part is the conversion between protobuf repeated (scalar) fields and Python lists. So even if you would like to stay with Python lists for data storage, there is huge potential for optimizing the (de-)serialization algorithms.

Have you considered using a native extension module for (de-)serialization? I understand that part of betterproto's charm comes from the fact that it is written purely in Python. But maybe there could be a compromise, such as a customizable (maybe plugin-based?) (de-)serialization algorithm?

For example, in my personal use case mentioned above I would really benefit from a simple way of swapping out the part of the algorithm where repeated double fields are (de-)serialized.

124C41p commented 9 months ago

This is resolved by #545