SeldonIO / MLServer

An inference server for your machine learning models, including support for multiple frameworks, multi-model serving and more
https://mlserver.readthedocs.io/en/latest/
Apache License 2.0

Possible room for latency improvement #913

Open saeid93 opened 1 year ago

saeid93 commented 1 year ago

As far as I understand, the codec-friendly way of sending image/audio files in Seldon is to send them as NumPy arrays. Following the community Slack discussion-1 and discussion-2, I ran a benchmark for audio and image datatypes, and it looks like latency could potentially be improved by adding an interface for sending byte-encoded images directly over gRPC. Currently, to send raw bytes I do a bit of hardcoding. On the client side I do the following:

    from mlserver import types

    # self.data is a NumPy array (e.g. an image) and self.data_shape is its
    # original shape; dtype/datashape are custom parameters used to rebuild
    # the array on the server side.
    payload = types.InferenceRequest(
        inputs=[
            types.RequestInput(
                name="parameters-np",
                shape=[1],
                datatype="BYTES",
                data=[self.data.tobytes()],
                parameters=types.Parameters(
                    dtype='u1', datashape=str(self.data_shape)
                ),
            )
        ]
    )

And on the server side I do the following:

    import numpy as np

    def decode_from_bin(inp, shape, dtype):
        # Wrap the raw payload in a zero-copy view, then reinterpret it as
        # an array with the dtype/shape shipped in the request parameters.
        buff = memoryview(inp)
        im_array = np.frombuffer(buff, dtype=dtype).reshape(shape)
        return im_array
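
For illustration, a round trip through the two snippets above could look like this (a minimal sketch; the shape and dtype values here are arbitrary):

    import numpy as np

    # Client side: serialise a uint8 tensor to raw bytes.
    data = np.random.randint(0, 255, size=(2, 2, 3), dtype=np.uint8)
    raw = data.tobytes()

    # Server side: rebuild the array from the raw buffer plus the
    # dtype/shape shipped as request parameters.
    restored = decode_from_bin(raw, shape=(2, 2, 3), dtype='u1')
    assert np.array_equal(data, restored)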

This approach showed the best performance among all the combinations discussed here for image and audio datatypes, and could potentially be added natively to the MLServer + Seldon stack.

adriangonz commented 1 year ago

(copying this from our discussion in Slack)

Hey @saeid93 ,

That’s a great point.

It sounds very similar to something we added recently to pack tensors together. This mainly stems from something that Triton does to optimise gRPC performance. There are a few more details in this issue:

https://github.com/SeldonIO/MLServer/issues/48

The gist of it is that, instead of populating each input's data field separately, we pack the data of each input (or output) into a single bytes blob, which then gets added as a top-level field of the protobuf (this is to work around an obscure gRPC performance issue).
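
As a rough sketch of that idea (assuming the generated KServe v2 dataplane stubs; the exact module path below is an assumption, not necessarily MLServer's real layout), the client fills the top-level raw contents field instead of the per-tensor contents:

    import numpy as np

    # Assumption: generated stubs for the KServe v2 dataplane proto are
    # importable from this path.
    from mlserver.grpc.dataplane_pb2 import ModelInferRequest

    def build_raw_request(model_name: str, array: np.ndarray) -> ModelInferRequest:
        request = ModelInferRequest(model_name=model_name)
        tensor = request.inputs.add()
        tensor.name = "input-0"
        tensor.datatype = "UINT8"
        tensor.shape.extend(array.shape)
        # Instead of filling tensor.contents element by element, ship the
        # whole buffer through the top-level raw field (one blob per input),
        # avoiding the per-element protobuf overhead mentioned above.
        request.raw_input_contents.append(np.ascontiguousarray(array).tobytes())
        return request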

You can see most of the logic in https://github.com/SeldonIO/MLServer/blob/master/mlserver/raw.py. For tensors, it's actually very similar to what numpy does with tobytes / frombuffer, except that it's implemented without numpy.
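
For a flavour of what tobytes / frombuffer without numpy can look like, here is a hypothetical pure-Python equivalent for FP32 tensors using the standard-library struct module (a sketch only; the real raw.py handles the full set of datatypes):

    import struct

    def pack_fp32(values):
        # Roughly np.asarray(values, dtype="<f4").tobytes(): little-endian
        # 32-bit floats laid out contiguously.
        return struct.pack(f"<{len(values)}f", *values)

    def unpack_fp32(blob):
        # Roughly np.frombuffer(blob, dtype="<f4").tolist().
        count = len(blob) // 4
        return list(struct.unpack(f"<{count}f", blob))

    assert unpack_fp32(pack_fp32([1.0, 2.5, -3.0])) == [1.0, 2.5, -3.0]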