Possible room for latency improvement

SeldonIO / MLServer

An inference server for your machine learning models, including support for multiple frameworks, multi-model serving and more

Apache License 2.0

728 stars 183 forks source link

payload = types.InferenceRequest( inputs=[ types.RequestInput( name="parameters-np", shape=[1], datatype="BYTES", data=[self.data.tobytes()], parameters=types.Parameters( dtype='u1', datashape=str(self.data_shape)), ) ] )

(copying this from our discussion in Slack)

Hey @saeid93 ,

That’s a great point.

It sounds very similar to something we added recently to pack tensors together. This mainly stems from something that Triton does to optimise gRPC performance. There’re a bit more details in this issue:

https://github.com/SeldonIO/MLServer/issues/48

The gist of it is that instead of populating each data field separately, we pack together the data field of each input (or output) into a single bytes blob, which then gets added at a top-level field of the protobuf (this is to work around an obscure gRPC performance issue).

You can see most of the logic in the https://github.com/SeldonIO/MLServer/blob/master/mlserver/raw.py file. For tensors, it’s actually very similar to what numpy does with tobytes / frombuffer - with the exception that it’s implemented without numpy.

SeldonIO / MLServer

Possible room for latency improvement #913