SeldonIO / MLServer

An inference server for your machine learning models, including support for multiple frameworks, multi-model serving and more
https://mlserver.readthedocs.io/en/latest/
Apache License 2.0

gRPC fails with inferred f16 numpy array #1522

Open sauerburger opened 10 months ago

sauerburger commented 10 months ago

I think I discovered a bug in the current gRPC code in mlserver. I have a model that returns float16 arrays, and I tried to get predictions via gRPC. I was able to narrow the issue down to the following example, without any client-server complexity.

Reproduce error

import numpy as np
from mlserver.codecs.decorator import SignatureCodec
import mlserver.grpc.converters as converters

def a() -> np.ndarray:
    # Model-like function whose signature advertises a float16 output
    return np.array([[1.123, 4], [1, 3], [1, 2]], dtype=np.float16)

codec = SignatureCodec(a)
r = codec.encode_response(payload=a(), model_name="x")
# Converting the response to its gRPC representation raises
converters.ModelInferResponseConverter.from_types(r)

The last line yields

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.12/site-packages/mlserver/grpc/converters.py", line 380, in from_types
    InferOutputTensorConverter.from_types(output)
  File "/usr/local/lib/python3.12/site-packages/mlserver/grpc/converters.py", line 425, in from_types
    contents=InferTensorContentsConverter.from_types(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/mlserver/grpc/converters.py", line 335, in from_types
    return pb.InferTensorContents(**contents)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: expected bytes, float found

Root cause

I think the root cause is in the gRPC type-to-field mapping:

_FIELDS = {
    ...
    "FP16": "bytes_contents",
    "FP32": "fp32_contents",
    "FP64": "fp64_contents",
    "BYTES": "bytes_contents",
}

The mapping sends FP16 tensors to bytes_contents in the data plane. The data plane doesn't even offer an fp16_contents field that could be used for this purpose. (Is that because protobuf doesn't natively support fp16?)
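To illustrate the mismatch, here is a minimal sketch (assuming the generated protos are importable as mlserver.grpc.dataplane_pb2, which I take to be the pb module from the traceback): bytes_contents is a repeated bytes field, but for FP16 the converter hands it a flat list of Python floats.

import numpy as np
import mlserver.grpc.dataplane_pb2 as pb  # assumed path of the generated protos

values = np.array([1.123, 4, 1, 3, 1, 2], dtype=np.float16)

# What the FP16 -> bytes_contents mapping effectively does today:
# Python floats are passed where protobuf expects bytes
try:
    pb.InferTensorContents(bytes_contents=values.tolist())
except TypeError as err:
    print(err)

# A bytes payload would be accepted by the field, but that is not
# what the converter builds
pb.InferTensorContents(bytes_contents=[values.tobytes()])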

Potential fix

I think the fp32_contents field should be used in the gRPC type-to-field mapping in this case, although this wastes half of the bandwidth.
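In the meantime, a workaround on the model side seems to be casting the payload to float32 before it is encoded, so that the existing FP32 mapping to fp32_contents is used. This is the same repro as above, just with the cast added:

import numpy as np
from mlserver.codecs.decorator import SignatureCodec
import mlserver.grpc.converters as converters

def a() -> np.ndarray:
    # Cast to float32 so the response is encoded as FP32 and routed
    # to fp32_contents instead of bytes_contents
    return np.array([[1.123, 4], [1, 3], [1, 2]], dtype=np.float16).astype(np.float32)

codec = SignatureCodec(a)
r = codec.encode_response(payload=a(), model_name="x")
converters.ModelInferResponseConverter.from_types(r)  # no longer raises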

sauerburger commented 10 months ago

I just discovered the following in the Open Inference Protocol:

message ModelInferResponse
{
  // ...

  // The output tensors holding inference results.
  repeated InferOutputTensor outputs = 5;

  // The data contained in an output tensor can be represented in
  // "raw" bytes form or in the repeated type that matches the
  // tensor's data type. To use the raw representation 'raw_output_contents'
  // must be initialized with data for each tensor in the same order as
  // 'outputs'. For each tensor, the size of this content must match
  // what is expected by the tensor's shape and data type. The raw
  // data must be the flattened, one-dimensional, row-major order of
  // the tensor elements without any stride or padding between the
  // elements. Note that the FP16 and BF16 data types must be represented as
  // raw content as there is no specific data type for a 16-bit float type.
  //
  // If this field is specified then InferOutputTensor::contents must
  // not be specified for any output tensor.
  repeated bytes raw_output_contents = 6;

So, 16-bit floats should actually go into raw_output_contents. I'm not sure why this didn't work in my case.
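For reference, this is roughly what the raw representation described above would look like for a float16 tensor. This is only a sketch of the protocol's raw form (flattened, row-major bytes plus a datatype="FP16" declaration on the matching InferOutputTensor), not of mlserver's actual code path:

import numpy as np

fp16_output = np.array([[1.123, 4], [1, 3], [1, 2]], dtype=np.float16)

# Server side: the flattened, row-major bytes of the tensor would go into
# raw_output_contents, with the corresponding InferOutputTensor declaring
# datatype="FP16" and the original shape
raw = fp16_output.tobytes()

# Client side: rebuild the array from the raw bytes using the declared
# datatype and shape
decoded = np.frombuffer(raw, dtype=np.float16).reshape(fp16_output.shape)
assert np.array_equal(decoded, fp16_output)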