sauerburger opened this issue 10 months ago
I just discovered the following in the Open Inference Protocol:
```protobuf
message ModelInferResponse
{
  // ...

  // The output tensors holding inference results.
  repeated InferOutputTensor outputs = 5;

  // The data contained in an output tensor can be represented in
  // "raw" bytes form or in the repeated type that matches the
  // tensor's data type. To use the raw representation 'raw_output_contents'
  // must be initialized with data for each tensor in the same order as
  // 'outputs'. For each tensor, the size of this content must match
  // what is expected by the tensor's shape and data type. The raw
  // data must be the flattened, one-dimensional, row-major order of
  // the tensor elements without any stride or padding between the
  // elements. Note that the FP16 and BF16 data types must be represented as
  // raw content as there is no specific data type for a 16-bit float type.
  //
  // If this field is specified then InferOutputTensor::contents must
  // not be specified for any output tensor.
  repeated bytes raw_output_contents = 6;
}
```
So 16-bit floats should actually go into raw_output_contents. I'm not sure why this didn't work in my case.
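For illustration, following that comment, a float16 result can be packed into raw_output_contents like this (a minimal sketch; I'm assuming the generated dataplane stubs can be imported as mlserver.grpc.dataplane_pb2):

```python
import numpy as np
# Assumed location of the stubs generated from the dataplane proto.
from mlserver.grpc import dataplane_pb2 as pb

fp16_result = np.array([[0.5, 1.5], [2.5, 3.5]], dtype=np.float16)

response = pb.ModelInferResponse(model_name="my-model")
response.outputs.add(name="output-0", datatype="FP16", shape=fp16_result.shape)

# Flattened, one-dimensional, row-major bytes without stride or padding,
# exactly as the comment above requires for FP16.
response.raw_output_contents.append(np.ascontiguousarray(fp16_result).tobytes())

# The receiver reconstructs the array from the declared datatype and shape.
decoded = np.frombuffer(response.raw_output_contents[0], dtype=np.float16)
decoded = decoded.reshape(tuple(response.outputs[0].shape))
assert np.array_equal(decoded, fp16_result)
```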
I think I discovered a bug in the current gRPC code in MLServer. I have a model that returns float16 arrays, and I tried to get predictions via gRPC. I was able to narrow the issue down to the example below, without any client-server complexity.
Reproduce error
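A minimal sketch of the round trip that triggers it, converting an MLServer InferenceResponse holding a float16 array into the gRPC ModelInferResponse (the NumpyCodec.encode_output and ModelInferResponseConverter.from_types entry points named here are my assumption about MLServer's API and may need adjusting):

```python
import numpy as np
from mlserver.codecs import NumpyCodec
from mlserver.types import InferenceResponse
# Assumed entry point for the internal type-to-protobuf conversion.
from mlserver.grpc.converters import ModelInferResponseConverter

fp16_result = np.array([0.5, 1.5, 2.5], dtype=np.float16)

response = InferenceResponse(
    model_name="my-model",
    outputs=[NumpyCodec.encode_output("output-0", fp16_result)],
)

# Converting to the gRPC representation goes through the FP16 branch of the
# type-to-field mapping discussed under "Root cause".
grpc_response = ModelInferResponseConverter.from_types(response)
```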
The last line yields an error.
Root cause
I think the root cause is in the gRPC type-to-field mapping: the code uses bytes in the dataplane for FP16 inputs. The dataplane doesn't even offer an fp16_contents field that could be used for this purpose. (Is that because protobuf doesn't support fp16 by default?)
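For reference, InferTensorContents in the dataplane proto only defines the following typed fields, and none of them covers a 16-bit float:

```protobuf
message InferTensorContents
{
  repeated bool bool_contents = 1;
  repeated int32 int_contents = 2;
  repeated int64 int64_contents = 3;
  repeated uint32 uint_contents = 4;
  repeated uint64 uint64_contents = 5;
  repeated float fp32_contents = 6;
  repeated double fp64_contents = 7;
  repeated bytes bytes_contents = 8;
}
```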
Potential fix
I think in this case, fp32_contents should be used in the gRPC type-to-field mapping, although this wastes half of the bandwidth.
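As a sketch of what that mapping change would mean on the wire (again assuming the generated stubs at mlserver.grpc.dataplane_pb2): the float16 values are upcast to float32 before filling fp32_contents, and the client casts back based on the declared FP16 datatype.

```python
import numpy as np
# Assumed location of the stubs generated from the dataplane proto.
from mlserver.grpc import dataplane_pb2 as pb

fp16_result = np.array([[0.5, 1.5], [2.5, 3.5]], dtype=np.float16)

tensor = pb.ModelInferResponse.InferOutputTensor(
    name="output-0", datatype="FP16", shape=fp16_result.shape
)
# Each float16 value travels as a 4-byte float32, so half of the payload is
# effectively padding compared to the raw_output_contents representation.
tensor.contents.fp32_contents.extend(
    fp16_result.astype(np.float32).flatten().tolist()
)

# The receiving side casts back to float16 based on the declared datatype.
decoded = np.array(tensor.contents.fp32_contents, dtype=np.float16)
decoded = decoded.reshape(tuple(tensor.shape))
assert np.array_equal(decoded, fp16_result)
```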