SeldonIO / MLServer

An inference server for your machine learning models, including support for multiple frameworks, multi-model serving and more
https://mlserver.readthedocs.io/en/latest/
Apache License 2.0
685 stars 177 forks source link

Fix Pandas codec decoding from numpy arrays #1751

Closed lhnwrk closed 3 months ago

lhnwrk commented 3 months ago

Pandas codec currently breaks mlflow runtime's schema enforcement (#1625). This is because numeric mlflow datatypes are assigned NumpyCodec content type by default, which decodes PandasCodec's explicit batch size into a column vector:

input = RequestInput(
    name="foo",
    shape=[3, 1],
    data=[1,2,3],
    datatype="INT64",
)
NumpyCodec.decode_input(input)  # np.array([[1], [2], [3]])

However, pd.Series expects one-dimensional data, and when the _decoded_payload is converted to a pd.Series here returns unexpected results:

# Expect pd.Series([1,2,3]), got pd.Series([(1,), (2,), (3,)])
payload = np.array([[1], [2], [3]])
payload = list(payload)
pd.Series(payload, dtype="int64")

This seems like a bug from pandas side, dtype="int64" should have thrown an error (and in fact it does for higher dimension, but for some reason is okay with [np.array([1]), np.array([2]), np.array([3])] and weirdly casts them into tuples), but either way mlserver should have dropped the trailing axis as pd.Series already implies column vector.

From this test, it seems like the conversion to list in the codec was to accommodate the use case of pd.Series of tensors:

# Expect decoded pd.Series([np.array([1,2,3])]) in test
RequestInput(
    name="a",
    data=[1, 2, 3],
    datatype="FP32",
    shape=[1, 3],
    parameters=Parameters(_decoded_payload=np.array([[1, 2, 3]])),
)

I'd argue in the context of Pandas codec this should be made explicit with shape [1, 1] and datatype BYTES for array elements. In fact, a test above does so, specifying the intended shape is a [2,1] column vector with datatype BYTES for list elements.

# Expect encoded ResponseOutput(name="bar", shape=[2, 1], data=[[1, 2, 3], [4, 5, 6]], datatype="BYTES") in test
pd.Series(data=[[1, 2, 3], [4, 5, 6]], name="bar")

This PR explicitly reshapes np.array payload to expected shapes before conversion to pd.Series.

jesse-c commented 3 months ago

We've gotten that fix in. @lhnwrk: could you please rebase this branch on top of master?

lhnwrk commented 3 months ago

@jesse-c Rebased on master!

jesse-c commented 3 months ago

@lhnwrk: Thank you! I've had to re-approve it and run the checks, so they're going now. As soon as it passes, I'll merge it in.

jesse-c commented 3 months ago

Merged in! Thank you again, @lhnwrk. šŸ™šŸ»