bentoml / BentoML


bug: ValueError in PandasDataFrame.validate_dataframe when input DataFrame has a different shape than implied by from_sample #3263

Closed mqk closed 1 year ago

mqk commented 1 year ago

Describe the bug

I have an endpoint that takes a pandas DataFrame as input, and I have provided example data for it using PandasDataFrame.from_sample. I don't want to require the input data to have the exact shape of the example DataFrame, so I've kept the default of enforce_shape=False. However, when I make a request with an input DataFrame whose shape differs from the example data, I get a runtime ValueError in the validate_dataframe method. Here's a traceback (in this case the example DataFrame has 150 columns, but I'm passing in a larger DataFrame with 157 columns):

2022-11-19T19:54:42.391928524Z Traceback (most recent call last):
2022-11-19T19:54:42.391939288Z   File "/usr/local/lib/python3.8/site-packages/bentoml/_internal/server/http_app.py", line 311, in api_func
2022-11-19T19:54:42.391945763Z     input_data = await api.input.from_http_request(request)
2022-11-19T19:54:42.391951951Z   File "/usr/local/lib/python3.8/site-packages/bentoml/_internal/io_descriptors/pandas.py", line 533, in from_http_request
2022-11-19T19:54:42.391957398Z     return self.validate_dataframe(res)
2022-11-19T19:54:42.391962779Z   File "/usr/local/lib/python3.8/site-packages/bentoml/_internal/io_descriptors/pandas.py", line 604, in validate_dataframe
2022-11-19T19:54:42.391968054Z     dataframe.columns = pd.Index(self._columns)
2022-11-19T19:54:42.391973045Z   File "/usr/local/lib/python3.8/site-packages/pandas/core/generic.py", line 5588, in __setattr__
2022-11-19T19:54:42.391978615Z     return object.__setattr__(self, name, value)
2022-11-19T19:54:42.391985015Z   File "pandas/_libs/properties.pyx", line 70, in pandas._libs.properties.AxisProperty.__set__
2022-11-19T19:54:42.391990375Z   File "/usr/local/lib/python3.8/site-packages/pandas/core/generic.py", line 769, in _set_axis
2022-11-19T19:54:42.391995405Z     self._mgr.set_axis(axis, labels)
2022-11-19T19:54:42.392000723Z   File "/usr/local/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 214, in set_axis
2022-11-19T19:54:42.392005859Z     self._validate_set_axis(axis, new_labels)
2022-11-19T19:54:42.392010577Z   File "/usr/local/lib/python3.8/site-packages/pandas/core/internals/base.py", line 69, in _validate_set_axis
2022-11-19T19:54:42.392015882Z     raise ValueError(
2022-11-19T19:54:42.392020663Z ValueError: Length mismatch: Expected axis has 157 elements, new values have 150 elements

The root of the problem is here: https://github.com/bentoml/BentoML/blob/8774eb1f101e06a319919c45182a37d1e15070c0/src/bentoml/_internal/io_descriptors/pandas.py#L604

That line assumes that the input dataframe has the same number of columns as the example data (self._columns).
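
The length mismatch can be reproduced with pandas alone, without a running BentoML service. The following is a minimal sketch rather than the library's code: the names `sample_columns` and `incoming` and the `col_…` labels are made up for illustration, and only the column counts (150 and 157) come from the traceback above.

```python
import pandas as pd

# Hypothetical stand-ins: column labels captured from a 150-column sample,
# and an incoming frame with 157 columns (the sizes from the traceback).
sample_columns = [f"col_{i}" for i in range(150)]
incoming = pd.DataFrame([[0] * 157])

try:
    # Same kind of assignment as the line the traceback points to:
    # replacing the column labels with an index of a different length.
    incoming.columns = pd.Index(sample_columns)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

With the pandas version listed in the environment below, the printed message should match the "Length mismatch" line at the end of the traceback.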

I would recommend one of the following: not assigning dataframe.columns here at all (so just removing this line), making the assignment only if len(dataframe.columns) == len(self._columns), or making it only if self._enforce_shape is True.
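
As a rough illustration of the last option, the assignment could be guarded by the shape-enforcement flag. This is a sketch under stated assumptions, not BentoML's actual implementation: `apply_sample_columns` is a hypothetical standalone helper, its parameters stand in for `self._columns` and `self._enforce_shape`, and the explicit length check with its error message is an added assumption.

```python
from typing import Optional, Sequence

import pandas as pd


def apply_sample_columns(
    dataframe: pd.DataFrame,
    sample_columns: Optional[Sequence[str]],
    enforce_shape: bool,
) -> pd.DataFrame:
    """Hypothetical helper: relabel columns only when shape is enforced."""
    # Without enforcement (or without a sample), leave the input untouched.
    if sample_columns is None or not enforce_shape:
        return dataframe
    # Assumption: with enforcement on, report a mismatch explicitly instead
    # of letting the column assignment fail inside pandas.
    if len(dataframe.columns) != len(sample_columns):
        raise ValueError(
            f"Expected {len(sample_columns)} columns, "
            f"got {len(dataframe.columns)}"
        )
    relabeled = dataframe.copy()
    relabeled.columns = pd.Index(sample_columns)
    return relabeled


sample_columns = [f"col_{i}" for i in range(150)]
wide = pd.DataFrame([[0] * 157])

# With enforce_shape=False the 157-column frame passes through unchanged.
result = apply_sample_columns(wide, sample_columns, enforce_shape=False)
print(result.shape)
```

With enforcement off, the wider frame is returned as-is; with enforcement on, a frame with the expected number of columns is relabeled and any other width is rejected with a descriptive error.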

To reproduce

No response

Expected behavior

No response

Environment

Environment variable

BENTOML_DEBUG=''
BENTOML_QUIET=''
BENTOML_BUNDLE_LOCAL_BUILD=''
BENTOML_DO_NOT_TRACK=''
BENTOML_CONFIG=''
BENTOML_CONFIG_OPTIONS=''
BENTOML_PORT=''
BENTOML_HOST=''
BENTOML_API_WORKERS=''

System information

bentoml: 1.0.10
python: 3.8.14
platform: macOS-12.6-x86_64-i386-64bit
uid_gid: 502:20

pip_packages
```
aiofiles==22.1.0
aiohttp==3.8.1
aiosignal==1.2.0
alexandria==0.0.31
anyio==3.6.1
appdirs==1.4.4
appnope==0.1.3
asgiref==3.5.2
asttokens==2.0.8
async-timeout==4.0.2
attrs==22.1.0
awscli==1.25.81
backcall==0.2.0
bentoml==1.0.10
bleach==5.0.1
botocore==1.27.80
build==0.8.0
cachetools==5.2.0
carthage==0.0.10
cattrs==22.2.0
certifi==2022.9.24
cfgv==3.3.1
chardet==5.0.0
charset-normalizer==2.1.1
circus==0.17.1
click==8.1.3
click-log==0.4.0
cloudpickle==2.2.0
colorama==0.4.4
commonmark==0.9.1
contextlib2==21.6.0
decisioning-schemas==1.1.42
decorator==5.1.1
deepmerge==1.0.1
Deprecated==1.2.13
distlib==0.3.6
docutils==0.16
dotty-dict==1.3.1
exceptiongroup==1.0.0rc9
executing==1.1.0
filelock==3.8.0
frozenlist==1.3.1
fs==2.4.16
fsspec==2021.10.0
gcsfs==2021.10.0
gitdb==4.0.9
GitPython==3.1.27
google-auth==2.12.0
google-auth-oauthlib==0.5.3
h11==0.14.0
identify==2.5.6
idna==3.4
importlib-metadata==5.0.0
iniconfig==1.1.1
invoke==1.7.3
ipython==8.4.0
jaraco.classes==3.2.3
jedi==0.18.1
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.2.0
keyring==23.9.3
lightgbm==3.0.0
MarkupSafe==2.1.1
matplotlib-inline==0.1.6
missionlane-versioning==0.7.8
more-itertools==8.14.0
multidict==6.0.2
nodeenv==1.7.0
numpy==1.23.3
oauthlib==3.2.1
opentelemetry-api==1.13.0
opentelemetry-instrumentation==0.34b0
opentelemetry-instrumentation-aiohttp-client==0.34b0
opentelemetry-instrumentation-asgi==0.34b0
opentelemetry-sdk==1.13.0
opentelemetry-semantic-conventions==0.34b0
opentelemetry-util-http==0.34b0
packaging==21.3
pandas==1.4.3
parso==0.8.3
pathspec==0.10.1
pep517==0.13.0
pexpect==4.8.0
pickleshare==0.7.5
pip-requirements-parser==31.2.0
pip-tools==6.8.0
pkginfo==1.8.3
platformdirs==2.5.2
pluggy==1.0.0
pre-commit==2.20.0
prometheus-client==0.13.1
prompt-toolkit==3.0.31
protobuf==3.9.2
psutil==5.9.2
ptyprocess==0.7.0
pure-eval==0.2.2
py==1.11.0
pyarrow==8.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
Pygments==2.13.0
pynvml==11.4.1
pyparsing==3.0.9
pytest==7.1.3
python-dateutil==2.8.2
python-dotenv==0.21.0
python-gitlab==2.10.1
python-json-logger==2.0.4
python-multipart==0.0.5
python-semantic-release==7.19.2
pytz==2022.4
PyYAML==5.4.1
pyzmq==24.0.1
readme-renderer==37.2
requests==2.28.1
requests-oauthlib==1.3.1
requests-toolbelt==0.9.1
rfc3986==2.0.0
rich==12.6.0
rsa==4.7.2
s3transfer==0.6.0
schema==0.7.5
scikit-learn==0.23.2
scipy==1.9.1
semver==2.13.0
simple-di==0.1.5
six==1.16.0
smmap==5.0.0
sniffio==1.3.0
stack-data==0.5.1
starlette==0.21.0
threadpoolctl==3.1.0
toml==0.10.2
tomli==2.0.1
tomlkit==0.7.0
tornado==6.2
tqdm==4.64.1
traitlets==5.4.0
twine==3.8.0
typing_extensions==4.3.0
urllib3==1.26.12
uvicorn==0.18.3
virtualenv==20.16.5
watchfiles==0.17.0
wcwidth==0.2.5
webencodings==0.5.1
wrapt==1.14.1
yarl==1.8.1
zipp==3.8.1
```

mqk commented 1 year ago

I'm happy to submit a PR myself, but I would need some guidance on the preferred solution (one of the three I suggested, or something else altogether).

aarnphm commented 1 year ago

Hi @mqk, I think your proposal to make the assignment only when enforce_shape is set makes sense. If possible, could you open a PR for that?