Netflix / metaflow

Open Source Platform for developing, scaling and deploying serious ML, AI, and data science systems
https://metaflow.org
Apache License 2.0

`metaflow_environment` dependencies can override or conflict with those set by the Batch docker image, breaking user code #906

Open ryan-williams opened 2 years ago

ryan-williams commented 2 years ago

Pasting the README from runsascoded/mf-pip-issue, where I have some repro files as well:

Metaflow/pip/Batch issue

Metaflow runs pip install awscli … boto3 while setting up task environments in Batch, which can break aiobotocore<2.1.0.

Repro

Docker image runsascoded/mf-pip-issue-batch (batch.dockerfile) pins recent versions of botocore and aiobotocore:
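The actual pins live in the repro repo; the following is only a sketch of the relevant lines, with the botocore/aiobotocore versions taken from the `pip list` output in the "Simpler example" section below (the base image and the s3fs/pandas lines are assumptions):

```dockerfile
# Sketch of batch.dockerfile's pins (not the actual file; see runsascoded/mf-pip-issue).
# botocore/aiobotocore versions come from the `pip list` output below; the base image
# and the s3fs/pandas lines are assumed.
FROM python:3.9
RUN pip install \
    botocore==1.20.106 \
    aiobotocore==1.4.2 \
    s3fs \
    pandas
```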

Local mode: ✅

They work fine together normally; runsascoded/mf-pip-issue-local (local.dockerfile) runs s3_flow_test.py successfully (in "local" mode):

docker run -it --rm runsascoded/mf-pip-issue-local
# Metaflow 2.4.8 executing S3FlowTest for user:user
# …
# 2022-01-16 21:21:59.162 Done!
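The flow itself isn't reproduced in this issue; here is a minimal sketch of what an S3-reading flow like s3_flow_test.py might look like (hypothetical: the bucket/key is a placeholder, and the real flow lives in the repro repo):

```python
# s3_flow_test.py (sketch, not the actual repro file)
from metaflow import FlowSpec, step


class S3FlowTest(FlowSpec):
    @step
    def start(self):
        import pandas as pd
        # Reading from S3 via pandas goes through s3fs -> aiobotocore -> botocore,
        # which is the dependency chain the Batch bootstrap ends up breaking.
        self.df = pd.read_csv("s3://<bucket>/<key>.csv")  # placeholder path
        self.next(self.end)

    @step
    def end(self):
        print(self.df.shape)


if __name__ == "__main__":
    S3FlowTest()
```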

Batch mode: ❌

However, with a Metaflow Batch queue configured:

python s3_flow_test.py run --with batch:image=runsascoded/mf-pip-issue-batch

fails with:

AttributeError: 'AioClientCreator' object has no attribute '_register_lazy_block_unknown_fips_pseudo_regions'

due to a version mismatch (botocore>=1.23.0, aiobotocore<2.1.0).

Version mismatch

botocore removed ClientCreator._register_lazy_block_unknown_fips_pseudo_regions in 1.23.0, and aiobotocore only updated to botocore>=1.23.0 in 2.1.0, so aiobotocore<2.1.0 requires botocore<1.23.0, otherwise reading from S3 via Pandas will raise this error.
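A quick way to see the removal directly (a sketch; assumes a throwaway environment with network access):

```bash
# botocore < 1.23 still has the private hook that aiobotocore < 2.1.0 relies on:
pip install -q 'botocore<1.23' && python -c "
from botocore.client import ClientCreator
print(hasattr(ClientCreator, '_register_lazy_block_unknown_fips_pseudo_regions'))"
# True

# botocore >= 1.23 removed it, which is what surfaces as the AttributeError above:
pip install -q 'botocore>=1.23' && python -c "
from botocore.client import ClientCreator
print(hasattr(ClientCreator, '_register_lazy_block_unknown_fips_pseudo_regions'))"
# False
```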

Cause

The version mismatch is caused by Metaflow running pip install awscli … boto3 while setting up the task environment (in Batch, and I believe also on Kubernetes). If awscli and boto3 aren't both installed already, it will pick a recent version to install, see that that version requires a recent botocore, and upgrade botocore to >=1.23.0 while aiobotocore is still <2.1.0, breaking Pandas→S3 reading.

Simpler example

Here we see pip install awscli break aiobotocore<2.1.0 directly (in the same image as above):

docker run --rm --entrypoint bash runsascoded/mf-pip-issue-batch -c '
  echo "Before \`pip install awscli\`:" && \
  pip list | grep botocore && \
  pip install awscli -qqq && \
  echo -e "----\nAfter \`pip install awscli\`:" && \
  pip list | grep botocore
' 2>/dev/null 
# Before `pip install awscli`:
# aiobotocore        1.4.2     # ✅
# botocore           1.20.106  # ✅
# ----
# After `pip install awscli`:
# aiobotocore        1.4.2     # ✅
# botocore           1.23.37   # ❌

Here, pip install awscli upgraded botocore to a version that's incompatible with the already-installed aiobotocore.

Workaround

The simplest workaround I've found is to ensure that Metaflow's pip install awscli click requests boto3 command is a no-op, by having some version of each of those libraries already installed in the image. They should also have consistent transitive dependency versions, otherwise pip install will "help" with those as well.
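Concretely, something like the following in the image should do it (a sketch, not tested here; the botocore<1.23 constraint matches the aiobotocore 1.4.2 pin above, and pip is left to resolve awscli/boto3 versions that satisfy it):

```dockerfile
# Pre-install everything Metaflow's bootstrap wants, constrained so botocore stays
# compatible with the aiobotocore already in the image; Metaflow's later
# `pip install awscli click requests boto3` then has nothing left to change.
RUN pip install 'botocore<1.23' awscli click requests boto3
```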

Scratch

These seem like the minimal Metaflow configs to submit to Batch (and reproduce the issue):

{
  "METAFLOW_BATCH_JOB_QUEUE": "arn:aws:batch:…",
  "METAFLOW_ECS_S3_ACCESS_IAM_ROLE": "arn:aws:iam::…",
  "METAFLOW_DEFAULT_DATASTORE": "s3",
  "METAFLOW_DATASTORE_SYSROOT_S3": "s3://<bucket>/metaflow",
  "METAFLOW_DATATOOLS_SYSROOT_S3": "s3://<bucket>/data"
}
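(These go in Metaflow's JSON config, by default ~/.metaflowconfig/config.json, or can equivalently be exported as environment variables of the same names; the batch-config.json filename below is just a placeholder.)

```bash
# assuming the default METAFLOW_HOME (~/.metaflowconfig)
mkdir -p ~/.metaflowconfig
cp batch-config.json ~/.metaflowconfig/config.json  # batch-config.json: the JSON above
```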

Docker build commands:

docker build -f batch.dockerfile -t runsascoded/mf-pip-issue-batch .
docker build -f local.dockerfile -t runsascoded/mf-pip-issue-local .
savingoyal commented 2 years ago

@ryan-williams The pip install awscli ... should be a no-op for any of the libraries that are already present in the image.

ryan-williams commented 2 years ago

Yes, but if e.g. awscli isn't already installed, installing it can change (and break) the versions of other packages that are already installed. The "Simpler example" section above illustrates this most directly.

ryan-williams commented 2 years ago

To be clear, it's possible for the following to happen: a user builds an image whose boto/aiobotocore/s3fs versions are mutually compatible, verifies their flow locally, and then has it break on Batch because Metaflow's pip install awscli … boto3 silently upgrades botocore out from under aiobotocore.

I don't know what the solution should be, but it is surprising and undesirable behavior, and it's enabled by a breaking change in botocore (1.23.0, released last November) that I suspect we'll see wash around the ecosystem for some time to come, so it's good to be aware of this specific interaction with Metaflow's step-environment setup logic.

ryan-williams commented 1 year ago

Ran into this again today. Here's an updated link to the offending line, in 2.8.2.

Here's a simple repro:

1. User installs boto/s3fs/pandas, successfully reads CSV from S3

# mf1.dockerfile
FROM python:3.9
WORKDIR /root
RUN pip install \
    boto3==1.24.59 \
    botocore==1.27.59 \
    aiobotocore==2.4.2 \
    s3fs==2023.1.0 \
    pandas
# ✅ works fine, reads publicly-accessible CSV from S3. boto/s3fs/pandas versions are mutually compatible.
ENTRYPOINT [ "python", "-c", "import pandas as pd; print(pd.read_csv('s3://ctbk/csvs/JC-202301-citibike-tripdata.csv'))" ]
docker build -tmf1 -fmf1.dockerfile .
docker run --rm -it mf1
✅ works fine, prints DataFrame:

```
                ride_id  rideable_type  ...    end_lng member_casual
0      0905B18B365C9D20   classic_bike  ... -74.044247        member
1      B4F0562B05CB5404  electric_bike  ... -74.041664        member
2      5ABF032895F5D87E   classic_bike  ... -74.042521        member
3      E7E1F9C53976D2F9   classic_bike  ... -74.044247        member
4      323165780CA0734B   classic_bike  ... -74.042884        member
...                 ...            ...  ...        ...           ...
56070  17CD2F4ABD4F6785   classic_bike  ... -74.050389        member
56071  D75D12846E6838D0  electric_bike  ... -74.050389        member
56072  36387397177CAA80  electric_bike  ... -74.050389        member
56073  B66278F45420CFA0   classic_bike  ... -74.030305        member
56074  230153A8D1F2D5F7   classic_bike  ... -74.030305        member

[56075 rows x 13 columns]
```

2. Metaflow runs pip install awscli boto3, breaking aiobotocore/s3fs/pandas

# mf2.dockerfile
FROM mf1
RUN pip install awscli boto3  # 💥 this breaks the user's installs; `pd.read_csv("s3://…")` no longer works

Test image:

docker build -tmf2 -fmf2.dockerfile .
docker run --rm -it mf2
pd.read_csv raises PermissionError: Forbidden:

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/s3fs/core.py", line 112, in _error_wrapper
    return await func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/aiobotocore/client.py", line 358, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 577, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1407, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1661, in _make_engine
    self.handles = get_handle(
  File "/usr/local/lib/python3.9/site-packages/pandas/io/common.py", line 716, in get_handle
    ioargs = _get_filepath_or_buffer(
  File "/usr/local/lib/python3.9/site-packages/pandas/io/common.py", line 425, in _get_filepath_or_buffer
    file_obj = fsspec.open(
  File "/usr/local/lib/python3.9/site-packages/fsspec/core.py", line 134, in open
    return self.__enter__()
  File "/usr/local/lib/python3.9/site-packages/fsspec/core.py", line 102, in __enter__
    f = self.fs.open(self.path, mode=mode)
  File "/usr/local/lib/python3.9/site-packages/fsspec/spec.py", line 1135, in open
    f = self._open(
  File "/usr/local/lib/python3.9/site-packages/s3fs/core.py", line 649, in _open
    return S3File(
  File "/usr/local/lib/python3.9/site-packages/s3fs/core.py", line 2024, in __init__
    super().__init__(
  File "/usr/local/lib/python3.9/site-packages/fsspec/spec.py", line 1491, in __init__
    self.size = self.details["size"]
  File "/usr/local/lib/python3.9/site-packages/fsspec/spec.py", line 1504, in details
    self._details = self.fs.info(self.path)
  File "/usr/local/lib/python3.9/site-packages/fsspec/asyn.py", line 114, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/fsspec/asyn.py", line 99, in sync
    raise return_result
  File "/usr/local/lib/python3.9/site-packages/fsspec/asyn.py", line 54, in _runner
    result[0] = await coro
  File "/usr/local/lib/python3.9/site-packages/s3fs/core.py", line 1238, in _info
    out = await self._call_s3(
  File "/usr/local/lib/python3.9/site-packages/s3fs/core.py", line 339, in _call_s3
    return await _error_wrapper(
  File "/usr/local/lib/python3.9/site-packages/s3fs/core.py", line 139, in _error_wrapper
    raise err
PermissionError: Forbidden
```

pip install awscli boto3 explicitly logs an ERROR about breaking aiobotocore:

docker run --rm -it --entrypoint pip mf1 install awscli boto3
# …
# ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
# aiobotocore 2.4.2 requires botocore<1.27.60,>=1.27.59, but you have botocore 1.29.110 which is incompatible.
# Successfully installed PyYAML-5.4.1 awscli-1.27.110 boto3-1.26.110 botocore-1.29.110 colorama-0.4.4 docutils-0.16 pyasn1-0.4.8 rsa-4.7.2
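Relatedly, pip check in the broken image flags the same conflict (a sketch; exact wording varies across pip versions):

```bash
docker run --rm -it --entrypoint pip mf2 check
# aiobotocore 2.4.2 has requirement botocore<1.27.60,>=1.27.59, but you have botocore 1.29.110.
```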

The simplest workaround remains to make sure awscli and boto3 are both installed in any image you pass to Metaflow's Batch mode, but Metaflow could/should do something more careful/correct here.
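One way to spell that out for the newer stack (a sketch using a pip constraints file; mf-fix.dockerfile is a hypothetical name, and the botocore bound comes from the aiobotocore 2.4.2 error above):

```dockerfile
# mf-fix.dockerfile (hypothetical): pre-install awscli/boto3 at versions whose botocore
# stays inside aiobotocore 2.4.2's supported range, so Metaflow's later
# `pip install awscli boto3` is a no-op.
FROM mf1
RUN echo 'botocore<1.27.60' > /tmp/constraints.txt && \
    pip install -c /tmp/constraints.txt awscli boto3
```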

saikonen commented 2 months ago

Two features related to this have recently been released in #1972. We have gotten rid of the awscli dependency completely, so there is less possibility for dependency conflicts.

For other use cases that require completely disabling the dependency installs, setting the METAFLOW_SKIP_INSTALL_DEPENDENCIES environment variable in the execution environment will do this. When using this, the execution environment needs to have the required bootstrapping dependencies available out of the box.
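For example, something like this in the image should opt out of the bootstrap installs entirely (a sketch; the variable name comes from the comment above, the value and placement are assumptions, and the image must already ship everything the bootstrap would otherwise install):

```dockerfile
# In the image passed via `--with batch:image=...` (value assumed to be truthy/set)
ENV METAFLOW_SKIP_INSTALL_DEPENDENCIES=1
```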

can this issue be considered closed with the latest changes?