boto / boto3

AWS SDK for Python
https://aws.amazon.com/sdk-for-python/
Apache License 2.0

Boto3 incompatible with python zip import #1770

Open ADH-LukeBollam opened 5 years ago

ADH-LukeBollam commented 5 years ago

One of Python's useful features is its ability to load modules from a .zip archive (PEP here), allowing you to package up multiple dependencies into a single file.

Boto3 breaks when you try to import it from a .zip, throwing:

  File "C:\code sandbox\boto.zip\boto3\session.py", line 263, in client
  File "C:\code sandbox\boto.zip\botocore\session.py", line 799, in create_client
  File "C:\code sandbox\boto.zip\botocore\session.py", line 668, in _get_internal_component
  File "C:\code sandbox\boto.zip\botocore\session.py", line 870, in get_component
  File "C:\code sandbox\boto.zip\botocore\session.py", line 150, in create_default_resolver
  File "C:\code sandbox\boto.zip\botocore\loaders.py", line 132, in _wrapper
  File "C:\code sandbox\boto.zip\botocore\loaders.py", line 424, in load_data
botocore.exceptions.DataNotFoundError: Unable to load data for: endpoints

How to Reproduce:

  1. Create a .zip containing boto3 and botocore (one way to build such an archive is sketched after the environment details below)
  2. Create a .py file in the same directory as the zip (access keys removed for obvious reasons):

    import sys

    sys.path.insert(0, 'boto.zip')
    import boto3

    s3 = boto3.client('s3', aws_access_key_id='access_key', aws_secret_access_key='secret_key')


3. Run

Tested on Python 3.6.7
boto3 1.9.39
botocore 1.12.39
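
Regarding step 1, one way to build such an archive, assuming boto3 and botocore were first installed into a local packages/ directory (for example with pip install boto3 -t packages), is roughly:

    import shutil

    # Bundle the installed boto3/botocore tree into boto.zip,
    # with boto3/ and botocore/ at the root of the archive.
    shutil.make_archive('boto', 'zip', root_dir='packages')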

joguSD commented 5 years ago

Confirmed. Our data loaders can't handle being run from a zip. Specifically, we try to search for the data in the following directory:

'.../botocore.zip/botocore/data'

Which fails our isdir check and is thus skipped.

Marking this as a feature request.
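
To spell out why that check fails: the loader joins its search paths into a filesystem path and tests it with os.path.isdir, and a path that runs through a zip archive is never a real directory, so the location gets skipped even though the archive contains the data. A minimal illustration (the path is just an example):

    import os

    # False: the filesystem API cannot see directories inside a zip archive,
    # even though botocore/data/ exists inside botocore.zip.
    os.path.isdir('botocore.zip/botocore/data')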

ADH-LukeBollam commented 5 years ago

What are the odds of getting this implemented? It's preventing us from distributing boto3, which makes it very hard to provide a package that depends on it in PySpark.

gliptak commented 5 years ago

https://stackoverflow.com/a/22646702 has a snippet processing a zip.
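
For illustration (not necessarily what that answer does), a data loader could fall back to the zipfile module when its path points inside an archive; a rough sketch, with the path handling purely illustrative:

    import json
    import os
    import zipfile

    def load_json(path):
        """Load a JSON file from a real directory, or from inside a .zip on the path."""
        if os.path.isfile(path):
            with open(path) as f:
                return json.load(f)
        # Otherwise assume the path looks like '<archive>.zip/<member>' and
        # read the member straight out of the archive.
        archive, _, member = path.partition('.zip/')
        with zipfile.ZipFile(archive + '.zip') as zf:
            return json.loads(zf.read(member))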

krish5989 commented 4 years ago

Is this issue resolved? I'm also stuck loading boto3 from a .zip file when using the --py-files option in spark2-submit. I'd appreciate any help to overcome this situation.

philboltt commented 4 years ago

pytz has a similar issue reading timezone data in the zoneinfo folder from a packaged directory. To get around this it uses pkg_resources.resource_stream from setuptools - https://github.com/stub42/pytz/blob/7b1a844c8ecf2996142ac0eb32201b676e9dcb9a/src/pytz/__init__.py#L101

https://setuptools.readthedocs.io/en/latest/pkg_resources.html

It adds setuptools as a dependency when distributing as a zip, but at least it works. Would be great to have a fix for this. Workarounds are needlessly ugly.
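
For illustration, the pytz-style approach applied here would look roughly like this (the data file name is just an example); because pkg_resources goes through the package's loader rather than the filesystem, the read works whether botocore sits in a normal directory or inside a zip:

    import json
    import pkg_resources

    # Read a bundled data file via the package loader instead of joining
    # filesystem paths; this also works when botocore is imported from a zip.
    with pkg_resources.resource_stream('botocore', 'data/endpoints.json') as f:
        endpoints = json.load(f)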

gliptak commented 4 years ago

Is https://github.com/boto/boto3/issues/1008 also a duplicate?

gliptak commented 4 years ago

I submitted PR https://github.com/boto/botocore/pull/1969 Could a committer review?

shadowdsp commented 4 years ago

@gliptak Hello, I'm encountering this problem now. Can we reopen https://github.com/boto/botocore/pull/1969 and fix it?

gliptak commented 4 years ago

@shadowdsp we need a committer's help on that repo to move forward

wolfch-elsevier commented 3 years ago

Hi @gliptak - I tried your PR as a patch to botocore, zipped up the patched boto3/botocore into s3cip_deps.zip, and submitted it to Amazon EMR (PySpark) via:

 spark-submit --deploy-mode cluster --py-files s3://data-nonprod/emr_demo/s3cip_deps.zip s3://data-nonprod/emr_demo/s3cip.py

and got:

   File "./s3cip_deps.zip/botocore/loaders.py", line 421, in load_data
    for possible_path in self._potential_locations(name):
  File "./s3cip_deps.zip/botocore/loaders.py", line 436, in _potential_locations
    path = pkg_resources.resource_filename(path1, 'data')
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1226, in resource_filename
    self, resource_name
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1722, in get_resource_filename
    "resource_filename() only supported for .egg, not .zip"
NotImplementedError: resource_filename() only supported for .egg, not .zip

gliptak commented 3 years ago

@wolfch-elsevier you might try removing that folder from the Python search path, or this has a pointer:

https://stackoverflow.com/questions/25872134/cxfreeze-error-resource-filename-only-supported-for-egg-not-zipp
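
For what it's worth, the sticking point in that traceback is that resource_filename must hand back a real path on disk, which pkg_resources only supports for .egg archives, while the string/stream variants read the bytes straight through the package loader. A minimal sketch of the difference (the data file name is illustrative):

    import pkg_resources

    # Works from a zip import: the bytes come straight from the package loader.
    data = pkg_resources.resource_string('botocore', 'data/endpoints.json')

    # Raises NotImplementedError when botocore is imported from a plain .zip,
    # because there is no real file on disk to point at.
    path = pkg_resources.resource_filename('botocore', 'data/endpoints.json')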

wolfch-elsevier commented 3 years ago

@wolfch-elsevier you might try removing that folder from the Python search path, or this has a pointer:

https://stackoverflow.com/questions/25872134/cxfreeze-error-resource-filename-only-supported-for-egg-not-zipp

Hi, thanks. No, but this directory, /botocore.zip/botocore/data, is an integral part of the botocore package. I thought your PR was a workaround for this?

fmxleg commented 3 years ago

@wolfch-elsevier I need to fix the same problem. Have you found any solution for that?

wolfch-elsevier commented 3 years ago

@fmxleg Unfortunately, no. I'm really surprised that Amazon hasn't come up with a working example to help promote their EMR ("cloudized" Apache Spark). I mean, their example creates an RDD from a CSV in S3, and that works (so it probably uses the Scala or Java AWS SDK for access). However, I need to read XML files, which are documents, not series of records.

I tried this solution, which uses the spark.yarn.dist.archives config property. It's supposed to unzip the archive when it's pushed to the worker nodes: https://stackoverflow.com/questions/36461054/i-cant-seem-to-get-py-files-on-spark-to-work

It didn't work for me.

dsonavane-rgare commented 3 years ago

Found a workaround for this. You can pass Spark conf args to have Spark unpack the zipped dependencies on each node and put the unpacked directory on the Python path, something like this:

    --conf spark.yarn.dist.archives=s3://<bucket+path>/sparkapp.zip#deps \
    --conf spark.yarn.appMasterEnv.PYTHONPATH=deps \
    --conf spark.executorEnv.PYTHONPATH=deps \

Worked with EMR 6.2.0 and Python 3.7.9

kojiromike commented 3 years ago

I have made a new PR, boto/botocore#2437, to attempt to resubmit boto/botocore#1969

nickolashkraus commented 3 years ago

This is also an issue for SaltStack modules:

Python 2.3 and higher allows developers to directly import Zip archives containing Python code.

Source

Salt execution modules are imported using zipimporter:

mod = zipimporter(fpath).load_module(name)

If one creates a Zip archive containing botocore, the following error occurs when attempting to execute the module:

boto3.exceptions.ResourceNotExistsError: The 'dynamodb' resource does not exist.

This is due to the fact that the botocore loader (botocore/loaders.py) checks the path botocore/data/ for model files. If the path to botocore is a Zip archive, this check fails and botocore fails to load the models (EC2, S3, DynamoDB, etc.).

This renders SaltStack modules that use botocore and are distributed as Zip archives useless.

alete89 commented 2 years ago

Hi! I'm experiencing this error trying to use boto3 from modules inside a zip of dependency files on EMR. I think this is worth a fix.

kojiromike commented 2 years ago

The fix is in boto/botocore#2437, but someone from AWS will have to review, approve and merge it.

alete89 commented 2 years ago

Found a workaround for this. You can pass Spark conf args to have Spark unpack the zipped dependencies on each node and put the unpacked directory on the Python path, something like this:

  --conf spark.yarn.dist.archives=s3://<bucket+path>/sparkapp.zip#deps \
  --conf spark.yarn.appMasterEnv.PYTHONPATH=deps \
  --conf spark.executorEnv.PYTHONPATH=deps \

Worked with EMR 6.2.0 and Python 3.7.9

Hey @dsonavane-rgare, I'm trying this without success. Can you elaborate a bit more? This is how I was submitting my file and deps (this throws a "boto3 not found" error because one of my zipped files uses boto3):

spark-submit --py-files s3://<bucket>/code/spark/dependencies.zip s3://<bucket>/code/spark/job.py args

This is what I've tried now, based on your example:

spark-submit --conf spark.yarn.dist.archives=s3://<bucket>/code/spark/dependencies.zip#deps --conf spark.yarn.appMasterEnv.PYTHONPATH=deps --conf spark.executorEnv.PYTHONPATH=deps s3://<bucket>/code/spark/job.py args

and this as well:

spark-submit --py-files s3://<bucket>/code/spark/dependencies.zip --conf spark.yarn.dist.archives=s3://<bucket>/code/spark/dependencies.zip#deps --conf spark.yarn.appMasterEnv.PYTHONPATH=deps --conf spark.executorEnv.PYTHONPATH=deps s3://<bucket>/code/spark/job.py 2021-12-01

Thanks

MatheusAnciloto commented 6 months ago

Does anyone have a workaround for this?

kojiromike commented 6 months ago

Does anyone have a workaround for this?

The only thing that ever worked for me was to run on systems with boto3 already installed and exposed on the PYTHONPATH.