Closed DLakin01 closed 3 years ago
Hi Daniel, sorry for the late response. The container should be able to take the .zip file. If possible, could you share the contents of your zip file so that we can help you debug?
Regarding support for bootstrap action scripts, we currently don't support this. It will definitely be included in a future release, but we don't have an ETA yet.
Do you have an example of passing a PyPI-downloaded .whl file to one of these jobs? Do all the .whls need to be zipped up? I am trying to install pandas and scipy. I tried the following and was also unable to use them:

```shell
mkdir pkgs && cd pkgs && pip download scipy pandas
```

```python
import glob

files = glob.glob('pkgs/*')
print(files)

spark_processor.run(
    submit_app="processing.py",
    submit_py_files=files,
)
```
I realized the issue with mine is that it wasn't using the .whl files I passed in. I had to include them individually in the submit_py_files, not in a zip file. With scipy, however, there are some pre-built Cython extensions, and it was erroring out trying to use those. This was effectively the issue I had: Link
Sorry for the late response! I was actually able to get around this issue even for libraries with .pyc files like numpy and scipy. I created a separate `spark_add_dependencies.py` file which had only the following code:

```python
import subprocess

subprocess.run(["sudo", "pip", "install", "pandas", "numpy", "pyarrow", "scipy"])
```

I included that file in the `submit_py_files=` argument, and that seemed to do the trick! I believe that file gets executed on each node of the cluster, which installs the proper libraries cluster-wide. I imagine you could use the same approach to install any arbitrary library you might need.
@guoqiaoli1992 Can you confirm if this is a good approach?
@DLakin01 - I tried this but am still failing on the imports. I can see in the logs where that file gets added to each node in the cluster but not where it is executed. Was there something else you had to do, e.g. import that file from your main pyspark file?
I ended up rebuilding the Docker image and adding my additional packages to the requirements.txt.
Note I followed these instructions: https://github.com/aws/sagemaker-spark-container/blob/master/DEVELOPMENT.md, but had to replace `$SPARK_REPOSITORY_NAME` with `$SPARK_REPOSITORY` in Step 2 of "Running SageMaker Tests".
The best solution for running a custom bootstrap script at the moment is to build a custom image that includes your bootstrap script in the container entrypoint (or installs custom dependencies in the Dockerfile).
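A minimal sketch of that custom-image approach. The base image name and `bootstrap.sh` below are placeholders, not values confirmed anywhere in this thread:

```dockerfile
# Hypothetical sketch: extend a SageMaker Spark image with extra
# dependencies and/or a bootstrap step. Placeholders, not real names.
FROM <your-sagemaker-spark-image>

# Option 1: install custom dependencies directly in the Dockerfile.
RUN pip install pandas numpy pyarrow scipy

# Option 2: ship a bootstrap script and chain it in front of the
# image's original entrypoint. bootstrap.sh must exec the original
# entrypoint as its last step, or the container won't start Spark.
COPY bootstrap.sh /opt/bootstrap.sh
RUN chmod +x /opt/bootstrap.sh
ENTRYPOINT ["/opt/bootstrap.sh"]
```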
We are using this feedback to help prioritize the SageMaker roadmap, and the feedback is much appreciated!
If you encounter any other issues, feel free to reopen this issue.
It has been about a year for this issue. Is there any progress?
Bump, would like to know if any progress is being made on this, or on supporting it in a similar fashion to what's described here: https://github.com/aws/sagemaker-python-sdk/issues/1248, which is also an open issue.
Any progress on this? This is limiting PySparkProcessor's usability for almost everyone who attempts to use it!
Any progress on this?
Is there a way to use the SageMaker PySparkProcessor class to execute a bootstrap script when a cluster launches? I'm currently trying to run a processing workload that uses pandas_udfs, and seeing an ImportError when the cluster tries to use PyArrow.
I'm using the latest version of the SageMaker Python SDK, and have tried using the `submit_py_files` parameter in `PySparkProcessor.run()` to submit a .zip file of all my dependencies. However, it doesn't seem to be installing them. I know with EMR you can submit a bootstrap action script to install dependencies. Is there a similar option here? Thanks!