aws / sagemaker-spark-container

The SageMaker Spark Container is a Docker image used to run data processing workloads with the Spark framework on Amazon SageMaker.
Apache License 2.0

Run bootstrap script? #29

Closed DLakin01 closed 3 years ago

DLakin01 commented 4 years ago

Is there a way to use the SageMaker PySparkProcessor class to execute a bootstrap script when a cluster launches? I'm currently trying to run a processing workload that uses pandas_udfs, and I'm seeing an ImportError when the cluster tries to use PyArrow:

```
Traceback (most recent call last):
  File "/opt/ml/processing/input/code/spark_preprocess.py", line 35, in <module>
    @pandas_udf("float", PandasUDFType.GROUPED_AGG)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 47, in _create_udf
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 149, in require_minimum_pyarrow_version
ImportError: PyArrow >= 0.8.0 must be installed; however, it was not found.
```
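For context, line 35 is the standard grouped-agg pandas UDF pattern; a minimal stand-in (with a made-up function body, since the real aggregation logic isn't shown here) would look like this:

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Declaring a grouped-aggregate pandas UDF requires pyarrow to be importable;
# pyspark checks for it at UDF-creation time, which is where the traceback ends.
@pandas_udf("float", PandasUDFType.GROUPED_AGG)
def mean_score(v):
    # v is a pandas Series for each group; return a single float
    return float(v.mean())
```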

I'm using the latest version of the Sagemaker Python SDK, and have tried using the submit_py_files parameter in PySparkProcessor.run() to submit a .zip file of all my dependencies. However, it doesn't seem to be installing them.

I know with EMR you can submit a bootstrap action script to install dependencies - is there a similar option here? Thanks!
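For reference, my invocation looks roughly like this (the role ARN, framework version, script name, and dependency zip are placeholders):

```python
from sagemaker.spark.processing import PySparkProcessor

role = "arn:aws:iam::111122223333:role/MySageMakerRole"  # placeholder execution role

spark_processor = PySparkProcessor(
    base_job_name="spark-preprocess",
    framework_version="2.4",       # placeholder container version
    role=role,
    instance_count=2,
    instance_type="ml.m5.xlarge",
)

spark_processor.run(
    submit_app="spark_preprocess.py",        # main processing script
    submit_py_files=["dependencies.zip"],    # zip of the Python dependencies
)
```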

guoqiao1992 commented 4 years ago

Hi Daniel, sorry for the late response. The container should be able to take the .zip file. If possible, could you share the contents of your zip file so that we can help you debug?

Regarding support for a bootstrap action script: we don't currently support this. It will definitely be included in a future release, but we don't have an ETA yet.

jimmycfa commented 3 years ago

Do you have an example of passing a PyPI-downloaded .whl file to one of these jobs? Do all the .whls need to be zipped up? I am trying to install pandas and scipy. I tried the following and was also unable to use them:

```shell
mkdir pkgs && cd pkgs && pip download scipy pandas
```

```python
import glob

files = glob.glob('pkgs/*')
print(files)

spark_processor.run(
    submit_app="processing.py",
    submit_py_files=files,
)
```

jimmycfa commented 3 years ago

I realized the issue with mine was in how it was using the .whl files I passed in: I had to include them individually in submit_py_files, not in a zip file. With scipy, however, there are some pre-built Cython extensions, and it was erroring out trying to use those. This was effectively the issue I had: Link

DLakin01 commented 3 years ago

Sorry for the late response! I was actually able to get around this issue, even for libraries with compiled extensions like numpy and scipy. I created a separate spark_add_dependencies.py file which had only the following code:

import subprocess

subprocess.run(["sudo", "pip", "install", "pandas", "numpy", "pyarrow", "scipy"])

I included that file in the submit_py_files= argument, and that seemed to do the trick! I believe that file is getting executed on each node of the cluster, which installs the proper libraries cluster-wide. I imagine you could use the same approach to install any arbitrary library you might need.
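For completeness, the run call ended up looking roughly like this (the application script name is a placeholder):

```python
spark_processor.run(
    submit_app="spark_preprocess.py",                # placeholder main script
    submit_py_files=["spark_add_dependencies.py"],   # pip-installs pandas/numpy/pyarrow/scipy on each node
)
```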

@guoqiaoli1992 Can you confirm if this is a good approach?

jimmycfa commented 3 years ago

@DLakin01 - I tried this but am still failing on the imports. I can see in the logs where that file gets added to each node in the cluster but not where it is executed. Was there something else you had to do, e.g. import that file from your main pyspark file?

jimmycfa commented 3 years ago

I ended up rebuilding the Docker image and adding my additional packages to the requirements.txt.

Note: I followed these instructions: https://github.com/aws/sagemaker-spark-container/blob/master/DEVELOPMENT.md

However, I had to replace $SPARK_REPOSITORY_NAME with $SPARK_REPOSITORY in Step 2 of "Running SageMaker Tests".

apacker commented 3 years ago

The best solution for running a custom bootstrap script at the moment is to build a custom image that includes your bootstrap script in the container entrypoint (or installs custom dependencies in the Dockerfile).
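For example, once you have built and pushed a custom image to Amazon ECR, you can point the processor at it; the ECR URI and role ARN below are placeholders:

```python
from sagemaker.spark.processing import PySparkProcessor

spark_processor = PySparkProcessor(
    base_job_name="spark-custom-image",
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/sagemaker-spark-custom:latest",  # placeholder custom image
    role="arn:aws:iam::111122223333:role/MySageMakerRole",  # placeholder execution role
    instance_count=2,
    instance_type="ml.m5.xlarge",
)
```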

We are using this feedback to help prioritize the SageMaker roadmap, and the feedback is much appreciated!

If you encounter any other issues, feel free to reopen this issue.

AllardJM commented 2 years ago

It has been about a year since this issue was opened. Is there any progress?

aabid0193 commented 1 year ago

Bump. I'd like to know if any progress is being made on this, or on supporting it in a similar fashion to what is described here: https://github.com/aws/sagemaker-python-sdk/issues/1248, which is also an open issue.

govind-govind commented 1 year ago

Any progress on this? This limits PySparkProcessor usability for almost everyone who attempts to use it!

csotomon commented 1 year ago

Any progress on this?