aws / sagemaker-spark-container

The SageMaker Spark Container is a Docker image used to run data processing workloads with the Spark framework on Amazon SageMaker.
Apache License 2.0

Usage of pandas UDF requires installation of additional heavy libraries, which should reside on the Docker image #53

Open oren4322 opened 3 years ago

oren4322 commented 3 years ago

When using a pandas UDF, additional libraries need to be installed: "pandas==0.24.2", "requests==2.21.0", "pyarrow==0.15.1", "pytz==2021.1", "six==1.15.0", "python-dateutil==2.8.1", "numpy==1.16.5". Since these libraries are fairly heavy, this almost always means building a new image instead of using the prebuilt one from this repo (or installing them via pip in the container at runtime). Pandas UDFs are the recommended kind of UDF, so including these libraries in the image would be preferable.
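For context, a minimal pandas UDF sketch of the kind this issue is about; it fails on the stock image unless pandas and pyarrow are importable on the driver and every executor (the app name and column names here are illustrative):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()

# Series-to-Series pandas UDF (Spark 3.0 type-hint style); Spark ships the
# data to the Python workers as Arrow batches, hence the pyarrow dependency.
@pandas_udf(DoubleType())
def plus_one(v: pd.Series) -> pd.Series:
    return v + 1.0

df = spark.range(10).withColumn("plus_one", plus_one("id"))
df.show()
```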

shivansh-narayan commented 3 years ago

Hi @oren4322, I also have a use case for pandas inside the Spark container. Instead of building a whole new image, can't I just extend the previously distributed one and install the packages using pip in the new Dockerfile?

Something like:

FROM 759080221371.dkr.ecr.ap-southeast-1.amazonaws.com/sagemaker-spark-processing:3.0-cpu-py37-v1.2
# install the extra packages on top of the prebuilt image
RUN pip install pandas

oren4322 commented 3 years ago

It is possible, but installing these libraries on every run instead of having them baked into the image results in longer execution times. Also, it seems like a common enough requirement that it should be handled in the AWS image.

BrianMiner commented 2 years ago

@narayanshivansh49 Is it as simple as creating a Dockerfile with just what you have above, building it, registering it on ECR, and then using the image URI in, say, the PySparkProcessor class?

govind-govind commented 1 year ago

@BrianMiner Yes, I can confirm it is as simple as that.
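For readers who want the last step spelled out, a minimal sketch on the SageMaker Python SDK side, assuming the extended image has already been pushed to ECR; the image URI, role ARN, and script name below are placeholders, not values from this thread:

```python
from sagemaker.spark.processing import PySparkProcessor

# Point the processor at the extended image instead of the stock one.
processor = PySparkProcessor(
    base_job_name="spark-pandas-udf",
    image_uri="<account-id>.dkr.ecr.<region>.amazonaws.com/my-spark-image:latest",
    role="<sagemaker-execution-role-arn>",
    instance_count=2,
    instance_type="ml.m5.xlarge",
)

# Run a PySpark script that uses pandas UDFs; the extra libraries are
# already on the image, so nothing is installed at job start.
processor.run(submit_app="preprocess.py")
```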

csotomon commented 1 year ago

@BrianMiner or @govind-govind Do you have an example of the Dockerfile?
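For anyone else landing here: one possible Dockerfile, assembled from the snippet and the version list earlier in this thread (the base image URI is the one quoted above; adjust account, region, and tag to the prebuilt image you actually use):

```dockerfile
# Extend the prebuilt SageMaker Spark processing image quoted in this thread.
FROM 759080221371.dkr.ecr.ap-southeast-1.amazonaws.com/sagemaker-spark-processing:3.0-cpu-py37-v1.2

# Pin the pandas UDF dependencies listed in the original report.
RUN pip install \
    "pandas==0.24.2" \
    "requests==2.21.0" \
    "pyarrow==0.15.1" \
    "pytz==2021.1" \
    "six==1.15.0" \
    "python-dateutil==2.8.1" \
    "numpy==1.16.5"
```

After docker build, log in to your own ECR registry, docker push the image, and pass its URI to the processor as above.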