oren4322 opened this issue 3 years ago
Hi @oren4322, I also have a use case for pandas inside the Spark container. Instead of building a new image, can't I just extend the existing distributed one and install the packages using pip in the new Dockerfile?
Something like:
FROM 759080221371.dkr.ecr.ap-southeast-1.amazonaws.com/sagemaker-spark-processing:3.0-cpu-py37-v1.2
RUN pip install pandas
It is possible, but installing such libraries at runtime on every job, instead of having them baked into the image, results in longer execution time. Also, since this seems like a common enough requirement, it should be included in the AWS image.
@narayanshivansh49 Is it as simple as creating a Dockerfile with just what you have above, building it, registering it on ECR, and then using the image URI in, say, the PySparkProcessor class?
@BrianMiner Yes, I can confirm it is as simple as that.
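For anyone who lands here later, here is a minimal sketch of that last step, assuming the custom image has already been built and pushed to ECR. The account ID, region, repository name, role ARN, and script name below are placeholders, not values from this thread:

from sagemaker.spark.processing import PySparkProcessor

# Placeholder values: substitute your own account/region, ECR repository,
# SageMaker execution role ARN, and PySpark script.
image_uri = "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-spark-pandas:latest"
role = "arn:aws:iam::123456789012:role/MySageMakerExecutionRole"

processor = PySparkProcessor(
    base_job_name="spark-pandas",
    image_uri=image_uri,  # custom image used in place of the stock Spark image
    role=role,
    instance_count=2,
    instance_type="ml.m5.xlarge",
)

# submit_app points at the PySpark script to run inside the container.
processor.run(submit_app="preprocess.py")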
@BrianMiner or @govind-govind Do you have an example of the Dockerfile?
When using a pandas UDF, additional libraries need to be installed: "pandas==0.24.2", "requests==2.21.0", "pyarrow==0.15.1", "pytz==2021.1", "six==1.15.0", "python-dateutil==2.8.1", "numpy==1.16.5". Since these libraries are fairly heavy, this almost always requires creating a new image rather than using the prebuilt one from this repo (or installing them via pip inside the container at runtime). The pandas UDF is the recommended UDF type, so including these libraries in the image is preferred.
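For context, here is a minimal sketch of the kind of pandas UDF that needs those libraries on the executors. Row batches are exchanged with the JVM as pandas Series serialized over Arrow, which is why pyarrow is required alongside pandas; the function and column names are made up for illustration:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# A scalar pandas UDF: each batch of rows arrives as a pandas Series,
# serialized over Arrow, so pandas and pyarrow must be present on every executor.
@pandas_udf(DoubleType())
def celsius_to_fahrenheit(c: pd.Series) -> pd.Series:
    return c * 9.0 / 5.0 + 32.0

df = spark.createDataFrame([(0.0,), (100.0,)], ["celsius"])
df.withColumn("fahrenheit", celsius_to_fahrenheit("celsius")).show()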