delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, and APIs
https://delta.io
Apache License 2.0

[Feature Request] The delta-spark dependency pyspark package is too large #3789

Open melin opened 1 month ago

melin commented 1 month ago

Installing the delta-spark Python package in the Spark image forces pip to download pyspark as a dependency, and the pyspark package is more than 370 MB. Is there a way to avoid increasing the size of the Spark image?

FROM spark:3.5.3-scala2.12-java11-ubuntu

USER root

# Install Python and pip, then clean the apt cache to keep the layer small.
RUN set -ex; \
    apt-get update; \
    apt-get install -y python3 python3-pip; \
    rm -rf /var/lib/apt/lists/*

# delta-spark pulls in the ~370 MB pyspark package as a dependency.
RUN pip install requests aspectlib delta-spark

ADD build/docker/aspectjweaver-1.9.22.1.jar /opt/spark/

ADD build/docker/jars/ \
    build/docker/datatunnel-3.5.0/ \
    spark-jobserver-driver/target/spark-jobserver-driver-3.5.0.jar \
    spark-jobserver-extensions/target/spark-jobserver-extensions-3.5.0.jar /opt/spark/jars/

USER spark
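
One possible workaround, sketched below: the Spark base images already bundle PySpark under /opt/spark/python, so pip's `--no-deps` flag can skip the pyspark download entirely. This assumes the bundled copy satisfies delta-spark at runtime and that `importlib_metadata` is delta-spark's only other required dependency; both the dependency list and the py4j zip version on the PYTHONPATH should be verified against your Spark version.

```dockerfile
FROM spark:3.5.3-scala2.12-java11-ubuntu

USER root

RUN set -ex; \
    apt-get update; \
    apt-get install -y python3 python3-pip; \
    rm -rf /var/lib/apt/lists/*

# --no-deps keeps pip from downloading the ~370 MB pyspark wheel;
# the base image already ships PySpark under /opt/spark/python.
# importlib_metadata is installed explicitly since --no-deps skips it
# (assumption: check delta-spark's requirements for your version).
RUN pip install --no-cache-dir --no-deps delta-spark && \
    pip install --no-cache-dir requests aspectlib importlib_metadata

# Make the bundled PySpark importable for plain `python3` invocations.
# The py4j version here is an assumption; match the zip shipped in the image.
ENV PYTHONPATH=/opt/spark/python:/opt/spark/python/lib/py4j-0.10.9.7-src.zip

USER spark
```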
Pshak-20000 commented 4 weeks ago

Hi, to minimize the size of the Spark image while adding delta-spark, I suggest we consider:

- Using a lighter base image.
- Installing only the necessary dependencies instead of the entire pyspark package.
- Implementing multi-stage builds to keep only essential files.
- Cleaning up temporary files and caches after installation.
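
The multi-stage suggestion could look roughly like the sketch below: install the Python packages in a throwaway builder stage and copy only the resulting site-packages into the Spark image, leaving pip's cache and build tooling behind. Note the caveat that pyspark itself still lands in the final image this way unless it is excluded with `--no-deps`; the `--prefix` path and stage names are illustrative.

```dockerfile
# Stage 1: install Python packages into an isolated prefix.
FROM python:3.11-slim AS builder
RUN pip install --no-cache-dir --prefix=/install requests aspectlib delta-spark

# Stage 2: copy only the installed packages into the Spark image.
# pip's download cache and the builder's toolchain never enter this stage.
FROM spark:3.5.3-scala2.12-java11-ubuntu
COPY --from=builder /install /usr/local
```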