IBM / data-prep-kit

Open source project for data preparation of LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

[Feature] Base spark image build is very slow and impacting ci/cd #606

Open daw3rd opened 1 month ago

daw3rd commented 1 month ago

Component

Library/core

Feature

The build of the base Spark image, which is used by the transforms enabled in the Spark runtime, is very slow: about 20 minutes in CI/CD builds. With the recent CI/CD change to run each transform separately, every Spark-based transform now performs this build. That is not too bad when only a single transform is changed, but when the core library is changed, all transforms are rebuilt, which makes the problem even more painful.

It seems to be happening on this step

RUN wget https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_MAJOR_VERSION}.tgz && \
    tar xf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_MAJOR_VERSION}.tgz -C /opt && \
    mv /opt/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_MAJOR_VERSION} /opt/spark

and specifically the wget.
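If the download is the bottleneck, one possible mitigation is to avoid repeating it when a copy of the tarball is already available. The following is only a minimal sketch, not the project's actual build logic: the helper names `spark_tarball_url` and `fetch_cached` and the cache directory are hypothetical, and `SPARK_VERSION` and `HADOOP_MAJOR_VERSION` are assumed to be set by the caller, as in the Dockerfile step above.

```shell
# Hypothetical helper: rebuild the download URL used in the Dockerfile step above.
# Assumes SPARK_VERSION and HADOOP_MAJOR_VERSION are already set by the caller.
spark_tarball_url() {
    echo "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_MAJOR_VERSION}.tgz"
}

# Hypothetical helper: download a file only if a non-empty copy is not cached yet.
#   $1 = URL to fetch, $2 = cache directory
fetch_cached() {
    url="$1"
    cache_dir="$2"
    file="$cache_dir/$(basename "$url")"
    if [ -s "$file" ]; then
        # A non-empty copy already exists, so skip the slow download.
        echo "cache hit: $file"
    else
        mkdir -p "$cache_dir"
        # Download to a temporary name so an interrupted transfer is never
        # mistaken for a complete tarball on the next run.
        wget -q -O "$file.part" "$url" && mv "$file.part" "$file" \
            && echo "downloaded: $file"
    fi
}
```

A call such as `fetch_cached "$(spark_tarball_url)" "$HOME/.cache/spark"` would then only hit archive.apache.org on the first run. Whether the CI/CD runners keep such a directory between jobs is an assumption that would need to be checked.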

To build the image

cd data-processing-lib/spark
make image

daw3rd commented 1 month ago

We may have discussed this already, but using a base Spark image would probably help here.

cmadam commented 1 month ago

Why is the base image rebuilt every time a transform image is built? By default, the transform image uses a base image from a docker registry. That does not require rebuilding the base image for every CI/CD run. It sounds like something needs to be changed in the CI/CD pipeline itself.
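As a rough sketch of what "use the registry image unless it is missing" could look like in a CI/CD step: the function below first tries to pull the base image and only falls back to a local build when the pull fails. The function name `ensure_base_image` and the image reference are hypothetical (the thread does not name the registry path), and the fallback reuses the build commands quoted earlier in this issue.

```shell
# Hypothetical sketch: reuse a published base image and build it only as a fallback.
#   $1 = full image reference, e.g. registry/name:tag (not specified in this thread)
ensure_base_image() {
    image="$1"
    if docker pull "$image" >/dev/null 2>&1; then
        # The image is already published, so no 20-minute rebuild is needed.
        echo "reusing $image"
    else
        echo "building $image"
        # Fallback: the local build commands from the issue description.
        (cd data-processing-lib/spark && make image)
    fi
}
```

With this shape, a transform job would call `ensure_base_image "$BASE_IMAGE"` before building its own image, so the slow build runs only when the pull fails, for example when the tag has not been published. A real pipeline would probably also need to rebuild and push the base image whenever the Spark Dockerfile itself changes, which this sketch does not handle.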

blublinsky commented 1 month ago

++++1 to @cmadam