eto-ai / rikai

Parquet-based ML data format optimized for working with unstructured data
https://rikai.readthedocs.io/en/latest/
Apache License 2.0
136 stars, 19 forks

Clean up docker image #699

Open changhiskhan opened 1 year ago

changhiskhan commented 1 year ago

The Docker image was broken and too complicated to maintain. I simplified the build by using two builder images (one for the jar, one for the Python wheels). I've also added some cleanup to reduce the final image size (now ~4.5GB).

In the image itself I only include coco and mojito.

I've also removed scala 2.13 from the GH actions matrix since we're stuck on 2.12 with pyspark for now.

One possible annoyance: with the jar builder, I'm putting the jar directly on the Spark classpath, so I've removed the part of the notebooks that downloads the rikai jar as a separate dependency. This means that if you run the notebooks on their own, you'll need to add it back. Happy to chat if this is a problem.

da-liii commented 1 year ago

Why not upgrade pyspark from 3.1.2 to 3.1.3 in a separate pull request?

da-liii commented 1 year ago

We are using Spark 3.2.x. https://github.com/eto-ai/rikai/issues/684

I'd like to deprecate Spark 3.1.x.

changhiskhan commented 1 year ago

we could update to 3.2.x - is Tubi all on 3.2.x now?

@ffcai what version of spark are you guys using?

da-tubi commented 1 year ago

> we could update to 3.2.x - is Tubi all on 3.2.x now?

Yes. Since we use Databricks, we had to upgrade the Databricks runtime version because Databricks is deprecating the old one.

da-liii commented 1 year ago

The first error was "no space left on device". I increased the disk quota and that fixed it. Here is the second error:

 => ERROR [whl_builder 6/6] RUN pip3 wheel -r /opt/rikai/python/docker-requirements.txt                                                                                                                               1283.8s
------
 > [whl_builder 6/6] RUN pip3 wheel -r /opt/rikai/python/docker-requirements.txt:
#13 81.40 Collecting torch>=1.8.1
#13 82.97   Downloading torch-1.12.0-cp39-cp39-manylinux1_x86_64.whl (776.3 MB)
#13 1277.4      ━━━━━━━━━━━━━━                        299.6/776.3 MB 664.6 kB/s eta 0:11:58
#13 1279.3 ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
#13 1279.3     torch>=1.8.1 from https://files.pythonhosted.org/packages/8f/27/addb0019d7aa3704576ca9c055f7566a3db31f95110e55b31173b87aec4a/torch-1.12.0-cp39-cp39-manylinux1_x86_64.whl#sha256=844f1db41173b53fe40c44b3e04fcca23a6ce00ac328b7099f2800e611766845 (from -r /opt/rikai/python/docker-requirements.txt (line 2)):
#13 1279.3         Expected sha256 844f1db41173b53fe40c44b3e04fcca23a6ce00ac328b7099f2800e611766845
#13 1279.3              Got        45984e61e215ca5985f60c7a64444cab4dcc7dfb9588be4017f7f82cb37b455d
#13 1279.3
#13 1280.8 WARNING: You are using pip version 22.0.3; however, version 22.2 is available.
#13 1280.8 You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
------
executor failed running [/bin/sh -c pip3 wheel -r /opt/rikai/python/docker-requirements.txt]: exit code: 1
ERROR: Service 'quickstart' failed to build : Build failed

da-liii commented 1 year ago

With this patch suggested by @changhiskhan:

diff --git a/Dockerfile b/Dockerfile
index e244510..be63155 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -18,6 +18,7 @@ COPY ./python /opt/rikai/python
 COPY ./README.md /opt/rikai/README.md
 WORKDIR /opt/rikai/python
 RUN python3 setup.py bdist_wheel
+RUN pip3 cache purge
 RUN pip3 wheel -r /opt/rikai/python/docker-requirements.txt

 FROM apache/spark-py:v${SPARK_VERSION} AS jupyter

It works fine for me now.
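
For anyone who hits the same hash mismatch: pip compares the SHA-256 of the file it ends up with against the hash in the download URL, so a truncated or stale cached wheel fails the check. That this is what happened here is an assumption, since the log only shows the mismatch itself. Below is a minimal Python sketch of the same kind of check, in case you want to verify a downloaded wheel by hand. The helper names `sha256_of` and `matches_expected` are made up for illustration and are not part of pip or rikai.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`.

    The file is read in chunks so that large wheels (the torch wheel in the
    log above is ~776 MB) do not need to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Hypothetical helper: True if the file's digest equals `expected_hex`."""
    return sha256_of(path) == expected_hex.strip().lower()
```

To use it, pass the path of the downloaded wheel and the "Expected sha256" value from the pip error. If `matches_expected` returns False, the local copy differs from what the index published, and downloading it again (or clearing the pip cache, as in the patch above) is a reasonable next step.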