datahub-project / datahub

The Metadata Platform for your Data Stack
https://datahubproject.io
Apache License 2.0

Unable to do Profiling in air-gapped environment due to absence of Spark+Hadoop Binary - Custom Actions Image is Failing #9425

Closed gopikaops closed 8 months ago

gopikaops commented 11 months ago

Profiles are computed with PyDeequ, which relies on PySpark. Therefore, for computing profiles, we currently require Spark 3.0.3 with Hadoop 3.2 to be installed, and the SPARK_HOME and SPARK_VERSION environment variables to be set. The Spark+Hadoop binary can be downloaded from the Apache archive.
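The prerequisites above can be sanity-checked before running profiling. This is a minimal sketch of our own (not official DataHub tooling); the example paths are illustrative assumptions:

```shell
# Sketch: verify the Spark prerequisites that profiling via
# PyDeequ/PySpark expects. Returns non-zero with a message if
# anything is missing.
check_spark_env() {
    # SPARK_HOME and SPARK_VERSION are both required by the profiler
    [ -n "${SPARK_HOME:-}" ]    || { echo "SPARK_HOME is not set" >&2; return 1; }
    [ -n "${SPARK_VERSION:-}" ] || { echo "SPARK_VERSION is not set" >&2; return 1; }
    # spark-submit must exist and be executable under SPARK_HOME
    [ -x "${SPARK_HOME}/bin/spark-submit" ] || {
        echo "spark-submit missing under ${SPARK_HOME}/bin" >&2; return 1; }
    echo "Spark ${SPARK_VERSION} found at ${SPARK_HOME}"
}

# example (paths are assumptions):
#   export SPARK_HOME=/usr/bin/spark-3.0.3-bin-hadoop3.2
#   export SPARK_VERSION=3.0.3
#   check_spark_env
```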

Since we work in an air-gapped environment, we looked at how DataHub downloads the binary and found the relevant code in the DataHub git repo.

This code is in SparkBase.Dockerfile - https://github.com/datahub-project/datahub/blob/3e79a1325cf8eca29a8bb818a50762366bfd5d22/metadata-integration/java/spark-lineage/spark-smoke-test/docker/SparkBase.Dockerfile#L4

ARG spark_version=3.0.3
ARG hadoop_version=3.2

RUN curl -sS https://archive.apache.org/dist/spark/spark-${spark_version}/spark-${spark_version}-bin-hadoop${hadoop_version}.tgz -o spark.tgz && \
    tar -xf spark.tgz && \
    mv spark-${spark_version}-bin-hadoop${hadoop_version} /usr/bin/ && \
    mkdir /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/logs && \
    rm spark.tgz && \
    rm -rf /var/tmp/* /tmp/* /var/lib/apt/lists/*

RUN set -e; \
    pip install JPype1

ENV PYSPARK_PYTHON python3.10
ENV PATH=$PATH:$SPARK_HOME/bin

This Dockerfile is built by build_images.sh - https://github.com/datahub-project/datahub/blob/3e79a1325cf8eca29a8bb818a50762366bfd5d22/metadata-integration/java/spark-lineage/spark-smoke-test/docker/build_images.sh#L22

which is called in setup_spark_smoke_test.sh - https://github.com/datahub-project/datahub/blob/3e79a1325cf8eca29a8bb818a50762366bfd5d22/metadata-integration/java/spark-lineage/spark-smoke-test/setup_spark_smoke_test.sh#L25

which is called in smoke.sh - https://github.com/datahub-project/datahub/blob/3e79a1325cf8eca29a8bb818a50762366bfd5d22/metadata-integration/java/spark-lineage/spark-smoke-test/smoke.sh#L53

which is defined in build.gradle - https://github.com/datahub-project/datahub/blob/3e79a1325cf8eca29a8bb818a50762366bfd5d22/metadata-integration/java/spark-lineage/build.gradle#L150

We added this code to build our custom actions image:

FROM --platform=linux/amd64 acryldata/datahub-actions:v0.0.13

USER root

# -- Layer: Apache Spark

ARG spark_version=3.0.3
ARG hadoop_version=3.2

RUN curl -sS https://archive.apache.org/dist/spark/spark-${spark_version}/spark-${spark_version}-bin-hadoop${hadoop_version}.tgz -o spark.tgz && \
    tar -xf spark.tgz && \
    mv spark-${spark_version}-bin-hadoop${hadoop_version} /usr/bin/ && \
    mkdir /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/logs && \
    rm spark.tgz && \
    rm -rf /var/tmp/* /tmp/* /var/lib/apt/lists/*

RUN set -e; \
    pip install JPype1

ENV PYSPARK_PYTHON python3.10
ENV PATH=$PATH:$SPARK_HOME/bin

USER datahub

However, it led to ingestion failing with the following error:

~~~~ Execution Summary - RUN_INGEST ~~~~
Execution finished with errors.
{'exec_id': 'f81269bc-8b5b-4ac4-b1eb-362030f141c3',
 'infos': ['2023-11-30 18:00:04.812413 INFO: Starting execution for task with name=RUN_INGEST',
           '2023-11-30 18:00:04.816466 INFO: Caught exception EXECUTING task_id=f81269bc-8b5b-4ac4-b1eb-362030f141c3, name=RUN_INGEST, '
           'stacktrace=Traceback (most recent call last):\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
           '    task_event_loop.run_until_complete(task_future)\n'
           '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
           '    return future.result()\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 82, in execute\n'
           '    full_log_file = open(f"{self.config.log_dir}/ingestion-{exec_id}.txt", "w")\n'
           "FileNotFoundError: [Errno 2] No such file or directory: '/tmp/datahub/logs/ingestion-f81269bc-8b5b-4ac4-b1eb-362030f141c3.txt'\n"],
 'errors': []}

~~~~ Ingestion Logs ~~~~

We spoke to the team during the PoC event, and they suggested they could make a custom Docker image with the required binaries for air-gapped environments.

However, we are not sure why ingestion failed when we tried building our own custom image.
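One plausible reading (our guess, not confirmed in the thread): the stack trace fails to open a log file under /tmp/datahub/logs, and the added layer runs `rm -rf /var/tmp/* /tmp/*`, which would delete anything the base actions image had placed under /tmp. A quick check from inside the running container could confirm this; the directory path below comes straight from the stack trace:

```shell
# Sketch: report whether the executor's log directory exists and is
# writable by the current user (should be run as the datahub user).
check_log_dir() {
    # $1: the executor log directory, e.g. /tmp/datahub/logs
    local dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
    elif [ ! -w "$dir" ]; then
        echo "not writable: $dir"
    else
        echo "ok: $dir"
    fi
}

# example: check_log_dir /tmp/datahub/logs
```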

github-actions[bot] commented 10 months ago

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

david-leifker commented 10 months ago

I would go through the files being changed as the root user and then at the end change the permissions back to the normal datahub user. @gopikaops
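That suggestion might look roughly like the sketch below, a hedged rework of the custom Dockerfile from the issue. Two changes are our own assumptions: the cleanup no longer touches /tmp/* (so the executor's log directory under /tmp/datahub survives, and is recreated with datahub ownership before dropping privileges), and SPARK_HOME is set explicitly, since the original snippet referenced it in PATH without defining it:

```dockerfile
FROM --platform=linux/amd64 acryldata/datahub-actions:v0.0.13

USER root

# -- Layer: Apache Spark

ARG spark_version=3.0.3
ARG hadoop_version=3.2

# Note: /tmp/* is deliberately NOT cleaned here, because the executor
# writes its ingestion logs under /tmp/datahub/logs (see stack trace).
RUN curl -sS https://archive.apache.org/dist/spark/spark-${spark_version}/spark-${spark_version}-bin-hadoop${hadoop_version}.tgz -o spark.tgz && \
    tar -xf spark.tgz && \
    mv spark-${spark_version}-bin-hadoop${hadoop_version} /usr/bin/ && \
    mkdir /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/logs && \
    rm spark.tgz && \
    rm -rf /var/tmp/* /var/lib/apt/lists/*

RUN set -e; \
    pip install JPype1

ENV SPARK_HOME=/usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}
ENV PYSPARK_PYTHON=python3.10
ENV PATH=$PATH:$SPARK_HOME/bin

# Recreate the executor's log directory and hand the files changed as
# root back to the datahub user before dropping privileges.
RUN mkdir -p /tmp/datahub/logs && \
    chown -R datahub /tmp/datahub ${SPARK_HOME}

USER datahub
```

The `datahub` user name is taken from the USER directive in the custom Dockerfile above; the exact ownership the base image expects for /tmp/datahub has not been verified against v0.0.13.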

github-actions[bot] commented 9 months ago

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

github-actions[bot] commented 8 months ago

This issue was closed because it has been inactive for 30 days since being marked as stale.