Adding new jar to Spark is not detected #22209

Closed j-adamczyk closed 1 year ago

j-adamczyk commented 1 year ago

Name and Version

bitnami/spark

What steps will reproduce the bug?

I added Postgres and Elasticsearch jars to Spark, following the documentation:

FROM bitnami/spark

ENV POSTGRESQL_JAR="postgresql-42.5.1.jar" \
    ELASTICSEARCH_JAR="elasticsearch-spark-30_2.12-8.6.0.jar"

USER root
RUN apt-get update && \
    apt-get -y install --no-install-recommends curl

USER 1001
RUN curl https://jdbc.postgresql.org/download/${POSTGRESQL_JAR} \
        --output /opt/bitnami/spark/jars/${POSTGRESQL_JAR} && \
    curl https://repo1.maven.org/maven2/org/elasticsearch/elasticsearch-spark-30_2.12/8.6.0/${ELASTICSEARCH_JAR} \
        --output /opt/bitnami/spark/jars/${ELASTICSEARCH_JAR}

In a Jupyter Notebook (Python) I have tried:

from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    # spark_url is defined earlier in the notebook; presumably the
    # master URL, e.g. "spark://spark-master:7077"
    .master(spark_url)
    .appName("ABP Coupons Usages")
    .config("spark.jars", "elasticsearch-spark-30_2.12-8.6.0.jar")
    .getOrCreate()
)

df = (
    spark.read
    .format("es")
    .option("es.read.metadata", "false")
    .option("es.nodes.wan.only", "true")
    .option("es.net.ssl", "false")
    .option("es.port", "9200")
    .option("es.nodes", "http://localhost")
    .load("events-index")
)
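
For reference, a standard alternative to shipping the jar in the image and pointing spark.jars at a bare file name is to let Spark resolve the connector from Maven at session start via spark.jars.packages. A minimal sketch; the master URL, app name, and es.nodes values are assumptions based on the Docker Compose file further below:

from pyspark.sql import SparkSession

# Maven coordinates matching the jar version above; Spark downloads the
# connector for the driver and ships it to the executors automatically.
spark = (
    SparkSession
    .builder
    .master("spark://spark-master:7077")  # assumed; matches the compose service name
    .appName("es-packages-test")
    .config("spark.jars.packages", "org.elasticsearch:elasticsearch-spark-30_2.12:8.6.0")
    .getOrCreate()
)

df = (
    spark.read
    .format("org.elasticsearch.spark.sql")  # the connector's full data source name
    .option("es.nodes.wan.only", "true")
    .option("es.nodes", "elasticsearch")  # compose service name rather than localhost
    .option("es.port", "9200")
    .load("events-index")
)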

What is the expected behavior?

Spark should pick up the added jars and connect to Elasticsearch.

What do you see instead?

I receive an error:

Py4JJavaError: An error occurred while calling o55.load.
: java.lang.ClassNotFoundException: 
Failed to find data source: es. Please find packages at
https://spark.apache.org/third-party-projects.html

Additional information

  1. I've also tried org.elasticsearch.spark.sql instead of es; the error is the same.
  2. print(spark.sparkContext._jsc.sc().listJars()) confirms that no additional jars are loaded (see the sketch below).
  3. Manually checking the container via docker exec -it dockerID /bin/bash confirmed that the jars are indeed in /opt/bitnami/spark/jars. They have the same permissions as the regular Spark jars, except for public write; I changed that via chmod 777, but the result is the same.
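
To illustrate point 2, a quick diagnostic that can be run in the notebook. The absolute path is an assumption based on the bind mount of $PWD to /usr/data_processing in the compose file below; the point is that spark.jars paths are resolved by the driver process (the Jupyter container), not inside the Spark containers:

import os

from pyspark.sql import SparkSession

# Assumed location of the jar inside the Jupyter container, given the
# $PWD -> /usr/data_processing bind mount from docker-compose.yml.
jar_path = "/usr/data_processing/elasticsearch-spark-30_2.12-8.6.0.jar"
print(os.path.exists(jar_path))  # the driver must be able to read this file

spark = (
    SparkSession
    .builder
    .master("spark://spark-master:7077")  # assumed master URL
    .appName("jar-visibility-test")
    .config("spark.jars", jar_path)
    .getOrCreate()
)

# With a path the driver can resolve, the jar should now appear here;
# an empty list means the driver never picked it up.
print(spark.sparkContext._jsc.sc().listJars())
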
dgomezleon commented 1 year ago

Hello @j-adamczyk,

I can see you are facing errors with the ES jar. What version of Spark are you using? Did you try a different jar? Could you please give us more details about your environment and Jupyter Notebook configuration?

j-adamczyk commented 1 year ago

I am using the bitnami/spark image. I have also tried 3.3.1 and a few other 3.X versions. I did not try a different jar, since I need the one compatible with my Elasticsearch version, which is 8.6.0, the same as the jar version. My environment for Jupyter Notebook is also in Docker, in the same Docker Compose setup as Spark.

Docker Compose:

version: '3'

services:
  db:
    image: postgres:15
    restart: always
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: db_dev
    ports:
      - "5432:5432"
    volumes:
      - db:/var/lib/postgresql/data

  elasticsearch:
    image: elasticsearch:8.6.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
      - "9211:9300"
    volumes:
      - elasticsearch:/usr/share/elasticsearch/data

  spark-master:
    image: bitnami/spark:3.3.1
    environment:
      SPARK_MODE: master
      SPARK_RPC_AUTHENTICATION_ENABLED: no
      SPARK_RPC_ENCRYPTION_ENABLED: no
      SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED: no
      SPARK_SSL_ENABLED: no
    ports:
      - "8080:8080"
  spark-worker:
    image: bitnami/spark:3.3.1
    environment:
      SPARK_MODE: worker
      SPARK_MASTER_URL: "spark://spark:7077"  # note: the master service is named spark-master, so spark://spark-master:7077 is likely intended
      SPARK_WORKER_MEMORY: 1G
      SPARK_WORKER_CORES: 1
      SPARK_RPC_AUTHENTICATION_ENABLED: no
      SPARK_RPC_ENCRYPTION_ENABLED: no
      SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED: no
      SPARK_SSL_ENABLED: no

  jupyter_notebook:
    image: data_processing_jupyter_notebook
    build:
      # move to the project root to start Jupyter Notebook there
      context: ../
      dockerfile: docker/Dockerfile_Jupyter
    env_file: .env
    ports:
      - "8888:8888"
    command: "jupyter notebook --port=8888 --no-browser --ip=0.0.0.0 --allow-root --NotebookApp.token=''"
    volumes:
      - $PWD:/usr/data_processing
    working_dir: "/usr/data_processing"
    profiles: ["jupyter_notebook"]

volumes:
  db:
    driver: local
  elasticsearch:
    driver: local
  jupyter_notebook:
    driver: local

Jupyter Notebook Dockerfile:

FROM python:3.10-slim-bullseye

ENV POETRY_VIRTUALENVS_CREATE=false \
    PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1

USER root
RUN apt-get update && \
    apt-get -y install --no-install-recommends \
    default-jdk \
    gcc \
    libc-dev \
    libpq-dev \
    python3-dev

# the build context is the project root (see docker-compose.yml), so
# paths must be relative to it rather than climbing out with ../
COPY poetry.lock ./poetry.lock
COPY pyproject.toml ./pyproject.toml
RUN pip install poetry
RUN poetry install --no-interaction

CMD "jupyter notebook --port=8888 --no-browser --ip=0.0.0.0 --allow-root --NotebookApp.token=''"

My dependencies for Jupyter Notebook are managed via Poetry. My pyproject.toml file:

[tool.poetry]
name = "data_processing"
version = "1.0.0"
description = ""
authors = ["Data team"]

[tool.poetry.dependencies]
python = "^3.10"

numpy = "1.24.*"
pandas = "1.5.*"
psycopg2 = "2.*"
pyspark = "3.*"
sqlalchemy = "2.*"

[tool.poetry.dev-dependencies]
autoflake = "2.*"
black = {extras = ["jupyter"], version = "22.*"}
elasticsearch = "8.*"
isort = "5.*"
jupyter = "*"
pre-commit = "3.*"
pytest = "7.*"
pyupgrade = "3.*"
pyyaml = "6.*"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
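
Worth noting: pyspark = "3.*" can resolve to a newer client than the 3.3.1 cluster, and mismatched client and cluster versions cause their own errors. A quick sanity check from the notebook:

import pyspark

# the client version should match the bitnami/spark image tag (3.3.1 here)
print(pyspark.__version__)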

My directory structure:

- docker:
  - docker-compose.yml
  - Dockerfile_Jupyter
- Makefile
- poetry.lock
- pyproject.toml
- notebook.ipynb
- elasticsearch-spark-30_2.12-8.6.0.jar

I am running Docker Compose using Makefile:

start:
    COMPOSE_DOCKER_CLI_BUILD=1 DOCKER_BUILDKIT=1 \
    docker compose --file docker/docker-compose.yml --profile jupyter_notebook up --build --detach
dgomezleon commented 1 year ago

Hi @j-adamczyk

The Spark jar-loading functionality seems to be working, since the jars are placed in the jars directory. I would try with a different JAR to check whether the issue is related to the JAR itself or to the Jupyter Notebook setup (for example, something like the sketch below).
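
A cross-check along those lines against the db service from the compose file, fetching the PostgreSQL JDBC driver by Maven coordinates so it reaches both the driver and the executors (service names and credentials are taken from the docker-compose.yml above; the master URL is an assumption):

from pyspark.sql import SparkSession

# If a JDBC read works with a jar pulled this way, the jar-loading
# mechanism is fine and the problem is specific to the ES connector.
spark = (
    SparkSession
    .builder
    .master("spark://spark-master:7077")  # assumed master URL
    .appName("jdbc-jar-test")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.5.1")
    .getOrCreate()
)

df = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://db:5432/db_dev")
    .option("driver", "org.postgresql.Driver")
    .option("user", "postgres")
    .option("password", "postgres")
    .option("dbtable", "pg_catalog.pg_tables")  # a catalog table that always exists
    .load()
)
df.show()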

This seems to be a very specific use case that is difficult to reproduce on our side and is very tied to your scenario.

For information regarding the application itself, customization of the content within the application, or questions about the use of the technology or infrastructure, we highly recommend checking the forums and user guides made available by the project behind the application or the technology.

That said, we will keep this ticket open until the stale bot closes it just in case someone from the community adds some valuable info.

github-actions[bot] commented 1 year ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

github-actions[bot] commented 1 year ago

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.