Closed: j-adamczyk closed this issue 1 year ago.
Hello @j-adamczyk,
I can see you are facing errors with the ES jar. What version of Spark are you using? Did you try a different jar? Could you please give us more details about your environment and Jupyter Notebook configuration?
I am using the bitnami/spark Spark image. I have also tried 3.3.1 and a few other 3.x versions. I did not try a different jar, since I need the one compatible with my Elasticsearch version (8.6.0), and the jar version matches it. My Jupyter Notebook environment is also in Docker, in the same Docker Compose setup as Spark.
Docker Compose:
```yaml
version: '3'

services:
  db:
    image: postgres:15
    restart: always
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: db_dev
    ports:
      - "5432:5432"
    volumes:
      - db:/var/lib/postgresql/data

  elasticsearch:
    image: elasticsearch:8.6.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
      - "9211:9300"
    volumes:
      - elasticsearch:/usr/share/elasticsearch/data

  spark-master:
    image: bitnami/spark:3.3.1
    environment:
      SPARK_MODE: master
      SPARK_RPC_AUTHENTICATION_ENABLED: no
      SPARK_RPC_ENCRYPTION_ENABLED: no
      SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED: no
      SPARK_SSL_ENABLED: no
    ports:
      - "8080:8080"

  spark-worker:
    image: bitnami/spark:3.3.1
    environment:
      SPARK_MODE: worker
      SPARK_MASTER_URL: "spark://spark:7077"
      SPARK_WORKER_MEMORY: 1G
      SPARK_WORKER_CORES: 1
      SPARK_RPC_AUTHENTICATION_ENABLED: no
      SPARK_RPC_ENCRYPTION_ENABLED: no
      SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED: no
      SPARK_SSL_ENABLED: no

  jupyter_notebook:
    image: data_processing_jupyter_notebook
    build:
      # move to the project root to start Jupyter Notebook there
      context: ../
      dockerfile: docker/Dockerfile_Jupyter
    env_file: .env
    ports:
      - "8888:8888"
    command: "jupyter notebook --port=8888 --no-browser --ip=0.0.0.0 --allow-root --NotebookApp.token=''"
    volumes:
      - $PWD:/usr/data_processing
    working_dir: "/usr/data_processing"
    profiles: ["jupyter_notebook"]

volumes:
  db:
    driver: local
  elasticsearch:
    driver: local
  jupyter_notebook:
    driver: local
```
Jupyter Notebook Dockerfile:
```dockerfile
FROM python:3.10-slim-bullseye

ENV POETRY_VIRTUALENVS_CREATE=false \
    PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1

USER root

RUN apt-get update && \
    apt-get -y install --no-install-recommends \
        default-jdk \
        gcc \
        libc-dev \
        libpq-dev \
        python3-dev

# The build context is the project root (see docker-compose.yml), so these
# paths are relative to it; "COPY ../..." would be outside the context.
COPY poetry.lock ./poetry.lock
COPY pyproject.toml ./pyproject.toml

RUN pip install poetry
RUN poetry install --no-interaction

CMD jupyter notebook --port=8888 --no-browser --ip=0.0.0.0 --allow-root --NotebookApp.token=''
```
My dependencies for Jupyter Notebook are managed via Poetry. My pyproject.toml file:
```toml
[tool.poetry]
name = "data_processing"
version = "1.0.0"
description = ""
authors = ["Data team"]

[tool.poetry.dependencies]
python = "^3.10"
numpy = "1.24.*"
pandas = "1.5.*"
psycopg2 = "2.*"
pyspark = "3.*"
sqlalchemy = "2.*"

[tool.poetry.dev-dependencies]
autoflake = "2.*"
black = {extras = ["jupyter"], version = "22.*"}
elasticsearch = "8.*"
isort = "5.*"
jupyter = "*"
pre-commit = "3.*"
pytest = "7.*"
pyupgrade = "3.*"
pyyaml = "6.*"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```
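One thing worth noting about these dependencies: `pyspark = "3.*"` can resolve to a newer 3.x client than the `bitnami/spark:3.3.1` cluster image, and a driver/cluster version mismatch is a common source of confusing errors. A hedged suggestion (assuming the cluster stays on 3.3.1) would be to pin the client to the cluster version:

```toml
[tool.poetry.dependencies]
# Pin the client to the cluster's Spark version
# (assumption: the cluster keeps running bitnami/spark:3.3.1)
pyspark = "3.3.1"
```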
My directory structure:
- docker/
  - docker-compose.yml
  - Dockerfile_Jupyter
- Makefile
- poetry.lock
- pyproject.toml
- notebook.ipynb
- elasticsearch-spark-30_2.12-8.6.0.jar
I am running Docker Compose using a Makefile:

```makefile
start:
	COMPOSE_DOCKER_CLI_BUILD=1 DOCKER_BUILDKIT=1 \
	docker compose --file docker/docker-compose.yml --profile jupyter_notebook up --build --detach
```
Hi @j-adamczyk
The Spark JARs functionality seems to be working, since the jars are placed in the `/jars` directory. I would try a different JAR to check whether the issue is related to the JAR itself or to the Jupyter notebook.
It seems a very specific use case, difficult to reproduce on our side and closely tied to your scenario.
For information regarding the application itself, customization of its content, or questions about the use of the technology or infrastructure, we highly recommend checking the forums and user guides made available by the project behind the application or the technology.
That said, we will keep this ticket open until the stale bot closes it just in case someone from the community adds some valuable info.
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.
Name and Version
bitnami/spark
What steps will reproduce the bug?
I added the Postgres and Elasticsearch jars to Spark, following the documentation:
In Jupyter Notebook (Python) I have tried:
What is the expected behavior?
Find appropriate jars and connect.
What do you see instead?
I receive an error:
Additional information
- I tried `org.elasticsearch.spark.sql` instead of `es`; same error.
- `print(spark.sparkContext._jsc.sc().listJars())` confirms that no additional jars are loaded.
- `docker exec -it dockerID /bin/bash` confirmed that the jars are indeed in the `/jars` directory. They have the same permissions as the regular Spark jars, except for public write. I changed that via `chmod 777`, but the result is the same.
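Since `listJars()` shows no extra jars, one possibility worth ruling out is that the jar is being configured after the driver JVM has already started: in PySpark, jars generally have to be supplied at session launch, for example through the `PYSPARK_SUBMIT_ARGS` environment variable set before the first `SparkSession` is created. A minimal sketch (the jar path is taken from the report above; everything else is an assumption, not the reporter's actual notebook code):

```python
import os

# Path to the connector jar inside the notebook container (from the report).
ES_JAR = "/usr/data_processing/elasticsearch-spark-30_2.12-8.6.0.jar"

# Jars must be known to spark-submit *before* the JVM starts. Setting this
# environment variable ahead of the first SparkSession creation is one way to
# do that from a notebook; "pyspark-shell" must remain the last token.
os.environ["PYSPARK_SUBMIT_ARGS"] = f"--jars {ES_JAR} pyspark-shell"

print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

After setting this and creating the `SparkSession`, re-running `spark.sparkContext._jsc.sc().listJars()` would show whether the jar is now actually registered with the driver.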