bitnami / containers

Bitnami container images
https://bitnami.com

[bitnami/spark] Troubleshooting Apache Spark Connect Server with Docker Compose #69194

Closed. parbatrajpurohit closed this issue 2 weeks ago

parbatrajpurohit commented 1 month ago

Name and Version

bitnami/spark:3.5.1

What architecture are you using?

None

What steps will reproduce the bug?

Here's a Docker Compose setup for a distributed Apache Spark environment using Bitnami's Spark image. It includes:

  1. spark-master: Runs the Spark master node on ports 8080 (web UI) and 7077.
  2. spark-connect: Starts a Spark connect server on port 15002, dependent on the master.
  3. spark-worker: A Spark worker node with 2 cores and 2GB memory, on port 8081.
  4. spark-worker2: Another worker node with similar specs, on port 8082.

All services are connected via a custom network and use a shared volume for data.

Once I submit a task to the Spark Connect server on port 15002 from my local machine, the Spark master distributes the workload to the workers. After some time, I can see the output in the PyCharm console. However, the application continues to run on the master, and the workers keep processing it.

To resolve this, I need to manually kill the application. If I try to run a new application after this, I encounter a gRPC error with a status code of 2, indicating an unknown error.

```yaml
version: '3.8'
services:
  spark-master:
    image: bitnami/spark
    container_name: spark-master
    environment:
      - SPARK_MODE=master
      - SPARK_MASTER_WEBUI_PORT=8080
      - SPARK_MASTER_PORT=7077
      - SPARK_SUBMIT_OPTIONS=--packages io.delta:delta-spark_2.12:3.2.0
      - SPARK_MASTER_HOST=spark-master
    ports:
      - 8080:8080
      - 7077:7077
    networks:
      - spark-network
    volumes:
      - /mnt/f/Thesis_Docs/Project/spark:/mnt

  spark-connect:
    image: bitnami/spark
    container_name: spark-connect
    environment:
      - SPARK_MODE=driver
      - SPARK_MASTER=spark://spark-master:7077
    ports:
      - 15002:15002
    networks:
      - spark-network
    depends_on:
      - spark-master
    command: ["/bin/bash", "-c", "/opt/bitnami/spark/sbin/start-connect-server.sh --master spark://spark-master:7077 --packages org.apache.spark:spark-connect_2.12:3.5.1"]
    volumes:
      - /mnt/f/Thesis_Docs/Project/spark:/mnt

  spark-worker:
    image: bitnami/spark
    container_name: spark-worker
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER=spark://spark-master:7077
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_WEBUI_PORT=8081
    ports:
      - 8081:8081
    depends_on:
      - spark-master
    networks:
      - spark-network

  spark-worker2:
    image: bitnami/spark
    container_name: spark-worker2
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER=spark://spark-master:7077
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_WEBUI_PORT=8082
    ports:
      - 8082:8082
    depends_on:
      - spark-master
    networks:
      - spark-network
networks:
  spark-network:
```
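Before running the client code below, it can help to confirm that the Connect server is actually listening. A minimal sketch (assuming the stack is up and port 15002 is published on localhost, as in the compose file above):

```python
import socket

# Hypothetical pre-flight check, not part of the original setup:
# confirm the Spark Connect port is reachable before creating a remote session.
with socket.create_connection(("localhost", 15002), timeout=5):
    print("Spark Connect port 15002 is reachable")
```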

Python Code:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# Perform your Spark operations
spark.range(15).show()
# Stop Spark session
spark.stop()
```

What is the expected behavior?

The expected behavior is for the Spark master to coordinate the distribution of tasks to the Spark workers, and for the Spark Connect server to accept task submissions from my local machine.

However, a few issues need to be addressed to ensure the system functions correctly and to prevent the gRPC error with status code 2.

Also, once the application completes in PyCharm and the result is shown in the PyCharm console, the task should also appear in the list of completed applications on the Web UI.

What do you see instead?

(two screenshots attached in the original issue)

Additional information

No response

javsalgar commented 1 month ago

Hi!

Did you try with other examples? Just to rule out that it is an issue with the example job.
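For example, a slightly heavier job than `spark.range(15).show()` could help here. A hypothetical sketch, reusing the same connection string as above; the groupBy forces a shuffle, so it exercises the workers rather than only the driver:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# Bucket a million ids and aggregate: the groupBy introduces a shuffle stage,
# which gives the workers real work compared to the trivial range example.
df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)
df.groupBy("bucket").count().orderBy("bucket").show()

spark.stop()
```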

parbatrajpurohit commented 1 month ago

Upon troubleshooting, I discovered why the application keeps running in the Spark Master Web UI, even though it shows as completed on the Spark Connect server.

The issue arises because the Spark Master Web UI treats the Spark Connect server as an application that runs indefinitely until we manually stop or kill the Spark Connect Docker container.

Is there a way to prevent this, so that only the applications I submit from my IDE appear in the Master Web UI after going through the Spark Connect server? That way, the Spark Connect server on port 15002 would be the single connection point for the full stack, and it would hand the work off to the Spark master.
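One way to observe this from outside the browser is the standalone master's JSON status page. A minimal sketch (assuming the master Web UI is published on localhost:8080 as in the compose file, and that the `/json/` endpoint with its `activeapps`/`completedapps` fields is available in this Spark version):

```python
import json
import urllib.request

# Fetch the standalone master's JSON view of its state and list applications.
# The Spark Connect server is expected to show up as a single long-running
# entry under "activeapps"; jobs submitted from the IDE run inside it rather
# than as separate applications.
with urllib.request.urlopen("http://localhost:8080/json/") as resp:
    status = json.load(resp)

for app in status.get("activeapps", []):
    print("active:", app.get("name"), app.get("state"))
for app in status.get("completedapps", []):
    print("completed:", app.get("name"), app.get("state"))
```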

javsalgar commented 1 month ago

Hi,

It is not clear to me whether this is an issue related to the Bitnami packaging of Spark or to the Spark application itself. Did you try raising this case with the upstream Spark developers?

github-actions[bot] commented 3 weeks ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

github-actions[bot] commented 2 weeks ago

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.