delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, and APIs
https://delta.io
Apache License 2.0

How to connect Spark with Delta and read/write data remotely? #3864

Open · majidebraa opened this issue 1 week ago

majidebraa commented 1 week ago

This is my docker-compose file:

version: '3.8'
services:
  spark-master:
    image: bitnami/spark
    container_name: spark-master
    environment:
      - SPARK_MODE=master
      - SPARK_MASTER_WEBUI_PORT=8080
      - SPARK_MASTER_PORT=7077
      - SPARK_SUBMIT_OPTIONS=--packages io.delta:delta-spark_2.12:3.2.0
      - SPARK_MASTER_HOST=spark-master
    ports:
      - 8080:8080
      - 7077:7077
    networks:
      - spark-network
    volumes:
      - ./files:/mnt

  spark-connect:
    image: bitnami/spark
    container_name: spark-connect
    environment:
      - SPARK_MODE=driver
      - SPARK_MASTER=spark://spark-master:7077
    ports:
      - "15002:15002"
    networks:
      - spark-network
    depends_on:
      - spark-master
    command: ["/bin/bash", "-c", "/opt/bitnami/spark/sbin/start-connect-server.sh --master spark://spark-master:7077 --packages org.apache.spark:spark-connect_2.12:3.5.1"]
    volumes:
      - ./files:/mnt

  spark-worker:
    image: bitnami/spark
    container_name: spark-worker
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER=spark://spark-master:7077
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_WEBUI_PORT=8081
    ports:
      - 8081:8081
    depends_on:
      - spark-master
    networks:
      - spark-network

  spark-worker2:
    image: bitnami/spark
    container_name: spark-worker2
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER=spark://spark-master:7077
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_WEBUI_PORT=8082
    ports:
      - 8082:8082
    depends_on:
      - spark-master
    networks:
      - spark-network
networks:
  spark-network:

I want to connect to this container and read/write Delta data with the following code:

from pyspark.sql import SparkSession

# Assumed connection string: 15002 is the Spark Connect port mapped in the compose file
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.range(0, 5)
# Write DataFrame to Delta Lake in 'delta' format
df.write.format("delta").mode("overwrite").save("/tmp/delta_table")

but I get this error:

Traceback (most recent call last):
  File "E:\PycharmProjects\delta_lake_test\delta_lake_test\main.py", line 52, in <module>
    df.write.format("delta").mode("overwrite").save("/tmp/delta_table")
  File "E:\PycharmProjects\delta_lake_test\delta_lake_test\.venv\lib\site-packages\pyspark\sql\connect\readwriter.py", line 601, in save
    self._spark.client.execute_command(self._write.command(self._spark.client))
  File "E:\PycharmProjects\delta_lake_test\delta_lake_test\.venv\lib\site-packages\pyspark\sql\connect\client\core.py", line 982, in execute_command
    data, _, _, _, properties = self._execute_and_fetch(req)
  File "E:\PycharmProjects\delta_lake_test\delta_lake_test\.venv\lib\site-packages\pyspark\sql\connect\client\core.py", line 1283, in _execute_and_fetch
    for response in self._execute_and_fetch_as_iterator(req):
  File "E:\PycharmProjects\delta_lake_test\delta_lake_test\.venv\lib\site-packages\pyspark\sql\connect\client\core.py", line 1264, in _execute_and_fetch_as_iterator
    self._handle_error(error)
  File "E:\PycharmProjects\delta_lake_test\delta_lake_test\.venv\lib\site-packages\pyspark\sql\connect\client\core.py", line 1503, in _handle_error
    self._handle_rpc_error(error)
  File "E:\PycharmProjects\delta_lake_test\delta_lake_test\.venv\lib\site-packages\pyspark\sql\connect\client\core.py", line 1539, in _handle_rpc_error
    raise convert_exception(info, status.message) from None
pyspark.errors.exceptions.connect.SparkConnectGrpcException: (org.apache.spark.SparkClassNotFoundException) [DATA_SOURCE_NOT_FOUND] Failed to find the data source: delta. Please find packages at `https://spark.apache.org/third-party-projects.html`.

How can I build a connection string to a remote Delta Lake and use it from my host, a server, or other Docker containers?

newfront commented 1 week ago

The release of Delta support for Spark Connect is coming with Delta 4.0.

https://delta.io/blog/delta-lake-4-0/#delta-connect-available-in-preview

I hope this helps.
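
In the meantime, the DATA_SOURCE_NOT_FOUND error above is a server-side classpath problem: the Connect server in the compose file is started without the Delta jars, so the `delta` data source cannot be resolved. A minimal sketch of a revised command for the spark-connect service, assuming the Spark 3.5.1 and Delta 3.2.0 versions already used in the compose file (the two --conf settings are the standard Delta SQL extension and catalog configuration). Plain format("delta") reads and writes should then resolve on the server; the richer DeltaTable API still needs the Delta Connect support mentioned above.

    command: ["/bin/bash", "-c", "/opt/bitnami/spark/sbin/start-connect-server.sh --master spark://spark-master:7077 --packages org.apache.spark:spark-connect_2.12:3.5.1,io.delta:delta-spark_2.12:3.2.0 --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"]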

majidebraa commented 1 week ago


I tried to use the Delta 4.0 preview, but it fails with this error:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/internal/Logging$LogStringContext
    at java.base/java.lang.Class.getDeclaredMethods0(Native Method)
    at java.base/java.lang.Class.privateGetDeclaredMethods(Class.java:3402)
    at java.base/java.lang.Class.getMethodsRecursive(Class.java:3543)
    at java.base/java.lang.Class.getMethod0(Class.java:3529)
    at java.base/java.lang.Class.getMethod(Class.java:2225)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:42)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1029)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.internal.Logging$LogStringContext
    at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
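
This NoClassDefFoundError suggests a version mismatch rather than a packaging problem: org.apache.spark.internal.Logging$LogStringContext exists only in Spark 4.x, so the Delta 4.0 preview jars cannot load on the Spark 3.5 runtime inside the bitnami/spark image used above. Running the preview would presumably require a matching Spark 4.0 preview build on the master, the workers, and the Connect server, along these lines (the image tag is hypothetical; check which tags are actually published):

  spark-master:
    image: bitnami/spark:4.0.0-preview  # hypothetical tag: the Spark version must match the Delta 4.0 preview
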
majidebraa commented 1 week ago

Without Spark Connect, how can I use Delta in Spark with a remote connection string?
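
One conventional route, sketched here under the assumption that the compose file above stays as-is: skip Spark Connect entirely and create a regular PySpark session whose master points at the spark://localhost:7077 endpoint the compose file exposes, with the Delta package and SQL extensions configured on the driver (the pip-installed delta-spark package ships configure_spark_with_delta_pip for exactly this). One caveat: in standalone mode the executors must be able to connect back to the driver, so running this from a host outside the Docker network usually needs spark.driver.host and open driver ports.

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Classic (non-Connect) session: point the driver at the standalone master
# exposed on host port 7077 in the compose file above.
builder = (
    SparkSession.builder
    .appName("delta-remote-test")  # hypothetical app name
    .master("spark://localhost:7077")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

# configure_spark_with_delta_pip adds the io.delta:delta-spark package that
# matches the pip-installed delta-spark version to spark.jars.packages.
spark = configure_spark_with_delta_pip(builder).getOrCreate()

df = spark.range(0, 5)
df.write.format("delta").mode("overwrite").save("/tmp/delta_table")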