delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

[Question] Are fast parallel writes in Delta Tables on S3 possible? #2531

Open sebastiandaberdaku opened 10 months ago

sebastiandaberdaku commented 10 months ago

Are fast parallel writes in Delta Tables on S3 possible?

Which Delta project/connector is this regarding?

Overview

I have set up a Spark Connect cluster on Kubernetes (Spark v3.5.0) that reads from and writes to a set of Delta tables (using Delta 3.0.0). I am using AWS S3 as storage and AWS Glue as the metastore, and I have configured S3DynamoDBLogStore for concurrent writes.
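For reference, a setup like the one described is usually wired up through a handful of Spark settings. The following is a minimal sketch, not the configuration actually used in this issue: the DynamoDB table name `delta_log` and the region `us-east-1` are placeholders, and the helper names `delta_s3_multi_writer_conf` and `apply_conf` are hypothetical.

```python
from typing import Dict


def delta_s3_multi_writer_conf(ddb_table: str, ddb_region: str) -> Dict[str, str]:
    """Build Spark settings that route s3a:// Delta log writes through S3DynamoDBLogStore.

    ddb_table and ddb_region are placeholders chosen by the caller; the cluster
    also needs the delta-storage-s3-dynamodb artifact on its classpath.
    """
    prefix = "spark.io.delta.storage.S3DynamoDBLogStore.ddb"
    return {
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        # Use the DynamoDB-backed LogStore for every path with the s3a scheme.
        "spark.delta.logStore.s3a.impl": "io.delta.storage.S3DynamoDBLogStore",
        f"{prefix}.tableName": ddb_table,
        f"{prefix}.region": ddb_region,
    }


def apply_conf(builder, conf: Dict[str, str]):
    """Apply each key/value pair to a SparkSession.builder-like object."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Hypothetical values, for illustration only.
conf = delta_s3_multi_writer_conf("delta_log", "us-east-1")
for key, value in conf.items():
    print(f"{key}={value}")
```

With PySpark installed, `apply_conf(SparkSession.builder, conf).getOrCreate()` would start a session with these settings. Every writer to the same table has to use the same LogStore implementation and the same DynamoDB table for the mutual exclusion to hold.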

I have also carefully partitioned the tables, and I am trying to run jobs concurrently, each operating on a different partition. The problem is that performance, in terms of Spark job duration, degrades very quickly as the number of parallel requests increases.

A single job takes about 2 minutes on its own; two concurrent jobs slow down to 3 minutes, and 10 concurrent jobs slow down to 15-18 minutes, even though Spark cluster utilization is quite low. Increasing resources or executors makes virtually no difference.
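To put those numbers in perspective, here is a small back-of-the-envelope calculation. It assumes the reported times are wall-clock times for the whole batch of concurrent jobs, which is one reading of the figures above; the function name `speedup_over_serial` is hypothetical.

```python
def speedup_over_serial(single_job_min: float, n_jobs: int, observed_min: float) -> float:
    """Ratio of running the jobs back to back to the observed wall-clock time.

    A value of 1.0 means concurrency bought nothing; n_jobs means perfect scaling.
    """
    return (single_job_min * n_jobs) / observed_min


# (number of jobs, observed minutes), taken from the figures reported above.
observations = [(2, 3.0), (10, 15.0), (10, 18.0)]

for n_jobs, observed in observations:
    speedup = speedup_over_serial(2.0, n_jobs, observed)
    print(f"{n_jobs} jobs in {observed} min: {speedup:.2f}x over serial (ideal: {n_jobs}x)")
```

Under that assumption, 10 jobs finish only slightly faster than they would if run one after another, which is consistent with the jobs being serialized somewhere rather than limited by cluster resources.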

I believe the slowdown is due to the locking mechanism of the Delta tables.

I even tried S3SingleDriverLogStore, but saw no improvement.

I also tried disabling the snapshotCache (setting spark.databricks.delta.snapshotCache.storageLevel to NONE) to no avail.

Question

Are fast parallel writes in Delta Tables possible at all on S3? Are there other configurations that I can explore that could help me in my quest?

MrPowers commented 4 months ago

Perhaps multi-cluster writes will help here: https://delta.io/blog/2022-05-18-multi-cluster-writes-to-delta-lake-storage-in-s3/

Let me know if you find that blog useful!