Open watfordkcf opened 1 year ago
After digging into Spark's source code, it appears that each structured stream receives an isolated SparkSession, meaning it would be safe to set the idempotency confs within a micro-batch. This is not explicitly called out anywhere obvious in the Spark documentation...
/** Isolated spark session to run the batches with. */
private val sparkSessionForStream = sparkSession.cloneSession()
Which is then passed to:
/**
* Processes any data available between `availableOffsets` and `committedOffsets`.
* @param sparkSessionToRunBatch Isolated [[SparkSession]] to run this batch with.
*/
private def runBatch(sparkSessionToRunBatch: SparkSession): Unit = {
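For anyone who wants to check this empirically rather than rely on reading the source, here is a minimal sketch of a probe. It assumes classic (non-Connect) PySpark, where the foreachBatch function runs on the driver, and that the batch DataFrame's sparkSession property returns the session the batch runs with. The conf key spark.example.isolationProbe and the helper name make_isolation_probe are made up for illustration.

```python
def make_isolation_probe(parent_spark, key="spark.example.isolationProbe"):
    """Build a foreachBatch function that records whether a conf set on the
    batch's session becomes visible on the parent session.

    `key` is a hypothetical, otherwise unused conf key.
    """
    observations = []

    def probe(batch_df, batch_id):
        # Set the marker on the session attached to this micro-batch only.
        batch_df.sparkSession.conf.set(key, str(batch_id))
        observations.append(
            {
                "batch_id": batch_id,
                # Stays None if the batch session is isolated from the parent.
                "parent_value": parent_spark.conf.get(key, None),
            }
        )

    return probe, observations
```

Passing the returned probe to writeStream.foreachBatch(...) on a test stream and inspecting observations after a few batches should show parent_value as None every time if the sessions are isolated. If the marker shows up on the parent, the confs are leaking and the workaround would not be safe.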
+1. Thank you for providing a workaround.
Hi, can I confirm whether it's safe to set the idempotency Spark config inside each foreachBatch function itself?
from pyspark.sql import DataFrame

def x_batch(x_df: DataFrame, batch_id: int) -> None:
    # Use the session attached to this micro-batch, not a global one
    spark = x_df.sparkSession
    spark.conf.set("spark.databricks.delta.write.txnAppId", f"x-{SETTINGS.version.major}")
    spark.conf.set("spark.databricks.delta.write.txnVersion", batch_id)
    # do stuff including a merge

def y_batch(y_df: DataFrame, batch_id: int) -> None:
    spark = y_df.sparkSession
    spark.conf.set("spark.databricks.delta.write.txnAppId", f"y-{SETTINGS.version.major}")
    spark.conf.set("spark.databricks.delta.write.txnVersion", batch_id)
    # do stuff including a merge

x_kafka_topic.writeStream.foreachBatch(x_batch)
y_kafka_topic.writeStream.foreachBatch(y_batch)
Just like how you did it here?
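One way to keep this pattern tidy is to scope the two confs to the merge itself. This is only a sketch: it assumes, as this thread does, that the merge picks up txnAppId and txnVersion from the session conf. The helper names idempotent_txn and make_batch_fn, and the merge_fn callback, are made up for illustration and are not part of Delta or Spark.

```python
from contextlib import contextmanager

APP_ID_KEY = "spark.databricks.delta.write.txnAppId"
VERSION_KEY = "spark.databricks.delta.write.txnVersion"


@contextmanager
def idempotent_txn(spark, app_id, version):
    """Set the idempotent-write confs for the duration of a block, then
    unset them so later writes on the same session are not affected."""
    spark.conf.set(APP_ID_KEY, app_id)
    spark.conf.set(VERSION_KEY, str(version))
    try:
        yield spark
    finally:
        spark.conf.unset(VERSION_KEY)
        spark.conf.unset(APP_ID_KEY)


def make_batch_fn(app_id, merge_fn):
    """Build a foreachBatch function that runs merge_fn(batch_df) with the
    batch id used as the transaction version (merge_fn is hypothetical)."""

    def run_batch(batch_df, batch_id):
        with idempotent_txn(batch_df.sparkSession, app_id, batch_id):
            merge_fn(batch_df)

    return run_batch
```

With this, each stream would pass something like make_batch_fn("x-1", do_x_merge) to writeStream.foreachBatch(...), using a distinct app id per stream. Unsetting the confs in the finally block is a design choice, not a requirement: it just avoids leaving a stale txnVersion on the session if other writes happen outside the helper.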
Question
Which Delta project/connector is this regarding?
Overview
We have some logic that looks like this:
We would like to use idempotent writes for each of these batches. However, because we're using a Delta Merge operation within the foreachBatch, we have to set the idempotency at the SparkSession conf level.
Is this safe? It isn't obvious that it would be:
Motivation
I noticed in the original feature request asking for this that .option(...) would be added to DeltaMergeBuilder, but it is clear from the implementation that it did not get added.
If this is not safe to do, then I would change this question to a feature request and allow options to be set on the DeltaMergeBuilder.
Further details
Spark 3.4.0, Delta 2.4.0
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?