Closed — osopardo1 closed this issue 6 months ago
The answer to the first question is negative: Qbeast does not use these parameters. The next question is what contract we should implement. Delta does the following: `txnAppId` and `txnVersion` can be specified in the write options or even in the Spark session, and the session-scoped values are used as a fallback. The problem with Delta's approach is that reusing an old transaction identifier has side effects; moreover, it is not clear why one would need to specify these values at the session level. The role of checkpoints should also be clarified. Overall, the Delta approach seems unclear and complicated.
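The fallback behaviour described above can be illustrated with a small sketch. This is pure Python, not Delta's actual implementation; the function name and the session configuration keys are illustrative assumptions:

```python
def resolve_txn_identifiers(write_options, session_conf):
    """Illustrative sketch of the documented resolution order:
    per-write options take precedence; session-scoped values
    are only used as a fallback when the write option is absent."""
    # hypothetical session keys, shown only to illustrate the fallback
    app_id = write_options.get(
        "txnAppId", session_conf.get("spark.databricks.delta.write.txnAppId"))
    version = write_options.get(
        "txnVersion", session_conf.get("spark.databricks.delta.write.txnVersion"))
    return app_id, version
```

With both sources populated, the write option wins; with empty write options, the session value is picked up.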
I suggest implementing something simpler, like the following:
If we agree on having a simpler contract, then I am not sure if it is a good idea to use the same option names, because their semantics will be slightly different. @osopardo1 @cugni what do you think?
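One way such a simpler contract could look is sketched below. This is a hypothetical illustration, not the proposal from the thread: both identifiers are mandatory per write (no session-level fallback), and a write is silently skipped when its version does not advance past the last committed one for that application ID:

```python
class IdempotentWriter:
    """Hypothetical sketch: each write carries an (app_id, version) pair,
    and a replayed write (version <= last committed for that app_id)
    is skipped instead of being committed twice."""

    def __init__(self):
        self._last_committed = {}  # app_id -> highest committed version

    def write(self, data, app_id, version):
        last = self._last_committed.get(app_id)
        if last is not None and version <= last:
            return False  # duplicate replay: skip
        self._commit(data)
        self._last_committed[app_id] = version
        return True

    def _commit(self, data):
        pass  # stand-in for the real table write
```

The contract stays explicit: there is no hidden session state, and a skipped write is observable through the return value.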
I agree with the first three steps, although I am unfamiliar with the side effects of ignoring the SparkSession.
Checkpointing should be studied, because actual users need it.
The feature is implemented for main-1.0.0.
From the Delta Lake documentation, we notice that:

Delta tables support the following `DataFrameWriter` options to make writes to multiple tables within `foreachBatch` idempotent:

- `txnAppId`: a unique string that you can pass on each DataFrame write. For example, you can use the StreamingQuery ID as `txnAppId`.
- `txnVersion`: a monotonically increasing number that acts as a transaction version.

The user can pass these commit options from the batch processing of a stream:
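Since the original snippet is not reproduced here, the documented pattern — the streaming query ID as `txnAppId` and the batch ID as `txnVersion` — can be simulated without Spark. The `FakeTable` class below is purely illustrative, a stand-in for a table that dedupes commits:

```python
class FakeTable:
    """Minimal stand-in for a table that dedupes on (txnAppId, txnVersion)."""

    def __init__(self):
        self.rows = []
        self._seen = {}  # txnAppId -> highest txnVersion applied

    def write(self, rows, txn_app_id, txn_version):
        if self._seen.get(txn_app_id, -1) >= txn_version:
            return  # batch already applied: the retry becomes a no-op
        self.rows.extend(rows)
        self._seen[txn_app_id] = txn_version


# foreachBatch hands each micro-batch a (dataframe, batch_id) pair; on a
# retry the same batch_id is replayed, and the dedupe above drops it.
query_id = "query-1"  # in Spark: the StreamingQuery ID
table = FakeTable()
for batch_id, batch in [(0, ["a"]), (1, ["b"]), (1, ["b"]), (2, ["c"])]:
    table.write(batch, txn_app_id=query_id, txn_version=batch_id)
```

The replayed batch `1` is dropped, so the table ends up with each row exactly once.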
Overall, I think we should:

(The `DeltaOptions` passed to the `SparkDeltaMetadataWriter` only include `path`, which also makes us think we have a bug in the way we treat `rearrangeOnly` or other configuration/writing parameters.)
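A hedged sketch of the fix hinted at here — forwarding the caller's full option map to the metadata writer instead of retaining only the path. All names are hypothetical; this is not the Qbeast codebase:

```python
def build_writer_options(user_options):
    """Hypothetical: keep every user-supplied write option
    (txnAppId, txnVersion, rearrangeOnly, ...) rather than only
    'path', so downstream metadata writers see the full configuration."""
    required = {"path"}
    missing = required - user_options.keys()
    if missing:
        raise ValueError(f"missing required options: {sorted(missing)}")
    return dict(user_options)  # forward everything, not just 'path'
```

The point of the sketch is the design choice: dropping unknown options on the floor is what hides bugs like the `rearrangeOnly` one suspected above.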