Spark structured streaming schema evolution

richiesgr commented 1 year ago

Hi @kevinwallimann First I'm aware of the issue

I used abris to stream avro to delta table. As mentioned in the issue the schema evolution is not working neither using streaming or foreachbatch.

However using this code I'm able to update the schema for each message at least in the source the problem is that is never reflected on the sink (can be easily adapted to use abris by the way)

You said spark can't handle this out of the box I confirm but do you've any idea how to implement it ? Thanks

Orpheuz commented 1 year ago

Hey @richiesgr, the code I used in my examples is really similar to ABRiS, I think it might have some enhancements regarding Enum types but overall it should be very similar. I also have no idea how to develop a workaround but it should be around restarting/replanning the streaming job.

kevinwallimann commented 1 year ago

Hi @richiesgr Thanks for your question. In #176, I wrote that schema evolution is not possible during query execution, however that does not mean that schema evolution is generally impossible. However, you have to stop your query and restart it, and thus let Spark generate a new execution plan with the new schema. This is what we usually do to support schema evolution.

What I meant in #176 with making Spark change its execution plan even during a long-running Structured Streaming query, is based on the realization that in micro batch mode, Spark actually creates a series of query executions which all get a new instance of the execution plan, see here: https://github.com/apache/spark/blob/v3.3.2/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L647 You should also be able to see execution plans per micro batch in the Spark UI. So theoretically, maybe that execution plan could be changed across micro batches. As this is in no way supported by Spark, basically you'd have to fork Spark, reimplement this part and maintain it for future versions of Spark. Even if you succeed doing this, there is no way to control at which message exactly to create a new Microbatch to change the execution plan, so you'd have to somehow reimplement this part, too.

So basically, you'd have to implement your own fork of Spark to support in-flight schema evolution without stopping and restarting the Spark query

AbsaOSS / ABRiS

Spark structured streaming schema evolution #319