imonteroq opened this issue 8 months ago
@imonteroq You are correct. I was able to reproduce this. Here is a full reproducible example - https://gist.github.com/ad1happy2go/db5813b8bd8d5f7142c2f9b0b2f29922
Created a tracking JIRA to fix this - https://issues.apache.org/jira/browse/HUDI-7319
Thank you @ad1happy2go. One question while we are at it: will Hudi ever make use of Spark's checkpointing to carry out managed incremental ingestion? So far it seems from the documentation that a commit timestamp should be provided.
Sorry, can you elaborate on that?
Yes, Spark streaming should use the checkpointing in order to resume the stream. Please clarify in case I am missing anything here.
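To illustrate, here is a minimal sketch of the kind of pipeline being discussed, assuming PySpark; the paths, table name, and key columns are hypothetical. The checkpointLocation on the writeStream is what should let a restarted query resume instead of starting over.

```python
# Minimal sketch (paths, table name, and key columns are hypothetical):
# a streaming read from one Hudi table written into another, with a
# checkpointLocation so that restarting the query resumes from the checkpoint.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

df = spark.readStream.format("hudi").load("/tmp/hudi/source_table")

query = (
    df.writeStream
    .format("hudi")
    .option("hoodie.table.name", "target_table")
    .option("hoodie.datasource.write.recordkey.field", "id")   # assumed key column
    .option("hoodie.datasource.write.precombine.field", "ts")  # assumed precombine column
    .option("hoodie.datasource.write.keygenerator.class",
            "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
    .option("checkpointLocation", "/tmp/checkpoints/target_table")  # drives resume-on-restart
    .outputMode("append")
    .start("/tmp/hudi/target_table")
)
```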
Apologies, my bad, I omitted part of my question. I meant incremental ingestion from another Hudi table using Spark Structured Streaming. I've tested this functionality and it always loads the full table unless I pass in a commit timestamp. I can raise another issue to make my point better.
@imonteroq Ideally it should use the checkpoint details, although I am not very sure; one reason may be the cleaner configuration. Are you setting the config below?
https://hudi.apache.org/docs/configurations/#hoodiedatasourcereadincrfallbackfulltablescanenable
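For reference, a hedged sketch of setting that config on the streaming read (the source path is hypothetical). If I read the docs right, when the fallback is enabled an incremental read whose requested commits have already been cleaned can silently turn into a full table scan, so setting it explicitly to false makes that failure mode visible.

```python
# Hedged sketch: explicitly disable the fallback-to-full-table-scan behaviour
# on the streaming read (source path is a hypothetical stand-in).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.readStream
    .format("hudi")
    .option("hoodie.datasource.read.incr.fallback.fulltablescan.enable", "false")
    .load("/tmp/hudi/source_table")
)
```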
@imonteroq Any updates on this?
Describe the problem you faced

Streaming in Spark from a Hudi table fails with the error below when a writeStream process has created / written to the table with the schema evolution settings hoodie.schema.on.read.enable & hoodie.datasource.write.reconcile.schema on. I have not been able to upsert a source schema containing either more or fewer columns than the target schema without these two settings enabled.

org.apache.spark.sql.AnalysisException: [COLUMN_ALREADY_EXISTS] The column _hoodie_commit_seqno already exists. Consider to choose another name or rename the existing column.
To Reproduce
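A hedged sketch of the setup, assuming PySpark; the full reproducible example is in the gist linked above, and the paths, columns, and rate source below are stand-ins of my own.

```python
# Hedged sketch: a streaming write with both schema evolution settings on,
# followed by a streaming read of the same table, which hits the
# COLUMN_ALREADY_EXISTS error on _hoodie_commit_seqno.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (
    SparkSession.builder
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

table_path = "/tmp/hudi/evolved_table"  # hypothetical table path

# 1. Create / write the table from a writeStream with the two settings enabled.
src = (
    spark.readStream.format("rate").option("rowsPerSecond", 10).load()
    .withColumn("id", col("value"))
    .withColumn("ts", col("timestamp").cast("long"))
)

write_query = (
    src.writeStream
    .format("hudi")
    .option("hoodie.table.name", "evolved_table")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.keygenerator.class",
            "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
    .option("hoodie.schema.on.read.enable", "true")
    .option("hoodie.datasource.write.reconcile.schema", "true")
    .option("checkpointLocation", "/tmp/checkpoints/evolved_table")
    .outputMode("append")
    .start(table_path)
)
write_query.processAllAvailable()
write_query.stop()

# 2. Streaming back out of the same table then fails with
#    org.apache.spark.sql.AnalysisException: [COLUMN_ALREADY_EXISTS] ... _hoodie_commit_seqno
read_query = (
    spark.readStream.format("hudi").load(table_path)
    .writeStream.format("console")
    .option("checkpointLocation", "/tmp/checkpoints/console_sink")
    .start()
)
read_query.processAllAvailable()
```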
Environment Description
OS: Mac OS X
Hudi version: 0.14.0
Spark version: 3.4.1
Storage: S3 (LocalStack)
Running on Docker?: No
Additional context
This works fine with either hoodie.schema.on.read.enable or hoodie.datasource.write.reconcile.schema disabled.

Stacktrace