Yeah, if there is no record key definition it falls back to a pk-less (primary-key-less) table, in which case the operation is INSERT.
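For context, a minimal Spark SQL sketch of the pk-less case described above; the table and column names are illustrative, not from the thread:

```sql
-- Illustrative table with NO primaryKey in TBLPROPERTIES, i.e. a pk-less table.
CREATE TABLE hudi_events (
  event_id INT,
  payload  STRING,
  ts       BIGINT
) USING hudi;

-- With no record key to merge on, INSERT INTO simply appends rows
-- (plain INSERT semantics); repeating the same values keeps both rows.
INSERT INTO hudi_events VALUES (1, 'a', 100);
INSERT INTO hudi_events VALUES (1, 'a', 100);
```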
Thanks @danny0405 for the reply, but I don't get what you said.
With the two default options hoodie.datasource.insert.dup.policy=none and hoodie.spark.sql.insert.into.operation=insert, what is the behavior when a record key has been defined: insert or upsert?
The operation takes the highest priority when it is set explicitly, so it is still INSERT instead of the default UPSERT.
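To make that concrete, a minimal Spark SQL sketch (with an illustrative table name and schema) of explicitly setting the operation on a keyed table so that INSERT INTO behaves as an upsert again:

```sql
-- Illustrative keyed table: record key 'id', precombine field 'ts'.
CREATE TABLE hudi_users (
  id   INT,
  name STRING,
  ts   BIGINT
) USING hudi
TBLPROPERTIES (
  primaryKey = 'id',
  preCombineField = 'ts'
);

-- Setting the operation explicitly overrides the 0.14+ default (insert),
-- so repeated keys are merged instead of duplicated.
SET hoodie.spark.sql.insert.into.operation = upsert;

INSERT INTO hudi_users VALUES (1, 'alice', 1);
INSERT INTO hudi_users VALUES (1, 'alice_v2', 2);  -- updates the existing row for id = 1
```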
@bithw1 Closing this. Please reopen in case of any more queries.
I looked into the source code of Hudi 0.15 and found that the option hoodie.sql.insert.mode is deprecated. Prior to Hudi 0.14.0, the default behavior of a Spark SQL INSERT INTO statement was an upsert, which does not introduce duplicates. The docs say this option has been replaced by two options: hoodie.spark.sql.insert.into.operation and hoodie.datasource.insert.dup.policy. Per their definitions, the default value of hoodie.spark.sql.insert.into.operation has been changed to insert, and the default value of hoodie.datasource.insert.dup.policy is none.

With the default values of these two options, does that mean the default behavior of the Spark SQL INSERT INTO statement has changed from upsert to insert (which may introduce duplicates)? I am not sure whether I have understood this correctly. If I have, then this is a breaking change: people on older versions rely on Spark SQL INSERT INTO doing an upsert that does not introduce duplicates, but after upgrading to 0.14.0+ the default behavior is an insert that may introduce duplicates.
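For completeness, a sketch of what the 0.14+ defaults imply on a keyed table, and of tightening the dup policy as a guard without switching back to upsert; the table name, schema, and data are illustrative:

```sql
-- Illustrative keyed table.
CREATE TABLE hudi_orders (
  id     INT,
  amount DOUBLE,
  ts     BIGINT
) USING hudi
TBLPROPERTIES (
  primaryKey = 'id',
  preCombineField = 'ts'
);

-- 0.14+ defaults: insert.into.operation = insert, insert.dup.policy = none.
-- Inserting an existing key again may therefore leave two rows for id = 1.
INSERT INTO hudi_orders VALUES (1, 10.0, 1);
INSERT INTO hudi_orders VALUES (1, 20.0, 2);

-- To reject (or silently drop) duplicate keys while keeping the insert operation,
-- the dup policy can be changed from the default 'none':
SET hoodie.datasource.insert.dup.policy = fail;  -- or 'drop'
INSERT INTO hudi_orders VALUES (1, 30.0, 3);     -- fails because id = 1 already exists
```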