apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Two INSERT INTO operations work like upsert #11996

Closed: bithw1 closed this 2 weeks ago

bithw1 commented 3 weeks ago

Hi,

I am using Hudi 0.15.0. In the spark-sql CLI, I do the following: I insert two records with two separate INSERT statements. I expect to end up with two records (with the same id), but only one is left, so it looks like Hudi performs an upsert instead of an insert here.

The default behavior of INSERT INTO is insert, so I don't understand how an upsert happens here:

CREATE TABLE IF NOT EXISTS hudi_cow_19 (
  a INT,
  b INT,
  c INT
) USING hudi
TBLPROPERTIES (
  type = 'cow',
  primaryKey = 'a',
  hoodie.datasource.write.precombine.field = 'c',
  hoodie.merge.allow.duplicate.on.inserts = 'true'
);

insert into hudi_cow_19 select 1, 1, 1;    -- insert the first record
select * from hudi_cow_19;
insert into hudi_cow_19 select 1, 10, 10;  -- insert the second record, with the same key as the first
select * from hudi_cow_19;

KnightChess commented 3 weeks ago

@bithw1 Because you set a precombine field, you can run set hoodie.combine.before.insert = false;
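
In the spark-sql CLI this is a session-level command issued before the INSERT statements; a minimal sketch of the suggestion (which, as the follow-up below shows, does not change the outcome here):

set hoodie.combine.before.insert = false;  -- session-scoped; applies to subsequent writes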

danny0405 commented 3 weeks ago

Eliminate the primary key definition; that is what we call a pk-less table.
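
For illustration, a minimal sketch of the pk-less variant being suggested, reusing the columns from above with the primaryKey and precombine properties removed (the table name hudi_cow_19_pkless is hypothetical):

CREATE TABLE IF NOT EXISTS hudi_cow_19_pkless (
  a INT,
  b INT,
  c INT
) USING hudi
TBLPROPERTIES (
  type = 'cow'
);

-- with no record key there is nothing to merge on, so both rows are kept
insert into hudi_cow_19_pkless select 1, 1, 1;
insert into hudi_cow_19_pkless select 1, 10, 10;
select * from hudi_cow_19_pkless;  -- expected: two rows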

bithw1 commented 3 weeks ago

@bithw1 Because you set a precombine field, you can run set hoodie.combine.before.insert = false;

Thanks, but it doesn't work for me. Actually, hoodie.combine.before.insert is false by default:

  public static final ConfigProperty<String> COMBINE_BEFORE_INSERT = ConfigProperty
      .key("hoodie.combine.before.insert")
      .defaultValue("false")
      .markAdvanced()
      .withDocumentation("When inserted records share same key, controls whether they should be first combined (i.e de-duplicated) before"
          + " writing to storage.");

bithw1 commented 3 weeks ago

Eliminate the primary key definition; that is what we call a pk-less table.

Thanks, I tried and it works for me!

So, can I conclude that with a pk definition and a precombine field, an insert operation will work like an upsert?

KnightChess commented 3 weeks ago

@bithw1 The default value will be modified when the job runs if it is not specified explicitly.

bithw1 commented 3 weeks ago

@bithw1 The default value will be modified when the job runs if it is not specified explicitly.

@KnightChess I have set this option explicitly per your guidance, but I still see the same result (one updated record instead of two records).

KnightChess commented 2 weeks ago

@bithw1 My mistake; hoodie.combine.before.insert controls deduplication of the incoming records within a single write, and your inserts are two separate SQL statements. You can set hoodie.spark.sql.insert.into.operation = insert for a pk table too.
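
A minimal sketch of that workaround against the original table, assuming the same spark-sql session (the set is session-scoped rather than persisted in the table):

set hoodie.spark.sql.insert.into.operation = insert;  -- force plain INSERT semantics for the pk table
insert into hudi_cow_19 select 1, 10, 10;
select * from hudi_cow_19;  -- the new row is kept alongside the existing one instead of replacing it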

bithw1 commented 2 weeks ago

Thanks @KnightChess