apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Can Spark SQL DDL define a primary key now? #8508

Open waywtdcc opened 11 months ago

waywtdcc commented 11 months ago

Query engine

spark

Question

Can Spark SQL DDL define a primary key now, and can attributes also be used?

ConeyLiu commented 11 months ago

You could check the doc here: https://iceberg.apache.org/docs/latest/spark-ddl/#alter-table--set-identifier-fields
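For readers who do not want to follow the link, here is a minimal sketch of the DDL that section describes. The catalog, namespace, table, and column names (`demo.db.users`, `id`, `name`) are hypothetical, and the sketch assumes a Spark session with the Iceberg SQL extensions enabled. Note that this only records `id` as an identifier field in the table schema; it is not an enforced primary key constraint.

```sql
-- Hypothetical table; identifier fields must be NOT NULL columns.
CREATE TABLE demo.db.users (
    id   BIGINT NOT NULL,
    name STRING
) USING iceberg;

-- Mark `id` as the table's identifier field (requires the Iceberg SQL extensions).
ALTER TABLE demo.db.users SET IDENTIFIER FIELDS id;

-- Remove the identifier field again if it is no longer wanted.
ALTER TABLE demo.db.users DROP IDENTIFIER FIELDS id;
```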

jia-zhengwei commented 9 months ago

@ConeyLiu What should I do after setting IDENTIFIER FIELDS? I can still insert duplicate rows with the spark-sql command.

ConeyLiu commented 9 months ago

You should use MERGE INTO to upsert in Spark. https://iceberg.apache.org/docs/latest/spark-writes/#merge-into
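A minimal sketch of such an upsert, assuming a hypothetical Iceberg table `demo.db.users (id, name)` and a hypothetical source view `updates` with the same columns. The match key is whatever the `ON` clause says; it is written out explicitly here and is not derived from any identifier fields set on the table.

```sql
-- Upsert rows from `updates` into the target table, keyed on `id`.
MERGE INTO demo.db.users AS t
USING (SELECT id, name FROM updates) AS s
ON t.id = s.id
WHEN MATCHED THEN
    UPDATE SET t.name = s.name              -- existing key: overwrite the row
WHEN NOT MATCHED THEN
    INSERT (id, name) VALUES (s.id, s.name); -- new key: add the row
```

If `updates` itself may contain several rows with the same `id`, it is probably worth deduplicating the source in the `USING` subquery first, since the merge only compares source rows against the target.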

zhangbutao commented 9 months ago

Hi @ConeyLiu, IMHO MERGE INTO has nothing to do with IDENTIFIER FIELDS, right?

baiyangtx commented 9 months ago

IDENTIFIER FIELDS only works for Flink Streaming upsert.

ConeyLiu commented 9 months ago

Yes, IDENTIFIER FIELDS is mostly used for equality delete files. Right now only Flink has implemented merge-on-read (MOR) with equality delete files.

zhangbutao commented 9 months ago

Got it. It seems that the documentation at https://iceberg.apache.org/docs/latest/spark-ddl/#alter-table--set-identifier-fields could give users the misconception that Spark can use IDENTIFIER FIELDS to deduplicate data.

jia-zhengwei commented 9 months ago

@zhangbutao Yes, it confused me too.

@ConeyLiu @zhangbutao Do you have any suggestions for handling duplicate data with Spark? Yes, I will look at the MERGE INTO suggestion above.

ConeyLiu commented 9 months ago

You should use MERGE INTO if you want to upsert. INSERT INTO appends data; it does not upsert.
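To illustrate the append behavior, here is a small sketch against a hypothetical table `demo.db.users (id, name)`. Setting identifier fields on the table does not change this in Spark, as reported earlier in this thread.

```sql
-- Two plain inserts with the same id: both rows are appended.
INSERT INTO demo.db.users VALUES (1, 'first');
INSERT INTO demo.db.users VALUES (1, 'second');

-- List any ids that now occur more than once.
SELECT id, COUNT(*) AS row_count
FROM demo.db.users
GROUP BY id
HAVING COUNT(*) > 1;
```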

jia-zhengwei commented 9 months ago

Got it, Thanks.