Partitions metadata table shows old partitions

gaborkaszab commented 1 year ago

Apache Iceberg version

main (development)

Query engine

Spark

Please describe the bug 🐞

I created a simple partitioned table with an integer partition column 'i'. Inserted one row where i=1. Then updated the table to set i=2. When querying the partitions metadata table I see 2 rows as the old one is also there. Precondition to reproduce this to have a v2 table with merge-on-read update mode.

Steps to reproduce:

create table demo.db.tmptbl (i int) partitioned by (identity(i)) tblproperties('format-version'='2','write.update.mode'='merge-on-read','write.delete.mode'='merge-on-read')

insert into demo.db.tmptbl values (1)

update demo.db.tmptbl set i = 2 where i = 1

select * from demo.db.tmptbl.partitions
+---------+------------+----------+-------+
|partition|record_count|file_count|spec_id|
+---------+------------+----------+-------+
|{2}      |1           |1         |0      |
|{1}      |1           |1         |0      |
+---------+------------+----------+-------+

When I run a delete instead of an update the dropped partition is not present in the output as I expected. delete from demo.db.tmptbl where i = 1

gaborkaszab commented 1 year ago

cc @ajantha-bhat @wypoon @szehon-ho

bondarenko commented 1 year ago

Looks like without USING iceberg you don't create iceberg table and so it doesn't have to have even update support not speaking about partitions table

gaborkaszab commented 1 year ago

Looks like without USING iceberg you don't create iceberg table and so it doesn't have to have even update support not speaking about partitions table

Hey @bondarenko You're saying that you weren't able to repro the issue with the SQL in the description? I definitely had an Iceberg table, was able to update and also to query metadata tables (that are Iceberg specific). There might be a setting to default the CREATE TABLE to Iceberg, I don't know.

gaborkaszab commented 1 year ago

The "old" partitions are present in the output of a partition metadata table query when the table has merge-on-read for updates. This is because for the partition metadata table the delete files are not applied to the data files to see what partitions are remaining. I think this is a bit misleading because now querying the partitions metadata table can have different results depending on copy-on-write or merge-on-read was used.

In order to fix this the delete files should be applied but that would break the idea of metadata table queries being lightweight and only requiring table metadata to produce the output. Regardless, I still think fixing this by applying the delete files would make sense (behind a flag?). Could you please share your thoughts, @szehon-ho @rdblue @danielcweeks ?

szehon-ho commented 1 year ago

What would the algorithm be? If the partition has delete files, try to do a full MOR, and check if records are null? Personally, sounds a bit extreme, I would think a good first step is just add a column for delete_files (It may be easier after my new change in #6365). After all, we do have a partition existing, just its of invalid delete files. Interested to hear others thoughts as well.

gaborkaszab commented 1 year ago

What would the algorithm be? If the partition has delete files, try to do a full MOR, and check if records are null? Personally, sounds a bit extreme, I would think a good first step is just add a column for delete_files (It may be easier after my new change in #6365). After all, we do have a partition existing, just its of invalid delete files. Interested to hear others thoughts as well.

Well, giving this a second (and a third) thought I have to admit that applying delete file on the data files to get the partitions is too heavyweight. I'm wondering if we should document this behaviour somewhere as I remember on Slack there was someone confused about the 'record_count' column of the metadata table not adding up to the same value what count(*) gives.

szehon-ho commented 1 year ago

Yea I admit that is annoying. Maybe just the fact to add delete_files column will help know that perhaps the record_count may change? (As well as documenting of course). But agree there's no good way without applying, to get the real record count.

gaborkaszab commented 1 year ago

Thanks for the feedback, @szehon-ho! About documenting this phenomenon: I found the spark-queries page that talks about the partition metadata table. Here I can add a short remark about the "old" partition rows. I'm wondering if there is another part of the Iceberg docs that are not Spark specific and covers the metadata tables.

szehon-ho commented 1 year ago

Hi yes, sounds good to me. I think we are doing a doc per engine, and so right now I believe it's only there for Spark. Ref some earlier discussions on how to document metadata tables. https://github.com/apache/iceberg/issues/757

szehon-ho commented 1 year ago

Hi @gaborkaszab sorry i was just re-reading this issue and had a question on the use-case, do you know why it doesnt use a metadata delete, to remove the partition without delete-files? ref it should return true here, and then go to a code path where it deletes the manifest entry , rather than write a delete file. https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java#L281

gaborkaszab commented 5 months ago

Sorry, I missed this @szehon-ho more than a year ago :) I haven't went into this to be honest. Maybe pos delete was written instead of a metadata-only operation because this was an UPDATE and not a DELETE that reproed this phenomenon? In fact, I recall when I ran DELETE, the use-case didn't repro this as it was metadata-only op.

apache / iceberg