apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0
6.15k stars 2.14k forks source link

MERGE INTO number of affected rows #8229

Open asp437 opened 1 year ago

asp437 commented 1 year ago

Query engine

Spark (SparkSQL)

Question

I use MERGE INTO SparkSQL query to update values in Iceberg table via Spark with some condition in WHEN MATCH clause. And I want to collect metrics of such queries to track how many rows were updated/inserted/deleted.

Is there a way to get a number of rows affected by MERGE INTO query?

I tried to look on snapshot information but it contains number of rows in affected PARQUET files. And this values are expected to be much higher e.g. in case of query updating half of rows, but due to storage configuration all data files were rewritten. This is useful metric too, but I also need a number of rows changed based on query logic, not physical representation.

szehon-ho commented 1 year ago

I dont think we capture stats for those (only how much newly written files/records), but you can run https://iceberg.apache.org/docs/latest/spark-procedures/#create_changelog_view procedure to see exact changes between any two operations

asp437 commented 1 year ago

Thank you for reply. Will take a look on this changelog view.

Yeah, I've took a look on this code and looks like it is the only place such stats can be calculated. Outside MergeRowsExec all rows of affected partitions will be in iterator.

Do you think it is possible to add collecting such stats into the Iceberg? I'm not familiar with Spark internal and not sure about the details, asking about the idea in general. Maybe I will try to dive deeper in case it is possible and can be used by others.

github-actions[bot] commented 1 day ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.