Open asp437 opened 1 year ago
I don't think we capture stats for those (only how many files/records were newly written), but you can run the https://iceberg.apache.org/docs/latest/spark-procedures/#create_changelog_view procedure to see the exact changes between any two operations.
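For reference, a sketch of how the changelog view could be used to count row-level changes. The catalog name, table name, and snapshot IDs below are placeholders; the generated view exposes a `_change_type` column (values such as INSERT, DELETE, UPDATE_BEFORE, UPDATE_AFTER) that can be aggregated:

```sql
-- Create a changelog view covering the snapshot range produced by the MERGE.
-- 'spark_catalog', 'db.target', and the snapshot IDs are placeholders.
CALL spark_catalog.system.create_changelog_view(
  table => 'db.target',
  options => map('start-snapshot-id', '1', 'end-snapshot-id', '2'),
  changelog_view => 'target_changes'
);

-- Count row-level changes by type.
SELECT _change_type, count(*) AS rows_affected
FROM target_changes
GROUP BY _change_type;
```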
Thank you for the reply. I will take a look at this changelog view.
Yeah, I've taken a look at this code, and it looks like it is the only place where such stats can be calculated. Outside MergeRowsExec,
all rows of the affected partitions will be in the iterator.
Do you think it would be possible to add collection of such stats to Iceberg? I'm not familiar with Spark internals and not sure about the details; I'm asking about the idea in general. Maybe I will try to dive deeper if it is possible and could be useful to others.
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Query engine
Spark (SparkSQL)
Question
I use a MERGE INTO SparkSQL query to update values in an Iceberg table via Spark, with a condition in the WHEN MATCHED clause. I want to collect metrics for such queries to track how many rows were updated/inserted/deleted. Is there a way to get the number of rows affected by a MERGE INTO query?

I tried looking at the snapshot information, but it contains the number of rows in the affected Parquet files, and these values are expected to be much higher, e.g. when a query updates half of the rows but, due to the storage configuration, all data files are rewritten. This is a useful metric too, but I also need the number of rows changed based on the query logic, not the physical representation.