Open fienlag opened 8 months ago
@bryanck it seems the issue is caused by no rewrite_manifests operation being run after commits: https://iceberg.apache.org/docs/latest/spark-procedures/#rewrite_manifests. Could you say whether adding such a step would resolve the issue?
Once another table is created from this table, for example with
create new_table as select * from iceberg_table
the new table can be queried with normal performance.
@fienlag, check out Iceberg table maintenance: rewrite data files and rewrite manifests (the "small files problem"). You can expire unnecessary snapshots as well.
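For reference, here is a minimal sketch of those three maintenance steps as Iceberg Spark procedures. The catalog name `my_catalog` and the cutoff timestamp are placeholders, not values from this thread, and the calls need a Spark session with the Iceberg SQL extensions enabled; they do not run in Athena.

```sql
-- Assumption: `my_catalog` is a placeholder for the Spark catalog that holds the table.

-- Compact small data files into larger ones.
CALL my_catalog.system.rewrite_data_files(table => 'db.iceberg_table');

-- Rewrite manifests so that scan planning reads fewer, better-organized manifest files.
CALL my_catalog.system.rewrite_manifests('db.iceberg_table');

-- Expire old snapshots; the timestamp below is only an example cutoff.
CALL my_catalog.system.expire_snapshots(
  table => 'db.iceberg_table',
  older_than => TIMESTAMP '2024-01-01 00:00:00.000',
  retain_last => 2
);
```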
Hi everyone!
I've created an Iceberg Kafka connector that consumes Avro data and loads it into AWS S3, with automatic creation of the Athena table. The problem is that querying it from Athena takes an abnormal amount of time.
In my case:
- data/ prefix
- data/ prefix: 2.4 GB (there are actually fewer data files holding my Kafka data, because some files contain only sets of identifiers and some contain paths)
- metadata/ prefix: ~170 files (~100 metadata.json files; the others are .avro files)

For example, I tried to run the query
select count(distinct id) from db.iceberg_table
and it takes 18-20 minutes to execute. As far as I can see, my connector loads data hourly, so it creates a new snapshot every time. Therefore, I manually set the following property on my table, 'vacuum_max_snapshot_age_seconds' = '7200' (2 hours), and ran
VACUUM db.iceberg_table
and
OPTIMIZE db.iceberg_table REWRITE DATA USING BIN_PACK
These queries decreased the snapshot count to 2 and the data file count/size to the numbers I mentioned above. After this optimization, the query execution time did not decrease.

Here is connector config:
Here is table DDL:
I hope you can help me understand why this is happening and how I can fix this issue with long query execution. Thank you :)
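One way to check what VACUUM and OPTIMIZE actually left behind is to query the Iceberg metadata tables from Athena. The sketch below assumes the `$files`, `$manifests` and `$snapshots` metadata tables are available for this table in the Athena engine version in use; the column aliases are illustrative.

```sql
-- Number and total size of the data files in the current snapshot.
SELECT count(*) AS file_count,
       sum(file_size_in_bytes) AS total_bytes
FROM "db"."iceberg_table$files";

-- Number of manifest files the planner has to read.
SELECT count(*) AS manifest_count
FROM "db"."iceberg_table$manifests";

-- Number of snapshots still retained.
SELECT count(*) AS snapshot_count
FROM "db"."iceberg_table$snapshots";
```

If the file or manifest counts are much higher than the S3 listing suggests, that would point at the metadata layer rather than at the data volume.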