Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Best Practices for Storing and Querying Full History and Latest Versions #11618

Open Selinix opened 3 days ago

Selinix commented 3 days ago

Query engine

Spark for loading, Trino for querying

Question

Hi,

I’m looking for guidance on the most efficient solution for maintaining full history and querying the latest versions of events without maintaining redundant copies of the data.

The use case is to query either:

  1. All versions of an event (e.g., SELECT * FROM full_hist WHERE id = 'XXX')
  2. Only the latest version of an event (e.g., SELECT * FROM latest_slice WHERE id = 'XXX')

The latest version is determined by the maximum value in a version field for each id.
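
To make the definition of "latest" concrete, here is a minimal sketch of the view-based variant. It uses Python's built-in sqlite3 purely as a stand-in for Trino, so it says nothing about Iceberg performance. The `payload` column and the sample rows are made up for illustration; only `id`, `version`, `full_hist`, and `latest_slice` come from the description above.

```python
import sqlite3


def latest_versions(rows):
    """Return, for each id, the row carrying the highest version."""
    conn = sqlite3.connect(":memory:")
    # full_hist holds every version of every event (payload is hypothetical).
    conn.execute("CREATE TABLE full_hist (id TEXT, version INTEGER, payload TEXT)")
    conn.executemany("INSERT INTO full_hist VALUES (?, ?, ?)", rows)
    # latest_slice is a view, not a second copy of the data: it joins each
    # row against the maximum version recorded for its id.
    conn.execute(
        """
        CREATE VIEW latest_slice AS
        SELECT h.id, h.version, h.payload
        FROM full_hist AS h
        JOIN (SELECT id, MAX(version) AS max_version
              FROM full_hist
              GROUP BY id) AS m
          ON h.id = m.id AND h.version = m.max_version
        """
    )
    result = conn.execute(
        "SELECT id, version, payload FROM latest_slice ORDER BY id"
    ).fetchall()
    conn.close()
    return result


rows = [("A", 1, "created"), ("A", 2, "updated"), ("B", 1, "created")]
print(latest_versions(rows))  # [('A', 2, 'updated'), ('B', 1, 'created')]
```

The same join (or an equivalent `ROW_NUMBER()` window) is roughly what I would expect the view to look like in Trino, assuming `version` is unique per `id`.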

Questions

  1. Is it better to maintain:
    • A full history table that is periodically deduplicated into a separate latest_slice table?
    • Or a single full history table with a view that computes the latest versions dynamically?
  2. If the latter, does applying optimization techniques like partitioning, sorting, and ordering on the full history table significantly improve performance for querying the latest versions?
  3. Given the preference to store only one copy of the data, what is the most performant and practical solution for this scenario?

Thank you for your guidance!