I’m looking for guidance on the most efficient solution for maintaining full history and querying the latest versions of events without maintaining redundant copies of the data.
The use case is to be able to query either:
All versions of an event (e.g., SELECT * FROM full_hist WHERE id = 'XXX')
Only the latest version of an event (e.g., SELECT * FROM latest_slice WHERE id = 'XXX')
The latest version is determined by the maximum value in a version field for each id.
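To make "latest version" concrete, here is a minimal sketch of the view-based option. It makes several assumptions: SQLite (through Python's sqlite3 module) stands in for Trino so the example can be run locally, the payload column and the sample rows are hypothetical, and the view uses a standard ROW_NUMBER() window function. The real tables would be loaded by Spark and may have a different schema.

```python
import sqlite3

# In-memory SQLite database as a local stand-in for the real engine (assumption).
# Window functions require SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE full_hist (id TEXT, version INTEGER, payload TEXT);

-- Hypothetical sample data: three versions of event A, one of event B.
INSERT INTO full_hist VALUES
  ('A', 1, 'a-v1'), ('A', 2, 'a-v2'), ('A', 3, 'a-v3'),
  ('B', 1, 'b-v1');

-- View that keeps, for each id, only the row with the highest version.
CREATE VIEW latest_slice AS
SELECT id, version, payload
FROM (
  SELECT id, version, payload,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY version DESC) AS rn
  FROM full_hist
) AS ranked
WHERE rn = 1;
""")

# All versions of an event.
all_versions = conn.execute(
    "SELECT id, version, payload FROM full_hist WHERE id = 'A' ORDER BY version"
).fetchall()

# Only the latest version of the same event.
latest = conn.execute(
    "SELECT id, version, payload FROM latest_slice WHERE id = 'A'"
).fetchall()

print(all_versions)
print(latest)
```

With this shape only one copy of the data is stored, and the cost of finding the latest version is paid at query time.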
Questions
Is it better to maintain:
A full history table that is periodically deduplicated into a separate latest_slice table?
Or a single full history table with a view that computes the latest versions dynamically?
If the latter, does applying optimization techniques like partitioning, sorting, and ordering on the full history table significantly improve performance for querying the latest versions?
Given the preference to store only one copy of the data, what is the most performant and practical solution for this scenario?
Query engine
Spark for loading, Trino for querying
Thank you for your guidance!