apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0
6.37k stars 2.2k forks source link

Improve `All` Metadata Tables with Snapshot Information #8856

Open RussellSpitzer opened 1 year ago

RussellSpitzer commented 1 year ago

Feature Request / Improvement

Currently all versions of metadata tables have the exact same schema as their not "all" versions. This is actually not very useful if you are attempting to locate the state of a particular entry at a specific time because the snapshot_id always just shows the file's original snapshot.

For example the entries table looks like

scala> spark.sql("SELECT * FROM db.timezoned.entries").show
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
+------+-------------------+---------------+--------------------+--------------------+--------------------+
|status|        snapshot_id|sequence_number|file_sequence_number|           data_file|    readable_metrics|
+------+-------------------+---------------+--------------------+--------------------+--------------------+
|     1|6561920950175488866|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|     1|5535987506380389562|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|     1|2517256618694516958|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|     1|2750236691316126600|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|     1|7179885233531513409|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|

And all_entries looks like

scala> spark.sql("SELECT * FROM db.timezoned.all_entries").show
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
+------+-------------------+---------------+--------------------+--------------------+--------------------+
|status|        snapshot_id|sequence_number|file_sequence_number|           data_file|    readable_metrics|
+------+-------------------+---------------+--------------------+--------------------+--------------------+
|     1|2517256618694516958|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|     1|6561920950175488866|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|     1|2750236691316126600|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|     1|7179885233531513409|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|     1|5535987506380389562|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|     0|6561920950175488866|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|     0|5535987506380389562|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|     0|2517256618694516958|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|     0|2750236691316126600|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|     0|7179885233531513409|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
+------+-------------------+---------------+--------------------+--------------------+--------------------+

By looking at all_entries it is impossible for me to determine whether or not all the manifests were rewritten at once, or if they were rewritten in groups. Ideally we would see something like

+-------------------+-------------+------+-------------------+---------------+--------------------+--------------------+--------------------+
|              as_of|         time|status|        snapshot_id|sequence_number|file_sequence_number|           data_file|    readable_metrics|
+-------------------+-------------+------+-------------------+---------------+--------------------+--------------------+--------------------+
|7179885233531513409|1697493267302|     1|7179885233531513409|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|2750236691316126600|1697493268363|     1|2750236691316126600|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|2750236691316126600|1697493268363|     1|7179885233531513409|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|2517256618694516958|1697493269568|     1|2517256618694516958|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|2517256618694516958|1697493269568|     1|2750236691316126600|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|2517256618694516958|1697493269568|     1|7179885233531513409|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|5535987506380389562|1697493270419|     1|5535987506380389562|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|5535987506380389562|1697493270419|     1|2517256618694516958|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|5535987506380389562|1697493270419|     1|2750236691316126600|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|5535987506380389562|1697493270419|     1|7179885233531513409|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|6561920950175488866|1697493271193|     1|6561920950175488866|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|6561920950175488866|1697493271193|     1|5535987506380389562|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|6561920950175488866|1697493271193|     1|2517256618694516958|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|6561920950175488866|1697493271193|     1|2750236691316126600|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|6561920950175488866|1697493271193|     1|7179885233531513409|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|2315221993819944328|1697497673027|     1|6561920950175488866|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|2315221993819944328|1697497673027|     1|5535987506380389562|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|2315221993819944328|1697497673027|     1|2517256618694516958|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|2315221993819944328|1697497673027|     1|2750236691316126600|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
|2315221993819944328|1697497673027|     1|7179885233531513409|              0|                   0|{0, /Users/russel...|{{51, 1, 0, null,...|
+-------------------+-------------+------+-------------------+---------------+--------------------+--------------------+--------------------+

Apologies that my code is not getting status status correctly but basically the idea being that we add on two columns (names are still up for debate) "as_of_snapshot", "as_of_time"

These would allow us to be able to analyze the actual history in all_entires and all_manifests

Query engine

None

nk1506 commented 11 months ago

@RussellSpitzer , Is it fine, If I will continue on this ? If so Please assign this to me.

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

manuzhang commented 3 weeks ago

There is an in-progress PR #9335 with activity as recent as last week.