Open caican00 opened 2 weeks ago
Finally, I still choose to implement Iceberg time travel
by overriding newScanBuilder
, for the following reasons:
Although SparkIcebergTable
extended SparkTable
, it still needs to initialize its member variables, such as snapshotId
or branch
, before it can directly reuse the newScanBuilder
implementation of SparkTable
.
However, initializing snapshotId
or branch
is difficult, not as easy as initializing refreshEagerly
, because it is difficult to determine snapshotId
or branch
is initialized in the real sparkTable. Therefore, it is difficult to selectively initialize snapshotId
or branch
through the super
method.
In this case, users specify the version for time travel, but the version may be snapshotId or branch name. https://github.com/apache/iceberg/blob/2058053b0c6e5b1c7e91fa029162f22d109aafb1/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L195-L199
newScanBuilder
is more convenient and does not introduce too much maintenance burden.Finally, I still choose to implement Iceberg
time travel
by overridingnewScanBuilder
, for the following reasons:
- Although
SparkIcebergTable
extendedSparkTable
, it still needs to initialize its member variables, such assnapshotId
orbranch
, before it can directly reuse thenewScanBuilder
implementation ofSparkTable
.- However, initializing
snapshotId
orbranch
is difficult, not as easy as initializingrefreshEagerly
, because it is difficult to determinesnapshotId
orbranch
is initialized in the real sparkTable. Therefore, it is difficult to selectively initializesnapshotId
orbranch
through thesuper
method.In this case, users specify the version for time travel, but the version may be snapshotId or branch name. https://github.com/apache/iceberg/blob/2058053b0c6e5b1c7e91fa029162f22d109aafb1/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L195-L199
- Therefore, overriding
newScanBuilder
is more convenient and does not introduce too much maintenance burden.
cc @FANNG1
How about Override loadTable(Identifier ident, String version)
and loadTable(Identifier ident, String version)
for GravitinoIcebergCatalog
?
How about Override
loadTable(Identifier ident, String version)
andloadTable(Identifier ident, String version)
forGravitinoIcebergCatalog
?
in this way, we still have the problem of initializing the snapshotId
or branch
when invoking super
method of SparkTable
.
How about Override
loadTable(Identifier ident, String version)
andloadTable(Identifier ident, String version)
forGravitinoIcebergCatalog
?in this way, we still have the problem of initializing the
snapshotId
orbranch
when invokingsuper
method ofSparkTable
.
Let me think a while
fixed conflict, and could you please help review again if you are free? Thank you.
What changes were proposed in this pull request?
Support Iceberg time travel in SQL queries
Why are the changes needed?
supports time travel in SQL queries using
TIMESTAMP AS OF
,FOR SYSTEM_TIME AS OF
orVERSION AS OF
,FOR SYSTEM_VERSION AS OF
clauses.Fix: https://github.com/datastrato/gravitino/issues/3264
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New ITs.