Netflix / iceberg

Iceberg is a table format for large, slow-moving tabular data
Apache License 2.0

Reading Deltas via Metadata #48

Open omervk opened 5 years ago

omervk commented 5 years ago

(this is very similar in philosophy to #47 and it would be good to read that before this, same caveats applying)

A job that wants to read only data added since its last run must track what the high-water mark was and read newer data from its source using a predicate. For instance:

val newData = spark
  .read
  /* ... */
  .filter($"day" === "2018-08-05")

However, we can instead base our reads on the accumulation of snapshots over time: if our snapshots are S1, S2, S3 and S4, and the last snapshot we processed was S1, we can read the new data from S2, S3 and S4 and skip the filtering entirely. This would essentially make our high-water mark metadata-based, rather than data-based.
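The bookkeeping this implies is small: persist the last snapshot ID the job processed, and on the next run take every snapshot committed after it. A minimal sketch of that logic in plain Scala (snapshotsSince is a hypothetical helper, not an Iceberg API; it assumes snapshot IDs are listed in commit order):

```scala
// Hypothetical helper: given a table's snapshot IDs in commit order and the
// last snapshot this job processed, the "new" snapshots are everything after it.
// If the last processed snapshot is no longer listed (e.g. it expired), this
// sketch falls back to reading everything; a real job might fail instead.
def snapshotsSince(allSnapshots: Seq[Long], lastProcessed: Long): Seq[Long] = {
  val idx = allSnapshots.indexOf(lastProcessed)
  if (idx < 0) allSnapshots else allSnapshots.drop(idx + 1)
}
```

With snapshots S1..S4 and S1 as the stored high-water mark, this yields S2, S3 and S4, which is exactly the set the proposed read API would be handed.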

This can already be achieved using the low-level Iceberg API, but not using the Spark API; exposing it there would be a great addition to the project.

Here's a sketch of how this API might look:

spark
  .read
  .format("iceberg")
  .snapshots(2, 3, 4)
  .load(path)

Note: Specifying the list of snapshots would also let this API support other use cases, such as parallel processing of snapshots.

rdblue commented 5 years ago

This can be done fairly easily by adding key-value properties when reading with Spark. We plan to do this to implement AS OF SYSTEM TIME SQL statements as well. You'd pass the extra information and, in the IcebergSource, use it to customize the scan.
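Concretely, the key-value route means the caller sets DataSource options and the IcebergSource reads them back out of the options map to configure the scan. A sketch of the source-side parsing, assuming hypothetical option names (start-snapshot-id / end-snapshot-id are illustrative, not a committed API):

```scala
// Hypothetical sketch: parse a snapshot range out of the key-value options a
// Spark reader would pass (e.g. .option("start-snapshot-id", "2")).
// Option names are assumptions, not the final API.
def snapshotRange(options: Map[String, String]): Option[(Long, Long)] =
  for {
    start <- options.get("start-snapshot-id")
    end   <- options.get("end-snapshot-id")
  } yield (start.toLong, end.toLong)
```

When both options are present the source would build an incremental scan over that range; when they are absent it would fall back to a normal full-table scan.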

Scans don't currently support filtering by what is "new" in a snapshot (or multiple snapshots) but that should be easy to add by extending the TableScan interface and the underlying BaseTableScan implementation.
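The scan-side extension described above could be sketched like this, using stand-in types rather than Iceberg's actual TableScan/BaseTableScan classes (SnapshotInfo, filesAppendedBetween and the exclusive-from/inclusive-to convention are all assumptions for illustration):

```scala
// Hypothetical model of the incremental-scan logic: restrict a scan to the
// files appended by snapshots after fromId, up to and including toId.
// Snapshots are assumed to be in commit order.
case class SnapshotInfo(id: Long, addedFiles: Seq[String])

def filesAppendedBetween(snapshots: Seq[SnapshotInfo],
                         fromId: Long, toId: Long): Seq[String] = {
  // Everything committed after fromId...
  val afterFrom = snapshots.dropWhile(_.id != fromId).drop(1)
  // ...truncated at toId (inclusive); an unknown toId yields an empty range.
  val inRange = afterFrom.take(afterFrom.indexWhere(_.id == toId) + 1)
  inRange.flatMap(_.addedFiles)
}
```

In a real implementation this selection would live behind a TableScan method so that planning, filtering, and task creation in BaseTableScan stay unchanged and only the set of candidate files shrinks.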