delta-io / delta-kernel-rs

A native Delta implementation for integration with any query engine
Apache License 2.0
144 stars 41 forks source link

Add `LogSegmentBuilder` to construct `LogSegment`s #457

Closed OussamaSaoudi-db closed 23 hours ago

OussamaSaoudi-db commented 1 week ago

What changes are proposed in this pull request?

This pull request builds on #438 by creating a LogSegmentBuilder. This moves all logic for building a LogSegment out of Snapshot. This change is made in anticipation of TableChanges, which will represent CDFs and must construct its own LogSegment.

The builder allows you to specify the following:

How was this change tested?

New tests are added to check the following:

  1. Speciying start_version and start_checkpoint results in an error
  2. Non contiguous commit files should result in an error
  3. commit file ordering is correct in ascending and descending cases
  4. Checkpoint files are correctly omitted if with_omit_checkpoint_files is specified.
  5. start and end version work.
codecov[bot] commented 1 week ago

Codecov Report

Attention: Patch coverage is 92.15686% with 36 lines in your changes missing coverage. Please review.

Project coverage is 79.97%. Comparing base (bb1be88) to head (f8457d9).

Files with missing lines Patch % Lines
kernel/src/log_segment.rs 92.47% 24 Missing and 10 partials :warning:
kernel/src/snapshot.rs 71.42% 1 Missing and 1 partial :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #457 +/- ## ========================================== + Coverage 79.60% 79.97% +0.37% ========================================== Files 57 57 Lines 12330 12585 +255 Branches 12330 12585 +255 ========================================== + Hits 9815 10065 +250 - Misses 1987 1993 +6 + Partials 528 527 -1 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

OussamaSaoudi-db commented 5 days ago

@nicklan @scovich @zachschuermann I was working on CDF and needed a way to keep track of the commit version. Currently, LogSegment holds commit_files: Vec<FileMeta>, and checkpoint_files: Vec<FileMeta>. This only holds the path and modification time. It would be really useful to keep all the commit version in ParsedLogFile. Here are some alternatives I considered:

How do you feel about LogSegment having commit_files: Vec<ParsedLogPath> instead of the current commit_files: Vec<FileMeta>? The opportunities I see are the following:

  1. CDF can directly get the commit version from the ParsedLogPath. We're certain this is the right commit number.
  2. The test I flagged above wants to have access to version, but we can't use build because that projects away version info. I think keeping version info and file_type from ParsedLogPath improves the testability of LogSegment and by extension, Snapshot.
  3. If we want to have logging and metrics observability, knowing commit version and file type may be useful. Consider a scenario where we want to monitor performance, and we notice an anomaly when replaying the log. It would be handy to instantly know the version number or file type instead of having to parse through the path name.

The main drawbacks I see is that we're stuck moving around more data, and the LogPathFileType enum may not be FFI friendly. This is also a relatively larger change, so maybe we can defer this and do the alternatives I considered above.

zachschuermann commented 5 days ago

Hey @OussamaSaoudi-db thanks for that message I broadly agree with your ideas and (without too much digging yet) sounds like doing both commit_files and checkpoint_files as vecs of ParsedLogPath makes a lot of sense. Perhaps we can do that as a quick pre-factor to this PR?

Regarding the downsides:

The main drawbacks I see is that we're stuck moving around more data, and the LogPathFileType enum may not be FFI friendly. This is also a relatively larger change, so maybe we can defer this and do the alternatives I considered above.

I doubt the marginal increase in size of the structs themselves will have meaningful impact (or, put another way, whatever impact is likely worth it). Do we expose LogSegments in FFI? right now it's pub(crate) with developer-visibility for pub right?

zachschuermann commented 2 days ago

Oh and note that LogSegmentBuilder is pub(crate) with developer-visibility pub but everything in it is just pub(crate) maybe we just make it all only pub(crate) for now?

zachschuermann commented 23 hours ago

done in #495