Closed OussamaSaoudi-db closed 23 hours ago
Attention: Patch coverage is 92.15686% with 36 lines in your changes missing coverage. Please review.
Project coverage is 79.97%. Comparing base (bb1be88) to head (f8457d9).
Files with missing lines | Patch % | Lines |
---|---|---|
kernel/src/log_segment.rs | 92.47% | 24 Missing and 10 partials :warning: |
kernel/src/snapshot.rs | 71.42% | 1 Missing and 1 partial :warning: |
@nicklan @scovich @zachschuermann I was working on CDF and needed a way to keep track of the commit version. Currently, `LogSegment` holds `commit_files: Vec<FileMeta>` and `checkpoint_files: Vec<FileMeta>`. These only hold the path and modification time. It would be really useful to keep all the commit versions in `ParsedLogFile`. Here are some alternatives I considered:
- `commit_files.iter().zip(start_version..=end_version)` to keep track of versions when iterating over files. I feel like this could be error prone, but nothing some checks wouldn't fix.
- `commit_file.path`. This feels icky.

How do you feel about `LogSegment` having `commit_files: Vec<ParsedLogPath>` instead of the current `commit_files: Vec<FileMeta>`? The opportunities I see are the following:
- `ParsedLogPath`. We're certain this is the right commit number.
- `version`, but we can't use `build` because that projects away version info. I think keeping version info and `file_type` from `ParsedLogPath` improves the testability of `LogSegment` and, by extension, `Snapshot`.

The main drawbacks I see are that we're stuck moving around more data, and the `LogPathFileType` enum may not be FFI friendly. This is also a relatively larger change, so maybe we can defer this and do the alternatives I considered above.
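To make the trade-off concrete, here is a minimal sketch of the zip-based alternative next to the proposal of carrying the version with each file. The `FileMeta` and `ParsedLogPath` structs below are simplified, hypothetical stand-ins and not the kernel's actual definitions; `parse_commit_version`, `zip_with_versions`, and `parse_all` are illustrative names. The sketch only assumes that commit files are named with a 20-digit zero-padded version followed by `.json`.

```rust
/// Hypothetical stand-in for the kernel's `FileMeta`: path and modification time only.
#[derive(Debug, Clone, PartialEq)]
struct FileMeta {
    path: String,
    last_modified: i64,
}

/// Hypothetical stand-in for `ParsedLogPath`: the file plus its parsed version.
#[derive(Debug, Clone, PartialEq)]
struct ParsedLogPath {
    location: FileMeta,
    version: u64,
}

/// Parse the version from a commit file name such as
/// `_delta_log/00000000000000000002.json`. Returns `None` for anything else.
fn parse_commit_version(path: &str) -> Option<u64> {
    let file_name = path.rsplit('/').next()?;
    let stem = file_name.strip_suffix(".json")?;
    if stem.len() != 20 || !stem.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    stem.parse().ok()
}

/// Alternative: pair files with versions by position, with the extra checks
/// needed to make the pairing trustworthy.
fn zip_with_versions(
    commit_files: &[FileMeta],
    start_version: u64,
    end_version: u64,
) -> Result<Vec<(u64, &FileMeta)>, String> {
    if end_version < start_version {
        return Err("end_version is before start_version".to_string());
    }
    let expected = end_version - start_version + 1;
    if commit_files.len() as u64 != expected {
        return Err(format!(
            "expected {expected} commit files, found {}",
            commit_files.len()
        ));
    }
    (start_version..=end_version)
        .zip(commit_files)
        .map(|(version, file)| match parse_commit_version(&file.path) {
            Some(parsed) if parsed == version => Ok((version, file)),
            _ => Err(format!("{} is not commit version {version}", file.path)),
        })
        .collect()
}

/// Proposal: parse once, then carry the version alongside the file.
fn parse_all(commit_files: Vec<FileMeta>) -> Option<Vec<ParsedLogPath>> {
    commit_files
        .into_iter()
        .map(|location| {
            let version = parse_commit_version(&location.path)?;
            Some(ParsedLogPath { location, version })
        })
        .collect()
}
```

In this sketch the zip approach ends up re-parsing the path anyway in order to validate the pairing, which is part of why keeping the parsed version around looks attractive.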
Hey @OussamaSaoudi-db, thanks for that message! I broadly agree with your ideas, and (without too much digging yet) it sounds like making both `commit_files` and `checkpoint_files` vecs of `ParsedLogPath` makes a lot of sense. Perhaps we can do that as a quick pre-factor to this PR?
Regarding the downsides:
> The main drawbacks I see are that we're stuck moving around more data, and the `LogPathFileType` enum may not be FFI friendly. This is also a relatively larger change, so maybe we can defer this and do the alternatives I considered above.
I doubt the marginal increase in size of the structs themselves will have meaningful impact (or, put another way, whatever impact is likely worth it). Do we expose `LogSegment`s in FFI? Right now it's `pub(crate)` with `developer-visibility` for `pub`, right?
Oh, and note that `LogSegmentBuilder` is `pub(crate)` with developer-visibility `pub`, but everything in it is just `pub(crate)`. Maybe we just make it all only `pub(crate)` for now?
done in #495
What changes are proposed in this pull request?
This pull request builds on #438 by creating a `LogSegmentBuilder`. This moves all logic for building a `LogSegment` out of `Snapshot`. This change is made in anticipation of `TableChanges`, which will represent CDFs and must construct its own `LogSegment`.

The builder allows you to specify the following:
- `LogSegment`.

How was this change tested?
New tests are added to check the following:
- `with_omit_checkpoint_files` is specified.
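
For readers unfamiliar with the pattern, a builder of the kind described here might look roughly like the sketch below. Only the names `LogSegmentBuilder`, `LogSegment`, and `with_omit_checkpoint_files` come from this PR; the `LogFile` struct, the version-range options, the in-memory `listing` argument, and the `build` signature are assumptions made for illustration and do not reflect the kernel's actual API.

```rust
type Version = u64;

/// Hypothetical, simplified log file entry (not the kernel's `ParsedLogPath`).
#[derive(Debug, Clone, PartialEq)]
struct LogFile {
    version: Version,
    is_checkpoint: bool,
}

#[derive(Debug, PartialEq)]
struct LogSegment {
    end_version: Version,
    commit_files: Vec<LogFile>,
    checkpoint_files: Vec<LogFile>,
}

/// Collects options first, then builds the segment in one place.
#[derive(Debug, Default)]
struct LogSegmentBuilder {
    start_version: Option<Version>,
    end_version: Option<Version>,
    omit_checkpoint_files: bool,
}

impl LogSegmentBuilder {
    fn new() -> Self {
        Self::default()
    }

    // Assumed option: lower bound (inclusive) on versions to include.
    fn with_start_version(mut self, version: Version) -> Self {
        self.start_version = Some(version);
        self
    }

    // Assumed option: upper bound (inclusive) on versions to include.
    fn with_end_version(mut self, version: Version) -> Self {
        self.end_version = Some(version);
        self
    }

    // Named in the PR; the exact signature here is an assumption.
    fn with_omit_checkpoint_files(mut self) -> Self {
        self.omit_checkpoint_files = true;
        self
    }

    /// Build from an in-memory listing (assumption: a real implementation
    /// would list the log directory instead).
    fn build(self, listing: Vec<LogFile>) -> Result<LogSegment, String> {
        let in_range = |v: Version| {
            self.start_version.map_or(true, |s| v >= s)
                && self.end_version.map_or(true, |e| v <= e)
        };
        let mut files: Vec<LogFile> =
            listing.into_iter().filter(|f| in_range(f.version)).collect();
        files.sort_by_key(|f| f.version);

        // Version of the latest checkpoint in range, unless checkpoints are omitted.
        let latest_checkpoint = if self.omit_checkpoint_files {
            None
        } else {
            files.iter().filter(|f| f.is_checkpoint).map(|f| f.version).max()
        };

        // Keep the latest checkpoint plus the commits after it; with no
        // checkpoint in play, keep every commit in range.
        let (checkpoint_files, commit_files): (Vec<LogFile>, Vec<LogFile>) = files
            .into_iter()
            .filter(|f| match (f.is_checkpoint, latest_checkpoint) {
                (true, Some(cp)) => f.version == cp,
                (true, None) => false,
                (false, Some(cp)) => f.version > cp,
                (false, None) => true,
            })
            .partition(|f| f.is_checkpoint);

        let end_version = commit_files
            .last()
            .or(checkpoint_files.last())
            .map(|f| f.version)
            .ok_or_else(|| "no log files in the requested version range".to_string())?;

        Ok(LogSegment { end_version, commit_files, checkpoint_files })
    }
}
```

Keeping the selection logic inside `build` means a second caller such as `TableChanges` could request a commit-only segment over a version range without duplicating the logic that `Snapshot` uses.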