Reporting: Lifetime stats

dyaffe commented 1 month ago

Goal

Understand how a data flow is progressing, and whether (and by how much) a materialization is behind a capture.

Proposal

We don’t need too much detail, just a running total for how much a collection has read / written.
If possible, reset that total on truncation, once we support truncation.
Show a % completed in the collections view, and potentially the materializations view, using this total.

jgraettinger commented 1 month ago

quick thoughts:

Bucket life-cycle policies remove data -- that means, if I create a materialization for a collection after the fact, I will never see as many documents read as have been written, and a % completion metric can never be accurate. This seems a likely potential source of confusion.
Is this really just a larger grain of time than "month"? "year"?
If we introduce a "compaction" feature for a collection, that also could reduce the number of docs / bytes I need to actually read -- though compaction can likely be framed as a truncation, which makes them the same problem.
These smell a bit like gauges (rather than counters) that are tracked and reported by tasks -- "I've captured this many docs / bytes since the binding was last truncated" or "I've read this many docs / bytes since I last saw a truncation for this binding"
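
To make the gauge framing concrete, here is a minimal sketch of a per-binding gauge that a task could keep and report. Everything in it is hypothetical: the `BindingGauge` name, its fields, and the `on_truncation` hook are assumptions for illustration, not existing Flow types or APIs.

```python
from dataclasses import dataclass


@dataclass
class BindingGauge:
    """Hypothetical gauge a task keeps for one binding."""

    docs: int = 0
    nbytes: int = 0

    def record(self, docs: int, nbytes: int) -> None:
        # Called as the task captures or reads documents for this binding.
        self.docs += docs
        self.nbytes += nbytes

    def on_truncation(self) -> None:
        # A truncation starts a new "lifetime", so the totals restart at zero.
        self.docs = 0
        self.nbytes = 0


gauge = BindingGauge()
gauge.record(docs=3, nbytes=300)
gauge.record(docs=2, nbytes=150)
print(gauge)  # BindingGauge(docs=5, nbytes=450)

gauge.on_truncation()
gauge.record(docs=1, nbytes=40)
print(gauge)  # BindingGauge(docs=1, nbytes=40)
```

Unlike a monotonic counter, the reported value can go down, which is what lets truncation (and compaction framed as truncation) reset it cleanly.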

jgraettinger commented 1 month ago

Discussed options:

Derivations / Materializations can self-report the maximum publication time they've read through in each transaction.
Derivations / Materializations can self-report the observed delta between journal read offset and journal write head (summed across all journals).
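
A rough sketch of how the two options above could turn into "how far behind" numbers. The type and function names, the journal names, and the use of plain integer offsets and epoch-second timestamps are all assumptions for illustration. None of this is an existing Flow or Gazette API.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class JournalProgress:
    """Hypothetical per-journal progress, as observed by one task."""

    read_offset: int  # byte offset the task has read through
    write_head: int   # byte offset of the journal's current write head


def bytes_behind(journals: Mapping[str, JournalProgress]) -> int:
    # Option 2: sum the (write head - read offset) delta across all journals.
    # Clamp at zero in case a stale write head is observed.
    return sum(max(0, j.write_head - j.read_offset) for j in journals.values())


def seconds_behind(max_publication_time: float, now: float) -> float:
    # Option 1: compare the maximum publication time read through so far
    # against the current time (both as epoch seconds here).
    return max(0.0, now - max_publication_time)


# Hypothetical journal names and offsets.
progress = {
    "acme/orders/part-0": JournalProgress(read_offset=900, write_head=1000),
    "acme/orders/part-1": JournalProgress(read_offset=500, write_head=500),
    "acme/orders/part-2": JournalProgress(read_offset=0, write_head=250),
}
print(bytes_behind(progress))  # 350
print(seconds_behind(max_publication_time=940.0, now=1000.0))  # 60.0
```

Both numbers describe only what is still readable, so they sidestep the bucket life-cycle problem: data already removed never appears in either delta.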
