umanwizard opened 1 year ago
Some relevant background: the data warehousing space has (probably many) existing idioms, one of which is Kimball's dimensional modeling. There are idioms for slowly changing dimensions that have the benefit that we wouldn't be inventing things from scratch. Potentially, they would be convertible into DD-style collections. Additionally, they probably resemble idioms that users would expect.
This is highly relevant to customer use cases as well. Customers often want to analyze historical data.
For option 1, we don't currently expose the ability to customize compaction to users, so `AS OF` isn't currently a viable option. We can discuss the pros and cons of allowing users to enable historical analysis using `AS OF` with custom compaction.
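For concreteness, here is a sketch of what the option 1 pattern looks like from the user's side, assuming (hypothetically) that compaction has been relaxed enough to retain the relevant history; the relation name and timestamp are illustrative only:

```sql
-- Point-in-time read: ask for the contents of the relation as of a past
-- timestamp. This only works if compaction has not advanced past it.
SELECT * FROM mz_cluster_replica_metrics
  AS OF TIMESTAMP '2024-01-01 00:00:00';

-- Stream all changes from a past timestamp onward, again subject to
-- compaction not having swallowed that history.
SUBSCRIBE TO mz_cluster_replica_metrics
  AS OF TIMESTAMP '2024-01-01 00:00:00';
```

The key property is that history lives in the differential timestamps themselves, so retention is governed by the compaction policy rather than by user-visible rows.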
For option 2, I'd say customers are most comfortable with this since it's how things work in other databases. However, we would need an easy way for users to handle "compaction" themselves in this case.
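To make option 2 concrete, a hedged sketch of the append-only log pattern follows; the table and column names are invented for illustration, and the final `DELETE` is exactly the "compaction" step that would become the user's (or our) responsibility:

```sql
-- An append-only log: each measurement is an ordinary row with an
-- explicit timestamp column, rather than living in the DD timestamps.
CREATE TABLE replica_metrics_log (
    occurred_at  timestamptz,
    replica_id   text,
    cpu_percent  double precision
);

-- Writers only ever INSERT; history accumulates as regular data.
INSERT INTO replica_metrics_log VALUES (now(), 'r1', 42.5);

-- "Compaction" is now manual: e.g., a periodic retention sweep.
DELETE FROM replica_metrics_log
WHERE occurred_at < now() - INTERVAL '30 days';
```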
I think if we ever do support user-configurable compaction (and @chaas's team does want to support it eventually), then we don't need to be opinionated here: users are free to use whatever approach they want. So the question here is more about how we choose to use the system ourselves. (Obviously, though, our experiences can and should inform product decisions about what to improve for users too.)
Copying in my comment from Slack:
I like option 1, and think we should figure out how to make it ergonomic.
Both of these methods seem to have their own merits. Option 1 seems more immediately useful to me as an analytics person, but as more of the ecosystem evolves toward streaming I could imagine option 2 also having a role. Ideally we don't close the door on either.
We don't have to be opinionated in the sense of restricting users, but `AS OF` avoids a footgun users will run into if they choose option 2: it is not possible (or at least not convenient) today to expire data from sources. Option 1 handles expiration automatically through compaction.
A few follow-up items on option 1 that were discussed in Slack:
`AS OF` could get tricky if, for whatever reason, we were to start batching changes. We currently process changes for each logical timestamp, and this is one more reason to keep doing so.

The difference that stands out to me about options 1 and 2 is that (1) does not allow historical modification and (2) does. At least, that is the difference in what they can communicate (otherwise they seem pretty similar). The flip side of the coin is that (1) communicates completeness for a time, and (2) does not.
My experience building the source overview page in the console is relevant here. Because Materialize does not produce a diff when there is no change in a row, we had to synthesize points that represent a steady state (usually 0 bytes/rows ingested, but not always). This is further complicated by having to figure out exactly which timestamp to assign to those points so that they align properly with other points. All of this would be avoided with option 2, where we just write down the values at each collection interval and know exactly when the collection happened.
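As an illustration of why option 2 sidesteps the point-alignment problem: with explicit event timestamps stored as data, bucketing onto a fixed grid is plain SQL. This sketch assumes an invented log-style table `replica_metrics_log(occurred_at, replica_id, cpu_percent)`:

```sql
-- Bucket each measurement into a one-minute interval and aggregate.
-- date_bin(stride, source, origin) snaps timestamps onto a fixed grid,
-- so all series share the same bucket boundaries by construction.
SELECT
    date_bin(INTERVAL '1 minute', occurred_at, TIMESTAMP '2000-01-01')
        AS minute,
    replica_id,
    avg(cpu_percent) AS avg_cpu
FROM replica_metrics_log
GROUP BY 1, 2;
```

With option 1, the equivalent alignment has to happen client-side, against whatever timestamps the system happened to emit diffs at.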
This came up on Slack in the context of converting `mz_cluster_replica_metrics` to a log (option 2) and then stabilizing it.
While my original preference was history retention (option 1), it's been well over a year since this issue was filed and we've made no progress on `CHANGES`. In the interest of pragmatism, I'm now in favor of converting all tables with historical data to logs (option 2). That will make it possible to write queries over historical data immediately, without blocking on the somewhat complicated change to the compute layer that `CHANGES` would be.
It has become clear that the pattern of using internal tables as a historical record of events is common in practice. For example, the historical record of resource usage metrics, a log of sink and source errors, and a log of cluster status changes (OOMs, crashes, etc.) either are already available in the web console or will be soon.
This historical data is exposed in at least two ways today:

1. Via the differential-dataflow timestamps of the collection, queried with time-travel syntax (`SELECT ... AS OF`, `SUBSCRIBE ... AS OF`, etc.). The user should consider the differential-dataflow timestamp field to be the timestamp of the event or measurement.
2. Via an append-only log relation, where the event or measurement timestamp is recorded as an ordinary column.

The main advantages of the second approach are that it does not require knowledge of exotic features, and that it allows further SQL transformation of the result set, including the timestamps (e.g., bucketing), whereas the first approach requires such logic to be done on the client. The main advantage of the first approach is that it doesn't require ad-hoc compaction/cleanup logic, since Differential already does this for us. Furthermore, any machinery developed to support this approach can be straightforwardly extended to help real customers/users whose workflows depend on historical queries.

One possible mitigation of the difficulties with approach 1 is a `CHANGES` operator that would surface the changes to a relation as a relation, which could then be further transformed. I am not sure whether we will ever have this feature or how difficult it would be to implement.

This issue is intended to centralize discussion about these two methods; the ultimate goal is to settle on one to be used throughout the codebase.