umanwizard opened 1 year ago
Some relevant background: the data warehousing space has (probably many) existing idioms, one of which is Kimball's dimensional modeling. There are idioms for slowly changing dimensions that have the benefit that we wouldn't be inventing things from scratch. Potentially, they would be convertible into DD-style collections. Additionally, they probably resemble idioms that users would expect.
This is highly relevant to customer use cases as well. Customers often want to analyze historical data.
For option 1, we don't currently expose the ability to customize compaction to users, so `AS OF` isn't currently a viable option. We can discuss the pros and cons of allowing users to enable historical analysis using `AS OF` with custom compaction.
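For concreteness, here is a sketch of what the option 1 pattern looks like from the user's side, assuming (hypothetically) that compaction has been relaxed enough to retain the relevant history; the relation name and timestamp are illustrative only:

```sql
-- Point-in-time read: ask for the contents of the relation as of a past
-- timestamp. This only works if compaction has not advanced past it.
SELECT * FROM mz_cluster_replica_metrics
  AS OF TIMESTAMP '2024-01-01 00:00:00';

-- Stream all changes from a past timestamp onward, again subject to
-- compaction not having swallowed that history.
SUBSCRIBE TO mz_cluster_replica_metrics
  AS OF TIMESTAMP '2024-01-01 00:00:00';
```

The key property is that history lives in the differential timestamps themselves, so retention is governed by the compaction policy rather than by user-visible rows.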
For option 2, I'd say customers are most comfortable with this since it's how things work in other databases. However, we would need an easy way for users to handle "compaction" themselves in this case.
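To make option 2 concrete, a hedged sketch of the append-only log pattern follows; the table and column names are invented for illustration, and the final `DELETE` is exactly the "compaction" step that would become the user's (or our) responsibility:

```sql
-- An append-only log: each measurement is an ordinary row with an
-- explicit timestamp column, rather than living in the DD timestamps.
CREATE TABLE replica_metrics_log (
    occurred_at  timestamptz,
    replica_id   text,
    cpu_percent  double precision
);

-- Writers only ever INSERT; history accumulates as regular data.
INSERT INTO replica_metrics_log VALUES (now(), 'r1', 42.5);

-- "Compaction" is now manual: e.g., a periodic retention sweep.
DELETE FROM replica_metrics_log
WHERE occurred_at < now() - INTERVAL '30 days';
```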
I think if we ever do support user-configurable compaction (and @chaas's team does want to support it eventually), then we don't need to be opinionated here: users are free to use whatever approach they want. So the question here is more about how we choose to use the system ourselves. (Obviously, though, our experiences can and should inform product decisions about what to improve for users too.)
Copying in my comment from Slack:
I like option 1, and think we should figure out how to make it ergonomic.
Both of these methods seem to have their own merits. Option 1 seems more immediately useful to me as an analytics person, but as more of the ecosystem evolves toward streaming I could imagine option 2 also having a role. Ideally we don't close the door on either.
We don't have to be opinionated in the sense of restricting users, but `AS OF` avoids a footgun users will run into if they choose option 2: it is not possible (or at least not convenient) today to expire data from sources. Option 1 handles expiration automatically through compaction.
A few follow-up items on option 1 that were discussed in Slack:
`AS OF` could get tricky if, for whatever reason, we were to start batching changes. We currently process changes for each logical timestamp, and this is one more reason to keep doing so.

The difference that stands out to me about options 1 and 2 is that (1) does not allow historical modification and (2) does. At least, that is the difference in what they can communicate (otherwise they seem pretty similar). The flip side of the coin is that (1) communicates completeness for a time, and (2) does not.
My experience building the source overview page in the console is relevant here. Because Materialize does not produce a diff when there is no change in a row, we had to synthesize points that represent a steady state (usually 0 bytes/rows ingested, but not always). This is further complicated by having to figure out exactly which timestamp to assign to those points so that they align properly with other points. All of this would be avoided with option 2, where we just write down the values at each collection interval and know exactly when the collection happened.
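As an illustration of why option 2 sidesteps the point-alignment problem: with explicit event timestamps stored as data, bucketing onto a fixed grid is plain SQL. This sketch assumes an invented log-style table `replica_metrics_log(occurred_at, replica_id, cpu_percent)`:

```sql
-- Bucket each measurement into a one-minute interval and aggregate.
-- date_bin(stride, source, origin) snaps timestamps onto a fixed grid,
-- so all series share the same bucket boundaries by construction.
SELECT
    date_bin(INTERVAL '1 minute', occurred_at, TIMESTAMP '2000-01-01')
        AS minute,
    replica_id,
    avg(cpu_percent) AS avg_cpu
FROM replica_metrics_log
GROUP BY 1, 2;
```

With option 1, the equivalent alignment has to happen client-side, against whatever timestamps the system happened to emit diffs at.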
This came up on Slack in the context of converting `mz_cluster_replica_metrics` to a log (option 2) and then stabilizing it.
While my original preference was history retention (option 1), it's been well over a year since this issue was filed and we've made no progress on `CHANGES`. In the interest of pragmatism, I'm now in favor of converting all tables with historical data to logs (option 2). That will make it possible to write queries over historical data immediately, without blocking on the somewhat complicated change to the compute layer that `CHANGES` would be.
It has become clear that the pattern of using internal tables as a historical record of events is common in practice. For example, the historical record of resource usage metrics, a log of sink and source errors, and a log of cluster status changes (OOMs, crashes, etc.) either are already available in the web console or will be soon.
This historical data is exposed in at least two ways today:

1. Via the differential-dataflow timestamps of the collection, queried with time-travel syntax (`SELECT ... AS OF`, `SUBSCRIBE ... AS OF`, etc.). The user should consider the differential-dataflow timestamp field to be the timestamp of the event or measurement.
2. Via an append-only log relation, where the event or measurement timestamp is recorded as an ordinary column.

The main advantages of the second approach are that it does not require knowledge of exotic features, and that it allows further SQL transformation of the result set, including the timestamps (e.g., bucketing), whereas the first approach requires such logic to be done on the client. The main advantage of the first approach is that it doesn't require ad-hoc compaction/cleanup logic, since Differential already does this for us. Furthermore, any machinery developed to support this approach can be straightforwardly extended to help real customers/users whose workflows depend on historical queries.

One possible mitigation of the difficulties with approach 1 is a `CHANGES` operator that would surface the changes to a relation as a relation, which could then be further transformed. I am not sure whether we will ever have this feature or how difficult it would be to implement.

This issue is intended to centralize discussion about these two methods; the ultimate goal is to settle on one to be used throughout the codebase.