tokoko opened 6 months ago
I think offline stores should assume that there are no overlaps
I assume that you mean that if a row for entity id X starts at time T, the row (for entity id X) with the next-oldest time on it is assumed to end at or before T even if its `event_expire_timestamp` is greater than T. In this regard, its semantics seem consistent with the semantics of other fields, so that makes sense.
I believe your approach would definitely be an improvement over the existing "TTL" mechanism in the offline store.
The other, similar issue is the confusion/inconsistency over whether `event_timestamp` is actually an event time or a processing time. Specifically, incremental materialization treats it as a processing time even though other parts of Feast arguably treat it as an event time.
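To illustrate what I mean, here is a rough pandas sketch of the window-selection idea (not Feast's actual materialization code; names and dates are made up):

```python
import pandas as pd

# Rough sketch of why incremental materialization implicitly treats event_timestamp as a
# processing time: it only picks up rows stamped after the last materialized point, so a
# late-arriving row carrying an older *event* time is silently skipped.
rows = pd.DataFrame({
    "driver_id": [1, 2],
    "event_timestamp": pd.to_datetime(["2024-03-01", "2024-02-15"]),  # driver 2's row arrived late
})

last_materialized_until = pd.Timestamp("2024-02-28")
now = pd.Timestamp("2024-03-02")

incremental_batch = rows[
    (rows["event_timestamp"] > last_materialized_until)
    & (rows["event_timestamp"] <= now)
]
print(incremental_batch)  # only driver 1; driver 2's late row is missed despite arriving in this run
```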
I assume that you mean that if a row for entity id X starts at time T, the row (for entity id X) with the next-oldest time on it is assumed to end at or before T even if its `event_expire_timestamp` is greater than T. In this regard, its semantics seem consistent with the semantics of other fields, so that makes sense.
Actually I meant that the row with the next-oldest time must have `event_expire_timestamp <= T`. If it has a value greater than T, the `get_historical_features` query might return duplicates. This way the offline store query will look something like this: `entity_df_timestamp BETWEEN event_timestamp AND event_expire_timestamp`, without any additional window operations to deduplicate rows. Do you think that might be too strict of a requirement?
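To make that concrete, here's a minimal pandas sketch (not actual Feast code; table and column values are made up) of what retrieval boils down to when the intervals don't overlap:

```python
import pandas as pd

# Hypothetical SCD2-style feature table: one row per (driver_id, validity interval),
# with intervals assumed to be non-overlapping per entity.
features = pd.DataFrame({
    "driver_id": [1, 1, 2],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-01"]),
    "event_expire_timestamp": pd.to_datetime(["2024-02-01", "2024-03-01", "2024-02-01"]),
    "trips_last_month": [10, 14, 7],
})

# Entity dataframe with the timestamps we want features "as of".
entity_df = pd.DataFrame({
    "driver_id": [1, 2],
    "entity_timestamp": pd.to_datetime(["2024-02-15", "2024-01-20"]),
})

# Point-in-time join: because intervals don't overlap, a plain range filter yields at
# most one feature row per entity row, so no window/rank deduplication step is needed.
# (Half-open interval here so adjacent rows don't both match exactly at a boundary.)
joined = entity_df.merge(features, on="driver_id")
result = joined[
    (joined["entity_timestamp"] >= joined["event_timestamp"])
    & (joined["entity_timestamp"] < joined["event_expire_timestamp"])
]
print(result)  # one row per entity, already deduplicated "for free"
```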
If the Datasource is of type SCD 2, it simplifies the offline retrieval logic as you described. This is definitely not helpful for rapidly changing data sources; such Datasources will continue to follow the current model (without `event_expire_timestamp`). It may not be worth implementing because:

1. TTL for this type of Datasource in the Feature View will be 0. If we don't implement the `event_expire_timestamp` logic, partitioning by the entity key list and ranking within partitions is the extra computation (see the sketch after this list). And since `get_historical_features` retrieves only a subset of data for the set of entity keys defined in `entity_df`, a lot of filtering happens before the partitioning and ranking, so it should not be a large performance hit.
2. Datasource types which don't support row updates directly (for example Spark backed by Parquet files; Iceberg is an exception) need extra processing logic before populating the offline store and may end up overwriting Spark tables during the feature engineering process.
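For comparison, the extra partition-and-rank step under the current TTL model looks roughly like this (a pandas sketch with made-up names and values, not Feast's actual query):

```python
import pandas as pd

# Append-only feature table under the current model: multiple rows per entity,
# only event_timestamp, expiry handled by a single feature-view-level TTL.
features = pd.DataFrame({
    "driver_id": [1, 1, 2],
    "event_timestamp": pd.to_datetime(["2024-01-20", "2024-02-01", "2024-01-01"]),
    "trips_last_month": [10, 14, 7],
})

entity_df = pd.DataFrame({
    "driver_id": [1, 2],
    "entity_timestamp": pd.to_datetime(["2024-02-15", "2024-01-20"]),
})

ttl = pd.Timedelta(days=31)

# Filter against entity_df first (this is where most of the data is discarded)...
joined = entity_df.merge(features, on="driver_id")
candidates = joined[
    (joined["event_timestamp"] <= joined["entity_timestamp"])
    & (joined["event_timestamp"] >= joined["entity_timestamp"] - ttl)
]

# ...then the extra step the SCD2 model would avoid: partition per (entity, entity
# timestamp) and keep only the most recent candidate row.
latest_per_entity = (
    candidates.sort_values("event_timestamp")
    .groupby(["driver_id", "entity_timestamp"])
    .tail(1)
)
print(latest_per_entity)
```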
This is definitely not helpful for rapidly changing data sources; such Datasources will continue to follow the current model (without `event_expire_timestamp`).
Agreed, those types of FVs should continue with the current model. To give you a bit of context here, we have a number of FVs that are calculated monthly (and can have the start of the month and the end of the month in `event_timestamp` and `event_expire_timestamp` respectively). It's pretty awkward to model them using a single-TTL approach. That's the kind of use case I'm trying to target here.
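For example (values made up), the rows for one of these monthly FVs would look like:

```python
# Made-up monthly rows for one entity: validity runs from the start to the end of each month.
monthly_rows = [
    {"driver_id": 1, "event_timestamp": "2024-01-01", "event_expire_timestamp": "2024-01-31", "trips": 120},
    {"driver_id": 1, "event_timestamp": "2024-02-01", "event_expire_timestamp": "2024-02-29", "trips": 95},
]
```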
a lot of filtering happens before the partitioning and ranking, so it should not be a large performance hit.
Sometimes even after filtering for a specific `entity_df`, you end up with large amounts of data though. I'm not saying it will be a huge performance hit for all queries, but I think it's best to avoid unnecessary steps after retrieval when we can; less surface area for something to go wrong.
2. Datasource types which don't support row updates directly (for example Spark backed by Parquet files; Iceberg is an exception) need extra processing logic before populating the offline store and may end up overwriting Spark tables during the feature engineering process.
True, but that's up to the user, right? They should use the best data source format for the feature view at hand. Even without Feast in the picture, one would probably use Iceberg/Delta as the underlying format for an SCD table anyway. (I've done SCD with plain Parquet files using the Hive metastore to make updates almost "atomic"; this predated Delta and Iceberg though... not recommended.)
It may not be worth implementing because
Although I agree with most of your statements individually, do you think the reasons listed above are enough not to implement this? I feel like there is a clear subset of use cases (FVs that don't change quickly or aren't refreshed frequently) for which this will be a win.
Thanks for sharing the context. I'm not against the implementation of this feature. If there are use cases, it would definitely help to implement this. Support for multiple data models always helps.
Agreed about TTLs being an awkward way to model things. In your example, if the monthly data processing is delayed for some reason, your queries might think that there are zero records in the data because they are beyond the TTL horizon, which is risky.
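To put rough numbers on that risk (all made up):

```python
import pandas as pd

# The January row falls outside a 31-day TTL before the delayed February refresh
# lands, so a query in that window sees no value at all for the entity.
latest_row_event_time = pd.Timestamp("2024-01-01")
ttl = pd.Timedelta(days=31)
query_time = pd.Timestamp("2024-02-05")  # February load is late

still_within_ttl = (query_time - latest_row_event_time) <= ttl
print(still_within_ttl)  # False -> the entity silently gets no feature value
```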
I think the trickiest part about requiring updates to rows is that the mutation is no longer context-free:
In your example, if the monthly data processing is delayed for some reason, your queries might think that there are zero records in the data because they are beyond the TTL horizon, which is risky.
True, but to be fair that could happen now as well if you're processing monthly with a table-level ttl of 31 days or something similar. This usually doesn't happen to us as recent feature values tend not to be used for training very quickly, but it's certainly a risk. Not sure if feast can somehow help here, for example detect this (easier said than done?) and throw a warning in such a case.
I think the trickiest part about requiring updates to rows is that the mutation is no longer context-free:
I don't really disagree on any of these points, but I feel like those concerns are beside the point as those processes lie outside of feast. Managing Type 2 SCD is certainly more challenging than an append-only dataset, but there still are use cases for which it's more appropriate and we still decide to do it, right? At least, that's the case from my experience.
True, but to be fair that could happen now as well if you're processing monthly with a table-level ttl of 31 days or something similar.
Yes, exactly. IMO, we should avoid using TTLs as a way to ensure that old data is superseded by new data. Feast should provide ways to update the offline & online stores which don't rely on a tight coupling between a TTL value and a rigid timing of successfully completing a scheduled task. I think the main thing missing from Feast in this regard is the ability to represent a row which has been deleted, or some other way of modeling non-incremental full snapshots. For example, Materialization could ensure that all data in the Online Store has been replaced by the latest full snapshot in the Offline Store (without relying on TTLs to get rid of old data). Or, the Offline Store could understand incremental deletions and thereby Materialization could apply those deletions to the Online Store.
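One hypothetical way to represent that, purely for illustration (`is_deleted` is a made-up column, not anything Feast has today):

```python
# Purely illustrative: applying an incremental batch that carries explicit deletion
# markers to an online store (here just a dict).
online_store: dict[int, dict] = {}

incremental_batch = [
    {"driver_id": 1, "trips_last_month": 14, "is_deleted": False},
    {"driver_id": 2, "trips_last_month": None, "is_deleted": True},
]

for row in incremental_batch:
    if row["is_deleted"]:
        # apply the deletion explicitly instead of waiting for a TTL to hide the old value
        online_store.pop(row["driver_id"], None)
    else:
        online_store[row["driver_id"]] = {"trips_last_month": row["trips_last_month"]}

print(online_store)  # {1: {'trips_last_month': 14}}
```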
but I feel like those concerns are beside the point as those processes lie outside of feast. Managing Type 2 SCD is certainly more challenging than an append-only dataset, but there still are use cases for which it's more appropriate and we still decide to do it, right?
Perhaps, but I think they could have a big impact on the feasibility of adopting/using Feast, if there are no good reusable solutions to those problems or if they present difficult challenges to operations.
Your original description says:
I think offline stores should assume that there are no overlaps
I think that should be left up to the Offline Store implementation to determine what the best approach is. Different implementations could implement the specified functionality in different ways. I think, assuming that your `event_expire_timestamp` is semantically sound, the proposal doesn't necessarily need to be opinionated about the implementations.
**Is your feature request related to a problem? Please describe.**
Feast only supports specifying a single `event_timestamp` value for each row in a data source. While Feast also supports expiring these values, this is managed with a single feature-view-level ttl value. That is useful for a lot of use cases where feature computations happen sporadically, but doesn't really make sense when feature computations are done as part of periodic (daily, monthly) ETL processes, or when features are already in place in the data warehouse as type 2 SCD tables.

**Describe the solution you'd like**
Add an option to specify a column for row expiration in data sources (something like `event_expire_timestamp`). The user will be able to specify either this column or a ttl value, and offline store engines will have to handle both scenarios. In case `event_expire_timestamp` is provided, I think offline stores should assume that there are no overlaps, as in most scenarios users will have a much easier time making sure there are no overlaps in their SCD type 2 tables themselves. This assumption will simplify offline store point-in-time logic quite a bit, as the offline store engine won't have to do additional window operations to get rid of duplicates.
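For illustration, a hypothetical sketch of what the definition might look like (the `event_expire_timestamp_column` argument does not exist today and is only the proposed option; the surrounding parameter names follow the newer Feast API and may differ across versions):

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Int64

driver = Entity(name="driver", join_keys=["driver_id"])

# Hypothetical: `event_expire_timestamp_column` is the proposed option and does not
# exist in Feast today; everything else follows the usual FileSource/FeatureView pattern.
driver_monthly_source = FileSource(
    path="data/driver_monthly_stats.parquet",
    timestamp_field="event_timestamp",
    # event_expire_timestamp_column="event_expire_timestamp",  # proposed
)

driver_monthly_fv = FeatureView(
    name="driver_monthly_stats",
    entities=[driver],
    schema=[Field(name="trips_last_month", dtype=Int64)],
    ttl=timedelta(days=31),  # today's alternative: one table-level TTL instead of per-row expiry
    source=driver_monthly_source,
)
```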