tokoko opened 6 months ago
I think offline stores should assume that there are no overlaps
I assume that you mean that if a row for entity id X starts at time T, the row (for entity id X) with the next-oldest time on it is assumed to end at or before T even if its `event_expire_timestamp` is greater than T. In this regard, its semantics seem consistent with the semantics of other fields, so that makes sense.
I believe your approach would definitely be an improvement over the existing "TTL" mechanism in the offline store.
The other, similar issue is the confusion/inconsistency over whether `event_timestamp` is actually an event time or a processing time. Specifically, incremental materialization treats it as a processing time even though other parts of Feast arguably treat it as an event time.
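To illustrate what I mean, here is a rough pandas sketch of the window-selection idea (not Feast's actual materialization code; names and dates are made up):

```python
import pandas as pd

# Rough sketch of why incremental materialization implicitly treats event_timestamp as a
# processing time: it only picks up rows stamped after the last materialized point, so a
# late-arriving row carrying an older *event* time is silently skipped.
rows = pd.DataFrame({
    "driver_id": [1, 2],
    "event_timestamp": pd.to_datetime(["2024-03-01", "2024-02-15"]),  # driver 2's row arrived late
})

last_materialized_until = pd.Timestamp("2024-02-28")
now = pd.Timestamp("2024-03-02")

incremental_batch = rows[
    (rows["event_timestamp"] > last_materialized_until)
    & (rows["event_timestamp"] <= now)
]
print(incremental_batch)  # only driver 1; driver 2's late row is missed despite arriving in this run
```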
I assume that you mean that if a row for entity id X starts at time T, the row (for entity id X) with the next-oldest time on it is assumed to end at or before T even if its `event_expire_timestamp` is greater than T. In this regard, its semantics seem consistent with the semantics of other fields, so that makes sense.
Actually I meant that the row with the next-oldest time must have `event_expire_timestamp <= T`. If it has a value greater than T, the `get_historical_features` query might return duplicates. This way the offline store query will look something like this: `entity_df_timestamp BETWEEN event_timestamp AND event_expire_timestamp`, without any additional window operations to deduplicate rows. Do you think that might be too strict of a requirement?
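To make that concrete, here's a minimal pandas sketch (not actual Feast code; table and column values are made up) of what retrieval boils down to when the intervals don't overlap:

```python
import pandas as pd

# Hypothetical SCD2-style feature table: one row per (driver_id, validity interval),
# with intervals assumed to be non-overlapping per entity.
features = pd.DataFrame({
    "driver_id": [1, 1, 2],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-01"]),
    "event_expire_timestamp": pd.to_datetime(["2024-02-01", "2024-03-01", "2024-02-01"]),
    "trips_last_month": [10, 14, 7],
})

# Entity dataframe with the timestamps we want features "as of".
entity_df = pd.DataFrame({
    "driver_id": [1, 2],
    "entity_timestamp": pd.to_datetime(["2024-02-15", "2024-01-20"]),
})

# Point-in-time join: because intervals don't overlap, a plain range filter yields at
# most one feature row per entity row, so no window/rank deduplication step is needed.
# (Half-open interval here so adjacent rows don't both match exactly at a boundary.)
joined = entity_df.merge(features, on="driver_id")
result = joined[
    (joined["entity_timestamp"] >= joined["event_timestamp"])
    & (joined["entity_timestamp"] < joined["event_expire_timestamp"])
]
print(result)  # one row per entity, already deduplicated "for free"
```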
If the Datasource is of type SCD 2, it simplifies the offline retrieval logic as you described. This is definitely not helpful for rapidly changing data sources; such Datasources will continue to follow the current model (without `event_expire_timestamp`). It may not be worth implementing because:

1. TTL for this type of Datasource in the Feature View will be 0. If we don't implement the `event_expire_timestamp` logic, partitioning by the entity key list and ranking within partitions is the extra computation (see the sketch after this list). And since `get_historical_features` retrieves only a subset of data for the set of entity keys defined in `entity_df`, a lot of filtering happens before the partitioning and ranking, so it should not be a large performance hit.
2. Datasource types which don't support row updates directly (for example Spark backed by Parquet files; Iceberg is an exception) need extra processing logic before populating the offline store and may end up overwriting Spark tables during the feature engineering process.
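For comparison, the extra partition-and-rank step under the current TTL model looks roughly like this (a pandas sketch with made-up names and values, not Feast's actual query):

```python
import pandas as pd

# Append-only feature table under the current model: multiple rows per entity,
# only event_timestamp, expiry handled by a single feature-view-level TTL.
features = pd.DataFrame({
    "driver_id": [1, 1, 2],
    "event_timestamp": pd.to_datetime(["2024-01-20", "2024-02-01", "2024-01-01"]),
    "trips_last_month": [10, 14, 7],
})

entity_df = pd.DataFrame({
    "driver_id": [1, 2],
    "entity_timestamp": pd.to_datetime(["2024-02-15", "2024-01-20"]),
})

ttl = pd.Timedelta(days=31)

# Filter against entity_df first (this is where most of the data is discarded)...
joined = entity_df.merge(features, on="driver_id")
candidates = joined[
    (joined["event_timestamp"] <= joined["entity_timestamp"])
    & (joined["event_timestamp"] >= joined["entity_timestamp"] - ttl)
]

# ...then the extra step the SCD2 model would avoid: partition per (entity, entity
# timestamp) and keep only the most recent candidate row.
latest_per_entity = (
    candidates.sort_values("event_timestamp")
    .groupby(["driver_id", "entity_timestamp"])
    .tail(1)
)
print(latest_per_entity)
```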
This is definitely not helpful for rapidly changing data sources; such Datasources will continue to follow the current model (without `event_expire_timestamp`).
Agreed, those types of FVs should continue with the current model. To give you a bit of context here, we have a number of FVs that are calculated monthly (and can have the start of the month and the end of the month in `event_timestamp` and `event_expire_timestamp` respectively). It's pretty awkward to model them using a single-TTL approach. That's the kind of use case I'm trying to target here.
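For example (values made up), the rows for one of these monthly FVs would look like:

```python
# Made-up monthly rows for one entity: validity runs from the start to the end of each month.
monthly_rows = [
    {"driver_id": 1, "event_timestamp": "2024-01-01", "event_expire_timestamp": "2024-01-31", "trips": 120},
    {"driver_id": 1, "event_timestamp": "2024-02-01", "event_expire_timestamp": "2024-02-29", "trips": 95},
]
```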
a lot of filtering happens before the partitioning and ranking, so it should not be a large performance hit.
Sometimes even after filtering for a specific `entity_df`, you end up with large amounts of data though. I'm not saying it will be a huge performance hit for all queries, but I think it's best to avoid unnecessary steps after retrieval when we can; less surface area for something to go wrong.
2. Datasource types which don't support row updates directly (for example Spark backed by Parquet files; Iceberg is an exception) need extra processing logic before populating the offline store and may end up overwriting Spark tables during the feature engineering process.
True, but that's up to the user, right? They should use the best data source format for the feature view at hand. Even without Feast in the picture, one would probably use Iceberg/Delta as the underlying format for an SCD table anyway. (I've done SCD with plain Parquet files using the Hive metastore to make updates almost "atomic"; this predated Delta and Iceberg though... not recommended.)
It may not be worth implementing because
Although I agree with most of your statements individually, do you think the reasons listed above are enough not to implement this? I feel like there is a clear subset of use cases (FVs that don't change quickly or aren't refreshed frequently) for which this will be a win.
Thanks for sharing the context. I'm not against the implementation of this feature. If there are use cases, it would definitely help to implement this. Support for multiple data models always helps.
Agreed about TTLs being an awkward way to model things. In your example, if the monthly data processing is delayed for some reason, your queries might think that there are zero records in the data because they are beyond the TTL horizon, which is risky.
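To put rough numbers on that risk (all made up):

```python
import pandas as pd

# The January row falls outside a 31-day TTL before the delayed February refresh
# lands, so a query in that window sees no value at all for the entity.
latest_row_event_time = pd.Timestamp("2024-01-01")
ttl = pd.Timedelta(days=31)
query_time = pd.Timestamp("2024-02-05")  # February load is late

still_within_ttl = (query_time - latest_row_event_time) <= ttl
print(still_within_ttl)  # False -> the entity silently gets no feature value
```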
I think the trickiest part about requiring updates to rows is that the mutation is no longer context-free:
In your example, if the monthly data processing is delayed for some reason, your queries might think that there are zero records in the data because they are beyond the TTL horizon, which is risky.
True, but to be fair that could happen now as well if you're processing monthly with a table-level ttl of 31 days or something similar. This usually doesn't happen to us as recent feature values tend not to be used for training very quickly, but it's certainly a risk. Not sure if feast can somehow help here, for example detect this (easier said than done?) and throw a warning in such a case.
I think the trickiest part about requiring updates to rows is that the mutation is no longer context-free:
I don't really disagree on any of these points, but I feel like those concerns are beside the point as those processes lie outside of feast. Managing Type 2 SCD is certainly more challenging than an append-only dataset, but there still are use cases for which it's more appropriate and we still decide to do it, right? At least, that's the case from my experience.
True, but to be fair that could happen now as well if you're processing monthly with a table-level ttl of 31 days or something similar.
Yes, exactly. IMO, we should avoid using TTLs as a way to ensure that old data is superseded by new data. Feast should provide ways to update the offline & online stores which don't rely on a tight coupling between a TTL value and a rigid timing of successfully completing a scheduled task. I think the main thing missing from Feast in this regard is the ability to represent a row which has been deleted, or some other way of modeling non-incremental full snapshots. For example, Materialization could ensure that all data in the Online Store has been replaced by the latest full snapshot in the Offline Store (without relying on TTLs to get rid of old data). Or, the Offline Store could understand incremental deletions and thereby Materialization could apply those deletions to the Online Store.
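One hypothetical way to represent that, purely for illustration (`is_deleted` is a made-up column, not anything Feast has today):

```python
# Purely illustrative: applying an incremental batch that carries explicit deletion
# markers to an online store (here just a dict).
online_store: dict[int, dict] = {}

incremental_batch = [
    {"driver_id": 1, "trips_last_month": 14, "is_deleted": False},
    {"driver_id": 2, "trips_last_month": None, "is_deleted": True},
]

for row in incremental_batch:
    if row["is_deleted"]:
        # apply the deletion explicitly instead of waiting for a TTL to hide the old value
        online_store.pop(row["driver_id"], None)
    else:
        online_store[row["driver_id"]] = {"trips_last_month": row["trips_last_month"]}

print(online_store)  # {1: {'trips_last_month': 14}}
```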
but I feel like those concerns are beside the point as those processes lie outside of feast. Managing Type 2 SCD is certainly more challenging than an append-only dataset, but there still are use cases for which it's more appropriate and we still decide to do it, right?
Perhaps, but I think they could have a big impact on the feasibility of adopting/using Feast, if there are no good reusable solutions to those problems or if they present difficult challenges to operations.
Your original description says:
I think offline stores should assume that there are no overlaps
I think that should be left up to the Offline Store implementation to determine what the best approach is. Different implementations could implement the specified functionality in different ways. I think, assuming that your `event_expire_timestamp` is semantically sound, the proposal doesn't necessarily need to be opinionated about the implementations.
**Is your feature request related to a problem? Please describe.**
Feast only supports specifying a single `event_timestamp` value for each row in a data source. While Feast also supports expiring these values, this is managed with a single feature-view-level ttl value. That is useful for a lot of use cases where feature computations happen sporadically, but doesn't really make sense when feature computations are done as part of periodic (daily, monthly) ETL processes, or when features are already in place in the data warehouse as type 2 SCD tables.

**Describe the solution you'd like**
Add an option to specify a column for row expiration in data sources (something like `event_expire_timestamp`). The user will be able to specify either this column or a ttl value, and offline store engines will have to handle both scenarios. In case `event_expire_timestamp` is provided, I think offline stores should assume that there are no overlaps, as in most scenarios users will have a much easier time making sure there are no overlaps in their SCD type 2 tables themselves. This assumption will simplify offline store point-in-time logic quite a bit, as the offline store engine won't have to do additional window operations to get rid of duplicates.
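For illustration, a hypothetical sketch of what the definition might look like (the `event_expire_timestamp_column` argument does not exist today and is only the proposed option; the surrounding parameter names follow the newer Feast API and may differ across versions):

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Int64

driver = Entity(name="driver", join_keys=["driver_id"])

# Hypothetical: `event_expire_timestamp_column` is the proposed option and does not
# exist in Feast today; everything else follows the usual FileSource/FeatureView pattern.
driver_monthly_source = FileSource(
    path="data/driver_monthly_stats.parquet",
    timestamp_field="event_timestamp",
    # event_expire_timestamp_column="event_expire_timestamp",  # proposed
)

driver_monthly_fv = FeatureView(
    name="driver_monthly_stats",
    entities=[driver],
    schema=[Field(name="trips_last_month", dtype=Int64)],
    ttl=timedelta(days=31),  # today's alternative: one table-level TTL instead of per-row expiry
    source=driver_monthly_source,
)
```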