breno-costa opened 6 months ago
I need to spend more time thinking about this, but I do agree the `ttl` at the `FeatureView` level is misleading, as I had this exact experience in my last role and it caused me some headaches. I think renaming it to follow industry conventions would be good.
I'm not a big fan of a `lookback` name, tbh. I think it only makes sense if the online store is the only component we're focusing on. If we set aside online stores for a minute, during a `get_historical_features` call the way the offline store handles the rows sounds compatible with the meaning of `ttl` to me. When a client asks for a feature value for a specific point in time, only the values that haven't expired at that time are considered. The fact that it's doing a "lookback" to do that is an implementation detail.
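To make that concrete, here is a minimal sketch of the point-in-time semantics described above, using plain pandas rather than Feast's actual offline store code (the column names and helper function are purely illustrative):

```python
from datetime import timedelta

import pandas as pd


# Illustrative stand-in for how ttl behaves in a point-in-time lookup;
# this is plain pandas, not Feast's actual offline store implementation.
def point_in_time_lookup(feature_df, request_ts, ttl):
    # Only rows that happened at or before the requested time are candidates...
    candidates = feature_df[feature_df["event_timestamp"] <= request_ts]
    # ...and only if they have not expired as of that time (the "lookback").
    candidates = candidates[candidates["event_timestamp"] >= request_ts - ttl]
    if candidates.empty:
        return None  # every value had already expired at request_ts
    # The freshest non-expired value wins.
    return candidates.sort_values("event_timestamp").iloc[-1]


features = pd.DataFrame({
    "driver_id": [1001, 1001],
    "event_timestamp": pd.to_datetime(["2024-01-01 09:00", "2024-01-03 09:00"]),
    "conv_rate": [0.50, 0.75],
})

# As of Jan 4 the Jan 3 value is still valid; as of Jan 10 it has expired.
print(point_in_time_lookup(features, pd.Timestamp("2024-01-04 09:00"), timedelta(days=2)))
print(point_in_time_lookup(features, pd.Timestamp("2024-01-10 09:00"), timedelta(days=2)))
```

The second lookup returns None because, as of the requested timestamp, every value is older than the ttl; that is the "expiry" behavior the offline path already has.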
I think my preferred approach would be to fix the underlying issue rather than change the parameter name. The main problem here is that, as the issue indicated, `ttl` is not handled the same way in the offline and online flows. It's ignored in online stores 😄. I understand there are good reasons why it might be hard to actually delete expired rows from online storage during materialization, but what we can do is discard expired values afterwards, in the online store logic itself. Once we have that, `ttl` wouldn't be so misleading anymore.
Fixing `ttl` to behave as expected would be ideal. I haven't used the offline store as much, but if it's using the `ttl` as expected then I agree with your approach.
If you look at the documentation it says:
Feature views consist of: ... (optional) a TTL, which limits how far back Feast will look when generating historical datasets
According to Wikipedia:
Time to live (TTL) or hop limit is a mechanism which limits the lifespan or lifetime of data in a computer or network...The Time to Live is an indication of an upper bound on the lifetime of an internet datagram.
And in HTTP:
Time to live may also be expressed as the date and time on which a record expires. The Expires: header in HTTP responses, the Cache-Control: max-age header field in both requests and responses and the expires field in HTTP cookies express time-to-live in this way.
So it would be rational for the `ttl` for an online Feature View to behave as an "upper bound on the lifetime of data in a database."
My suggestion:
- Rename `ttl` to `offline_store_ttl` or `offline_ttl` to make this name more intuitive and explicit
- Add `online_store_ttl` or `online_ttl` and replicate the HTTP behavior by:
  1. Returning `None` or `Expired` along with some metadata when calling `get_online_features`
  2. Deleting expired records in `get_online_features` after the read is received
  3. Adding a `feast expire feature_view` command that users could run on some schedule

Thoughts @tokoko @HaoXuAI?
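To illustrate what the read-time expiry in options 1 and 2 could look like, here is a rough sketch; the record shape and function name are hypothetical, not Feast's actual online store API:

```python
from datetime import datetime, timedelta, timezone


def filter_expired(stored, online_ttl, now=None):
    """Return the stored value only if it has not outlived online_ttl.

    `stored` is a hypothetical record shape: {"value": ..., "event_timestamp": ...}.
    This mimics the HTTP max-age idea: expiry is enforced on read, even though
    the row may physically still exist in the online store.
    """
    if stored is None:
        return None
    now = now or datetime.now(timezone.utc)
    if now - stored["event_timestamp"] > online_ttl:
        return None  # could instead return an "Expired" marker plus metadata
    return stored["value"]


stored = {"value": 0.75,
          "event_timestamp": datetime(2024, 1, 1, tzinfo=timezone.utc)}

print(filter_expired(stored, timedelta(days=7),
                     now=datetime(2024, 1, 5, tzinfo=timezone.utc)))  # 0.75
print(filter_expired(stored, timedelta(days=7),
                     now=datetime(2024, 2, 1, tzinfo=timezone.utc)))  # None
```

Whether the caller sees `None` or an explicit "Expired" marker with metadata is exactly the naming/behavior question raised above.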
Regarding option 2, there are situations where features will never be fetched again for a given entity key.
Example: imagine that you have features calculated for a customer entity to be used in your product. However, some customers cancel their accounts, so you no longer need to run inference or generate features for them.
Materialization will no longer update features for those customers, and inference endpoints will not call the `get_online_features` function for them anymore, so the old data will remain in the online store forever unless some cleanup is done.
I'm in favor of a mix between options 2 and 3:
- We store ttl information (`expire_date`, for example) in the online store for every entity in the feature view during materialization.
- We add a `feast cleanup` command that will physically remove all expired data from an online store. (This is only relevant for an online store; in the offline store nothing's really ever expired, as the user might always want to query past information.) This command should be only for housekeeping and not affect online store behavior at all; in other words, it will only remove the feature values from the online store that would be discarded by the online store read method anyway.

Yeah, agreed. Approach (2) + (3) is the right one.
Only thing left to decide is naming conventions... do you all have any opinions here?
For example, we could continue to keep the name `ttl` and just make the behavior more obvious (and document it) within each respective function call (i.e., making `ttl` behave as expected for `get_online_features` and `get_offline_features`).
Or we could go the route of `online_ttl` and renaming the current `ttl` to `offline_ttl`.
And I do like `feast cleanup`, but that may also make the user think more is happening than dropping records. Not sure.
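For what it's worth, a minimal sketch of what such a cleanup could amount to, assuming expiry metadata has been written at materialization time (the in-memory store and field names here are hypothetical):

```python
from datetime import datetime, timezone

# Hypothetical in-memory stand-in for an online store keyed by entity key.
# Each row carries the expire_date metadata written at materialization time.
online_store = {
    ("driver", 1001): {"conv_rate": 0.75,
                       "expire_date": datetime(2024, 1, 8, tzinfo=timezone.utc)},
    ("driver", 1002): {"conv_rate": 0.42,
                       "expire_date": datetime(2024, 3, 1, tzinfo=timezone.utc)},
}


def cleanup(store, now=None):
    """Physically drop rows whose expire_date has passed.

    Pure housekeeping: it removes only rows that a ttl-aware read would
    already be discarding, so read results are unchanged before and after.
    """
    now = now or datetime.now(timezone.utc)
    expired_keys = [key for key, row in store.items() if row["expire_date"] <= now]
    for key in expired_keys:
        del store[key]
    return len(expired_keys)


removed = cleanup(online_store, now=datetime(2024, 2, 1, tzinfo=timezone.utc))
print(removed)             # 1 (driver 1001 had expired)
print(list(online_store))  # [('driver', 1002)]
```

Since it only drops rows a ttl-aware read would already ignore, the command is pure housekeeping, which argues for a modest name.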
- We store ttl information (expire_date for example) in the online store for every entity in the feature view during materialization.
I would recommend we store the `expire_date` in the feature view as metadata. Changes to the expiration will be more straightforward that way.
@franciscojavierarceo For some reason, I assumed an entity timestamp was not part of the online store, my bad. If we already have an entity timestamp in there and there's a ttl field in a feature view, that first point is redundant: the online store can decide whether values are expired or not based on those two.
I believe it probably would be an improvement to permit the offline data to include an explicit "end_event_timestamp" column, which users could leverage to achieve a TTL effect if they want it.
I do not think it would be good to continue calculating TTL by adding a duration to the timestamp that is currently used to keep track of processing-time logic such as incremental materializations. I think the current "TTL" based mechanic for the offline store data (including historical queries and materialization) should be regarded as a quirk or compromise of particular offline storage implementations, and should therefore be eventually moved out of Feast's general API contract in order to make room for more capable implementations. Since it operates on the same timestamp as the incremental materializations do, it means that the concept of processing time (the time that the system handled the event - for example, the time that the row was written to the store) is unrecoverably mixed/confused with event time (the time the event actually occurred in the real world).
For example, an incremental materialization job is interested in knowing what has changed since the last processing time that the previous materialization job completed. But, from a query perspective, we may not care about the order in which the data arrived, or the particular delays that each row was subject to. The querier may only care about the event time.
The existing TTL setting for the offline store is also inconsistent with the TTL configuration/functionality (if supported) of the online store. This can be confusing, and it also means that the online store and offline store disagree in their query responses, which ideally should not happen. Look, for example, at the Redis implementation.
I am hopeful that by adding support for newer offline store technology such as Apache Iceberg (which is capable of efficiently representing non-destructive deletions and responding to time-travel queries for things like incremental materialization) and by carefully implementing Feast's corresponding online & offline stores, we can work toward a more accurate and consistent data representation (one which, for example, supports deletes not just inserts, updates, and TTLs). Basically, some data sets might find a use for TTL (to expire data in the future). But, we should regard TTL as a special case of a more general idea of facts ceasing to be true.
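To sketch the "facts ceasing to be true" framing in code (plain pandas, with `end_event_timestamp` taken from the suggestion above and everything else illustrative):

```python
import pandas as pd

# Offline rows carrying an explicit validity interval. A ttl is just the special
# case where end_event_timestamp = event_timestamp + ttl; deletions and
# corrections are the general case of a fact ceasing to be true.
rows = pd.DataFrame({
    "driver_id": [1001, 1001],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-05"]),
    "end_event_timestamp": pd.to_datetime(["2024-01-05", "2024-02-01"]),
    "conv_rate": [0.50, 0.75],
})


def value_as_of(df, entity_id, as_of):
    """Return the value that was true at `as_of`, honoring validity intervals."""
    live = df[(df["driver_id"] == entity_id)
              & (df["event_timestamp"] <= as_of)
              & (df["end_event_timestamp"] > as_of)]
    if live.empty:
        return None
    return live.sort_values("event_timestamp").iloc[-1]["conv_rate"]


print(value_as_of(rows, 1001, pd.Timestamp("2024-01-03")))  # 0.5
print(value_as_of(rows, 1001, pd.Timestamp("2024-03-01")))  # None: no longer true
```

A ttl then becomes just one way of populating the validity interval, rather than a separate mechanism bolted onto the event timestamp.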
Just to push back a little on the `event_timestamp` interpretation: the way I think about it, it is very much an event time rather than a processing time, just one that disallows late-arriving data and deletions, at least as far as incremental materialization is concerned. I might be wrong, but I think it was at first designed with streaming feature generation in mind, a case in which using processing time as a proxy for an event time is probably almost always acceptable, but I still wouldn't go as far as calling it a processing time field. A user (having full control of an offline store table) can repopulate past events and make deletions; there's even a feature in feast that an offline store will look at only the latest row (in terms of `load_timestamp`) if multiple rows for the same entity with exactly the same `event_timestamp` are present in the offline dataset. Just the fact that we have both `load_timestamp` and `event_timestamp` is probably enough to know what (at least) the intention behind the fields is. I fully admit that most of this stuff is blatantly ignored by incremental materialization, just like ttl is ignored by the online stores, but I don't think it's a good idea to reinterpret `event_timestamp` because of those behaviors.
Having said all that, I don't think I disagree with the practical implications of what you're saying though. For example, it might be a good idea to redesign incremental materializations in a way that takes into account processing time travel capabilities of some data sources (iceberg, delta) while disallowing their usage in other scenarios (or maybe very explicitly noting about the limitations if disallowing is too drastic).
Even for streaming, due to lag (which may worsen significantly when operational incidents occur), it is important to keep event time separate from processing time. The current implementation of the incremental materialization jobs will completely miss any data that arrived late, leading to data missing from the online store (inconsistency). This should not happen. Users who run standard/default/plain-Jane incremental materialization jobs at regular intervals should be able to rely on their Online Store having the same data that's in the Offline Store, as of the time that they ran the job.
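A toy illustration of that failure mode, assuming a separate processing-time column (here called `write_timestamp`, which is not an actual Feast field) alongside the event timestamp:

```python
import pandas as pd

# Toy source table: the row for driver 1003 arrived late, i.e. its event time
# falls inside a window that had already been materialized when it was written.
source = pd.DataFrame({
    "driver_id":       [1001, 1002, 1003],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-02"]),
    "write_timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04"]),
})

last_run = pd.Timestamp("2024-01-03")  # end of the previous incremental job
this_run = pd.Timestamp("2024-01-05")

# Windowing on event time alone: the late row for driver 1003 is silently
# skipped, so the online store drifts away from the offline store.
by_event_time = source[(source["event_timestamp"] > last_run)
                       & (source["event_timestamp"] <= this_run)]

# Windowing on a processing-time column picks up everything written since the
# previous run, however late the event itself was.
by_processing_time = source[(source["write_timestamp"] > last_run)
                            & (source["write_timestamp"] <= this_run)]

print(by_event_time["driver_id"].tolist())       # []
print(by_processing_time["driver_id"].tolist())  # [1003]
```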
I still wouldn't go as far as calling it a processing time field.
Well, what I'm saying is that currently it's used for both. If you want it to be solely an event time field, then materialization needs to stop using it as a processing time field. I don't think I'm reinterpreting the field, I'm just describing Feast's current (mis-)interpretation.
Just the fact the we have both load_timestamp and event_timestamp is probably enough to know what (at least) the intention behind the fields are.
What `load_timestamp` are you referring to? I don't see it in the code or in the docs.
there's even a feature in feast that an offline store will look at only the latest row (in terms of load_timestamp) if multiple rows for the same entity with exactly the same event_timestamp
I can't find anything about this either...
Having said all that, I don't think I disagree with the practical implications of what you're saying though. For example, it might be a good idea to redesign incremental materializations in a way that takes into account processing time travel capabilities of some data sources (iceberg, delta) while disallowing their usage in other scenarios (or maybe very explicitly noting about the limitations if disallowing is too drastic).
Ya, agreed... at some point I hope to spend some time writing design proposals or code changes around this.
Is your feature request related to a problem? Please describe.
The field name `ttl` used in `FeatureView` is misleading. Time to Live (TTL) is a well-known term that defines a mechanism to automatically expire database records. In Redis, for example, the `EXPIRE` command defines a time to live for a given key.
When I used that FeatureView.ttl configuration for the first time, I expected to set some expiration time (or TTL) for keys written into online stores. The documentation states something like this here.
However, FeatureView's TTL is not used in that way. It's used to define the `start_date` for extracting data from the offline store during the materialization step here. The actual key ttl configuration used to expire entities is `key_ttl_seconds` at the online store level here.
I had to explain this several times to different people because of this misunderstanding. I think this is also kind of mentioned here and here.
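For context, here is roughly how the two settings are split today, sketched against the current Feast API as I understand it (exact signatures and config keys may vary between versions):

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32

driver = Entity(name="driver", join_keys=["driver_id"])

# `ttl` here does NOT expire keys in the online store. In practice it bounds
# how far back Feast looks for feature rows during historical retrieval and
# drives the start of the materialization window.
driver_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[Field(name="conv_rate", dtype=Float32)],
    source=FileSource(
        path="driver_stats.parquet",
        timestamp_field="event_timestamp",
    ),
)

# Actual online-store expiry is configured separately, per store, e.g. for the
# Redis online store in feature_store.yaml (values illustrative):
#
#   online_store:
#     type: redis
#     connection_string: localhost:6379
#     key_ttl_seconds: 86400
```

So the `ttl` on the FeatureView and the online store's `key_ttl_seconds` are unrelated knobs, which is the source of the confusion described here.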
Describe the solution you'd like
My suggestion is renaming this field to something like `materialization_lookback_interval`, `lookback_interval` or `lookback_delta`, since this is the actual meaning of this configuration as far as I understand. Also update docs accordingly.