breno-costa opened 6 months ago
I need to spend more time thinking about this, but I do agree the `ttl` at the `FeatureView` level is misleading, as I had this exact experience in my last role and it caused me some headaches. I think renaming it to follow industry conventions would be good.
I'm not a big fan of a `lookback` name, tbh. I think it only makes sense if the online store is the only component we're focusing on. If we set aside online stores for a minute, during a `get_historical_features` call the way the offline store handles the rows sounds compatible with the meaning of `ttl` to me. When a client asks for a feature value for a specific point in time, only the values that haven't expired at that time are considered. The fact that it's doing a "lookback" to do that is an implementation detail.
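To make that concrete, here is a minimal sketch of the point-in-time semantics described above, using plain pandas rather than Feast's actual offline store code (the column names and helper function are purely illustrative):

```python
from datetime import timedelta

import pandas as pd


# Illustrative stand-in for how ttl behaves in a point-in-time lookup;
# this is plain pandas, not Feast's actual offline store implementation.
def point_in_time_lookup(feature_df, request_ts, ttl):
    # Only rows that happened at or before the requested time are candidates...
    candidates = feature_df[feature_df["event_timestamp"] <= request_ts]
    # ...and only if they have not expired as of that time (the "lookback").
    candidates = candidates[candidates["event_timestamp"] >= request_ts - ttl]
    if candidates.empty:
        return None  # every value had already expired at request_ts
    # The freshest non-expired value wins.
    return candidates.sort_values("event_timestamp").iloc[-1]


features = pd.DataFrame({
    "driver_id": [1001, 1001],
    "event_timestamp": pd.to_datetime(["2024-01-01 09:00", "2024-01-03 09:00"]),
    "conv_rate": [0.50, 0.75],
})

# As of Jan 4 the Jan 3 value is still valid; as of Jan 10 it has expired.
print(point_in_time_lookup(features, pd.Timestamp("2024-01-04 09:00"), timedelta(days=2)))
print(point_in_time_lookup(features, pd.Timestamp("2024-01-10 09:00"), timedelta(days=2)))
```

The second lookup returns None because, as of the requested timestamp, every value is older than the ttl; that is the "expiry" behavior the offline path already has.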
I think my preferred approach would be to fix the underlying issue rather than change the parameter name. The main problem here is that, as the issue indicated, `ttl` is not handled the same way in the offline and online flows. It's ignored in online stores 😄. I understand there are good reasons why it might be hard to actually delete expired rows from online storage during materialization, but what we can do is discard expired values afterwards, in the online store logic itself. Once we have that, `ttl` wouldn't be so misleading anymore.
Fixing `ttl` to behave as expected would be ideal. I haven't used the offline store as much, but if it's using the `ttl` as expected then I agree with your approach.
If you look at the documentation it says:
Feature views consist of: ... (optional) a TTL, which limits how far back Feast will look when generating historical datasets
According to Wikipedia:
Time to live (TTL) or hop limit is a mechanism which limits the lifespan or lifetime of data in a computer or network...The Time to Live is an indication of an upper bound on the lifetime of an internet datagram.
And in HTTP:
Time to live may also be expressed as the date and time on which a record expires. The Expires: header in HTTP responses, the Cache-Control: max-age header field in both requests and responses and the expires field in HTTP cookies express time-to-live in this way.
So it would be rational for the `ttl` for an online Feature View to behave as an "upper bound on the lifetime of data in a database."
My suggestion:
- Rename `ttl` to `offline_store_ttl` or `offline_ttl` to make this name more intuitive and explicit
- Add `online_store_ttl` or `online_ttl` and replicate the HTTP behavior by:
  1. Returning `None` or `Expired` along with some metadata when calling `get_online_features`
  2. Deleting expired records in `get_online_features` after the read is received
  3. Adding a `feast expire feature_view` command that users could run on some schedule

Thoughts @tokoko @HaoXuAI?
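To illustrate what the read-time expiry in options 1 and 2 could look like, here is a rough sketch; the record shape and function name are hypothetical, not Feast's actual online store API:

```python
from datetime import datetime, timedelta, timezone


def filter_expired(stored, online_ttl, now=None):
    """Return the stored value only if it has not outlived online_ttl.

    `stored` is a hypothetical record shape: {"value": ..., "event_timestamp": ...}.
    This mimics the HTTP max-age idea: expiry is enforced on read, even though
    the row may physically still exist in the online store.
    """
    if stored is None:
        return None
    now = now or datetime.now(timezone.utc)
    if now - stored["event_timestamp"] > online_ttl:
        return None  # could instead return an "Expired" marker plus metadata
    return stored["value"]


stored = {"value": 0.75,
          "event_timestamp": datetime(2024, 1, 1, tzinfo=timezone.utc)}

print(filter_expired(stored, timedelta(days=7),
                     now=datetime(2024, 1, 5, tzinfo=timezone.utc)))  # 0.75
print(filter_expired(stored, timedelta(days=7),
                     now=datetime(2024, 2, 1, tzinfo=timezone.utc)))  # None
```

Whether the caller sees `None` or an explicit "Expired" marker with metadata is exactly the naming/behavior question raised above.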
Regarding option 2, there are situations where features will never be fetched again for a given entity key.
Example: imagine that you have features calculated for a customer entity to be used in your product. However, some customers cancel their accounts, so you no longer need to run inference or generate features for them.
Materialization will no longer update features for those customers, and inference endpoints will not call the `get_online_features` function for them anymore, so the old data will remain in the online store forever unless some cleanup is done.
I'm in favor of a mix between options 2 and 3:
- We store ttl information (`expire_date`, for example) in the online store for every entity in the feature view during materialization.
- We add a `feast cleanup` command that will physically remove all expired data from an online store. (This is only relevant for an online store; in the offline store nothing's really ever expired, as the user might always want to query past information.) This command should be only for housekeeping and not affect online store behavior at all; in other words, it will only remove the feature values from the online store that would be discarded by the online store read method anyway.

Yeah, agreed. Approach (2) + (3) is the right one.
Only thing left to decide is naming conventions... do you all have any opinions here?
For example, we could continue to keep the name `ttl` and just make the behavior more obvious (and document it) within each respective function call (i.e., making `ttl` behave as expected for `get_online_features` and `get_offline_features`).
Or we could go the route of `online_ttl` and renaming the current `ttl` to `offline_ttl`.
And I do like `feast cleanup`, but that may also make the user think more is happening than dropping records. Not sure.
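For what it's worth, a minimal sketch of what such a cleanup could amount to, assuming expiry metadata has been written at materialization time (the in-memory store and field names here are hypothetical):

```python
from datetime import datetime, timezone

# Hypothetical in-memory stand-in for an online store keyed by entity key.
# Each row carries the expire_date metadata written at materialization time.
online_store = {
    ("driver", 1001): {"conv_rate": 0.75,
                       "expire_date": datetime(2024, 1, 8, tzinfo=timezone.utc)},
    ("driver", 1002): {"conv_rate": 0.42,
                       "expire_date": datetime(2024, 3, 1, tzinfo=timezone.utc)},
}


def cleanup(store, now=None):
    """Physically drop rows whose expire_date has passed.

    Pure housekeeping: it removes only rows that a ttl-aware read would
    already be discarding, so read results are unchanged before and after.
    """
    now = now or datetime.now(timezone.utc)
    expired_keys = [key for key, row in store.items() if row["expire_date"] <= now]
    for key in expired_keys:
        del store[key]
    return len(expired_keys)


removed = cleanup(online_store, now=datetime(2024, 2, 1, tzinfo=timezone.utc))
print(removed)             # 1 (driver 1001 had expired)
print(list(online_store))  # [('driver', 1002)]
```

Since it only drops rows a ttl-aware read would already ignore, the command is pure housekeeping, which argues for a modest name.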
- We store ttl information (expire_date for example) in the online store for every entity in the feature view during materialization.
I would recommend we store the `expire_date` in the feature view as metadata. Changes to the expiration will be more straightforward that way.
@franciscojavierarceo For some reason, I assumed an entity timestamp was not part of the online store, my bad. If we already have an entity timestamp in there and there's a ttl field in a feature view, that first point is redundant: the online store can decide whether values are expired or not based on those two.
I believe it probably would be an improvement to permit the offline data to include an explicit "end_event_timestamp" column, which users could leverage to achieve a TTL effect if they want it.
I do not think it would be good to continue calculating TTL by adding a duration to the timestamp that is currently used to keep track of processing-time logic such as incremental materializations. I think the current "TTL" based mechanic for the offline store data (including historical queries and materialization) should be regarded as a quirk or compromise of particular offline storage implementations, and should therefore be eventually moved out of Feast's general API contract in order to make room for more capable implementations. Since it operates on the same timestamp as the incremental materializations do, it means that the concept of processing time (the time that the system handled the event - for example, the time that the row was written to the store) is unrecoverably mixed/confused with event time (the time the event actually occurred in the real world).
For example, an incremental materialization job is interested in knowing what has changed since the last processing time that the previous materialization job completed. But, from a query perspective, we may not care about the order in which the data arrived, or the particular delays that each row was subject to. The querier may only care about the event time.
The existing TTL setting for the offline store is also inconsistent with the TTL configuration/functionality (if supported) of the online store. This can be confusing, and it also means that the online store and offline store disagree in their query responses, which ideally should not happen. Look, for example, at the Redis implementation.
I am hopeful that by adding support for newer offline store technology such as Apache Iceberg (which is capable of efficiently representing non-destructive deletions and responding to time-travel queries for things like incremental materialization) and by carefully implementing Feast's corresponding online & offline stores, we can work toward a more accurate and consistent data representation (one which, for example, supports deletes not just inserts, updates, and TTLs). Basically, some data sets might find a use for TTL (to expire data in the future). But, we should regard TTL as a special case of a more general idea of facts ceasing to be true.
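To sketch the "facts ceasing to be true" framing in code (plain pandas, with `end_event_timestamp` taken from the suggestion above and everything else illustrative):

```python
import pandas as pd

# Offline rows carrying an explicit validity interval. A ttl is just the special
# case where end_event_timestamp = event_timestamp + ttl; deletions and
# corrections are the general case of a fact ceasing to be true.
rows = pd.DataFrame({
    "driver_id": [1001, 1001],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-05"]),
    "end_event_timestamp": pd.to_datetime(["2024-01-05", "2024-02-01"]),
    "conv_rate": [0.50, 0.75],
})


def value_as_of(df, entity_id, as_of):
    """Return the value that was true at `as_of`, honoring validity intervals."""
    live = df[(df["driver_id"] == entity_id)
              & (df["event_timestamp"] <= as_of)
              & (df["end_event_timestamp"] > as_of)]
    if live.empty:
        return None
    return live.sort_values("event_timestamp").iloc[-1]["conv_rate"]


print(value_as_of(rows, 1001, pd.Timestamp("2024-01-03")))  # 0.5
print(value_as_of(rows, 1001, pd.Timestamp("2024-03-01")))  # None: no longer true
```

A ttl then becomes just one way of populating the validity interval, rather than a separate mechanism bolted onto the event timestamp.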
Just to push back a little on the `event_timestamp` interpretation: the way I think about it, it is very much an event time rather than a processing time, just one that disallows late-arriving data and deletions, at least as far as incremental materialization is concerned. I might be wrong, but I think it was at first designed with streaming feature generation in mind, a case in which using processing time as a proxy for an event time is probably almost always acceptable, but I still wouldn't go as far as calling it a processing time field. A user (having full control of an offline store table) can repopulate past events and make deletions; there's even a feature in feast that an offline store will look at only the latest row (in terms of `load_timestamp`) if multiple rows for the same entity with exactly the same `event_timestamp` are present in the offline dataset. Just the fact that we have both `load_timestamp` and `event_timestamp` is probably enough to know what (at least) the intention behind the fields is. I fully admit that most of this stuff is blatantly ignored by incremental materialization, just like ttl is ignored by the online stores, but I don't think it's a good idea to reinterpret `event_timestamp` because of those behaviors.
Having said all that, I don't think I disagree with the practical implications of what you're saying though. For example, it might be a good idea to redesign incremental materializations in a way that takes into account processing time travel capabilities of some data sources (iceberg, delta) while disallowing their usage in other scenarios (or maybe very explicitly noting about the limitations if disallowing is too drastic).
Even for streaming, due to lag (which may worsen significantly when operational incidents occur), it is important to keep event time separate from processing time. The current implementation of the incremental materialization jobs will completely miss any data that arrived late, leading to data missing from the online store (inconsistency). This should not happen. Users who run standard/default/plain-Jane incremental materialization jobs at regular intervals should be able to rely on their Online Store having the same data that's in the Offline Store, as of the time that they ran the job.
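A toy illustration of that failure mode, assuming a separate processing-time column (here called `write_timestamp`, which is not an actual Feast field) alongside the event timestamp:

```python
import pandas as pd

# Toy source table: the row for driver 1003 arrived late, i.e. its event time
# falls inside a window that had already been materialized when it was written.
source = pd.DataFrame({
    "driver_id":       [1001, 1002, 1003],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-02"]),
    "write_timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04"]),
})

last_run = pd.Timestamp("2024-01-03")  # end of the previous incremental job
this_run = pd.Timestamp("2024-01-05")

# Windowing on event time alone: the late row for driver 1003 is silently
# skipped, so the online store drifts away from the offline store.
by_event_time = source[(source["event_timestamp"] > last_run)
                       & (source["event_timestamp"] <= this_run)]

# Windowing on a processing-time column picks up everything written since the
# previous run, however late the event itself was.
by_processing_time = source[(source["write_timestamp"] > last_run)
                            & (source["write_timestamp"] <= this_run)]

print(by_event_time["driver_id"].tolist())       # []
print(by_processing_time["driver_id"].tolist())  # [1003]
```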
I still wouldn't go as far as calling it a processing time field.
Well, what I'm saying is that currently it's used for both. If you want it to be solely an event time field, then materialization needs to stop using it as a processing time field. I don't think I'm reinterpreting the field, I'm just describing Feast's current (mis-)interpretation.
Just the fact the we have both load_timestamp and event_timestamp is probably enough to know what (at least) the intention behind the fields are.
What `load_timestamp` are you referring to? I don't see it in the code or in the docs.
there's even a feature in feast that an offline store will look at only the latest row (in terms of load_timestamp) if multiple rows for the same entity with exactly the same event_timestamp
I can't find anything about this either...
Having said all that, I don't think I disagree with the practical implications of what you're saying though. For example, it might be a good idea to redesign incremental materializations in a way that takes into account processing time travel capabilities of some data sources (iceberg, delta) while disallowing their usage in other scenarios (or maybe very explicitly noting about the limitations if disallowing is too drastic).
Ya, agreed... at some point I hope to spend some time writing design proposals or code changes around this.
Is your feature request related to a problem? Please describe.
The field name `ttl` used in `FeatureView` is misleading. Time to Live (TTL) is a well-known term that defines a mechanism to automatically expire database records. In Redis, for example, the `EXPIRE` command defines a time to live for a given key.
When I used that FeatureView.ttl configuration for the first time, I expected to set some expiration time (or TTL) for keys written into online stores. The documentation states something like this here.
However, FeatureView's TTL is not used in that way. It's used to define the `start_date` for extracting data from the offline store during the materialization step here. The actual key ttl configuration used to expire entities is `key_ttl_seconds` at the online store level here.
I had to explain this several times to different people because of this misunderstanding. I think this is also kind of mentioned here and here.
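For context, here is roughly how the two settings are split today, sketched against the current Feast API as I understand it (exact signatures and config keys may vary between versions):

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32

driver = Entity(name="driver", join_keys=["driver_id"])

# `ttl` here does NOT expire keys in the online store. In practice it bounds
# how far back Feast looks for feature rows during historical retrieval and
# drives the start of the materialization window.
driver_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[Field(name="conv_rate", dtype=Float32)],
    source=FileSource(
        path="driver_stats.parquet",
        timestamp_field="event_timestamp",
    ),
)

# Actual online-store expiry is configured separately, per store, e.g. for the
# Redis online store in feature_store.yaml (values illustrative):
#
#   online_store:
#     type: redis
#     connection_string: localhost:6379
#     key_ttl_seconds: 86400
```

So the `ttl` on the FeatureView and the online store's `key_ttl_seconds` are unrelated knobs, which is the source of the confusion described here.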
Describe the solution you'd like
My suggestion is renaming this field to something like `materialization_lookback_interval`, `lookback_interval` or `lookback_delta`, since this is the actual meaning of this configuration as far as I understand. Also update docs accordingly.