feast-dev / feast

The Open Source Feature Store for Machine Learning
https://feast.dev
Apache License 2.0

Latest Only option for Historical Retrieval #1687

Open 8bit-pixies opened 3 years ago

8bit-pixies commented 3 years ago

Is your feature request related to a problem? Please describe.

In many batch workflows, it is worthwhile to retrieve only the latest features per entity. This is useful for both production and backtesting purposes.

E.g. if we have an hourly/daily batch job that goes through our whole customer base to find fraudulent customers, we wouldn't really use the online store for this.

Describe the solution you'd like

Allow users to specify that an entity set extracted from a feature view should be deduplicated to the latest value per entity. Depends on #1611

my_daily_batch_scoring_df = store.get_latest_features(
    entity_df="my_df",
    feature_refs=[...],
)

Additional context Linked issue #1611

woop commented 3 years ago

Thanks for raising this @charliec443

This is useful for both production and backtesting purposes

I think it would be useful to be explicit in your problem statement. What aspect of the existing API makes it unsuitable (or inconvenient) for your use case? Why are the latest values used for backtesting, and not historical values? I would have expected backtesting to use historical values.

if we have an hourly/daily batch job that goes through our whole customer base to find fraudulent customers, we wouldn't really use the online store for this

The first part of this sentence doesn't really connect to the second. I'm a bit confused as to what you mean.

Allow users to specify that an entity set extracted from a feature view should be deduplicated to the latest value per entity. Depends on #1611

@MattDelac is this API moving closer to what you are using internally?

MattDelac commented 3 years ago

@MattDelac is this API moving closer to what you are using internally?

Not really

But we have the same need for batch predictions, where we want to predict on the latest values of the features in batch. Therefore we could bypass the historical retrieval logic and use a SQL template that is much more efficient.

In terms of API, I would rather have a separate method, e.g. store.get_latest_features(), than a boolean parameter. And as I said, store.get_latest_features() could be backed by a very efficient SQL query.

Hope that makes sense

woop commented 3 years ago

@MattDelac is this API moving closer to what you are using internally?

Not really

But we have the same need for batch predictions, where we want to predict on the latest values of the features in batch. Therefore we could bypass the historical retrieval logic and use a SQL template that is much more efficient.

In terms of API, I would rather have a separate method, e.g. store.get_latest_features(), than a boolean parameter. And as I said, store.get_latest_features() could be backed by a very efficient SQL query.

Hope that makes sense

store.get_latest_features() could be a shared method that is also used for materialization into the online store. Seems like a good idea to me.
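
For illustration, the kind of deduplication query get_latest_features() could template (a rough sketch assuming a BigQuery-style offline store; the table and column names are made up):

# Sketch: keep only the newest row per entity; no point-in-time join needed.
latest_features_sql = """
SELECT * EXCEPT (row_num)
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY CUST_NUM
            ORDER BY event_timestamp DESC
        ) AS row_num
    FROM `project.dataset.customer_features`
)
WHERE row_num = 1
"""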

8bit-pixies commented 3 years ago

The first part of this sentence doesn't really connect to the second. I'm a bit confused as to what you mean.

That's fair, because I don't have a clear vision right now. Where the existing API might be clunky for backtesting in batch is that we might want to partition by a whole feature view, which can't easily be filtered by time (and I'm more than happy to be challenged that this is "too hard" or that I'm doing it wrong).

Prediction problem: fraud detection over the customer base

Input feature groups:

Sample data:

Customer demographics

CUST_NUM  GENDER  START_DATE
123       F       2001-01-01
456       M       2001-01-01
789       NA      2001-01-01

Customer Event

CUST_NUM  EVENT  EVENT_DATE
123       1      today - 10 days
456       10     today - 10 days
789       100    today - 200 days

Customer Call Transcript

CUST_NUM  Transcript   EVENT_DATE
789       Hello World  today - 10 days

In this example, for backtesting against data from "10 days ago", we want to filter by our whole customer base (i.e. use the "customer demographics" feature view), but when we get the features out based on my sample data, each of customers 123, 456, and 789 should appear in the dataset despite not being updated in the main view.

After thinking out loud, maybe this is a "too hard, won't do". Or there is an entirely different solution, which is to generate a dataset with CUST_NUM, SNAPSHOT_DATE as an entity_df instead (sketched below).
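
A rough sketch of that entity_df route (assuming store is an existing FeatureStore; the snapshot date and feature refs are illustrative):

import pandas as pd

# One row per customer per snapshot date; get_historical_features then joins
# feature values as of each SNAPSHOT_DATE.
entity_df = pd.DataFrame(
    {
        "CUST_NUM": [123, 456, 789],
        "event_timestamp": pd.to_datetime(["2021-07-01"] * 3),  # SNAPSHOT_DATE
    }
)

training_df = store.get_historical_features(
    entity_df=entity_df,
    feature_refs=["customer_demographics:GENDER"],  # illustrative
).to_df()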

Though store.get_latest_features() may be the more appropriate start to this challenge.

woop commented 3 years ago

Thanks for this @charliec443

dataset despite not being updated in the main view

What is the "main view" here?

8bit-pixies commented 3 years ago

Sorry about that, I wasn't being clear here, was I?

I'll try to reframe this problem through the lens of what I've observed, and it might just come down to "this is a robotic process automation issue, not a Feast issue" + "data scientists need to write custom code", or "this is some kind of online transformation feature that would come in the future"...

Problem Statement: for our marketing model, we:

  1. filter by customers who have had an "interaction" with us in the last 10 days
  2. perform model scoring for back test

The challenge here is that an "interaction" is based on data in two tables. So perhaps a more appropriate Feast solution is to create a new feature table(?) that has the combined interaction information to filter on, before grabbing data from the respective "event" and "call transcript" tables.

Using only Event data

get_latest("customer_event", start_date="today - 10", end_date="today")

Would inadvertently filter out customer 789, whose last event was 200 days ago

Using only call transcript

get_latest("call_transcript", start_date="today - 10", end_date="today")

Would only keep customer 789

Possible issues with "custom transformation"

If you had a custom transformation for the purpose of filtering, then this could get really messy in production (as always...), since the tables you would generate would be specific to this pipeline; having 100 models would then lead to 100 such tables. Perhaps this is a necessary evil to keep the feature store API simple.

It would then be:

Customer demographics

CUST_NUM  GENDER  START_DATE
123       F       2001-01-01
456       M       2001-01-01
789       NA      2001-01-01

Customer Event

CUST_NUM  EVENT  EVENT_DATE
123       1      today - 10 days
456       10     today - 10 days
789       100    today - 200 days

Customer Call Transcript

CUST_NUM  Transcript   EVENT_DATE
789       Hello World  today - 10 days

My custom transformation to derive how a training dataset gets automatically filtered (customer_last_interaction):

CUST_NUM  INTERACTION_DATE  INTERACTION_TYPE
123       today - 10 days   EVENT
456       today - 10 days   EVENT
789       today - 10 days   CALL
789       today - 200 days  EVENT

Then we would create the training set via:

get_latest("customer_last_interaction", start_date="today-10", end_date="today")

Other Solutions

Perhaps the most obvious one is to support a list of entities which are "magically" concatenated by entity id + event timestamp only. This just creates a mess if people mix lists of strings and lists of dataframes, especially if the views/entity_df have different columns.

This might just be a topic to be discussed later... it certainly doesn't need to be "solved" before having a solution which tackles the majority of use cases.

8bit-pixies commented 3 years ago

Matt's comment here: https://github.com/feast-dev/feast/issues/1611#issuecomment-880872664 touches on this in a way.

In this setting, we would infer the entity keys from the requested features (assuming all entity keys are used), and first create an entity × event_timestamp dataframe, which is then used as the basis for the get_historical_features method.

This approach allows mixing of entity "views", though this may be counterintuitive (can be fixed with documentation!).

Trying to explain this in words is proving to be overly complicated in my head though (apologies if it doesn't make total sense)...

Basically it boils down to this:
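
A rough pandas sketch of the idea (again assuming store is an existing FeatureStore; the feature refs and dates are illustrative):

import pandas as pd

# Entity keys inferred from the requested features (written out by hand here).
entities = pd.DataFrame({"CUST_NUM": [123, 456, 789]})

# Snapshot timestamps to backtest at.
timestamps = pd.DataFrame(
    {"event_timestamp": pd.to_datetime(["2021-07-01", "2021-07-08"])}
)

# Cross join: one row per (entity, event_timestamp) pair, used as the entity_df.
entity_df = entities.merge(timestamps, how="cross")

training_df = store.get_historical_features(
    entity_df=entity_df,
    feature_refs=["customer_demographics:GENDER"],  # illustrative
).to_df()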

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

MattDelac commented 2 years ago

I still believe that this is an important feature for batch prediction pipelines. In that case you need the latest values from the offline store.

You also need to keep the idea of an "entity_df", which we don't have with the pull_latest_from_table_or_query() method.
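
For example, a latest-per-entity query restricted to an entity_df could look like this (only a sketch; the table and column names are made up):

latest_for_entities_sql = """
WITH ranked AS (
    SELECT
        f.*,
        ROW_NUMBER() OVER (
            PARTITION BY f.CUST_NUM
            ORDER BY f.event_timestamp DESC
        ) AS row_num
    FROM `project.dataset.customer_features` AS f
    JOIN `project.dataset.entity_df` USING (CUST_NUM)
)
SELECT * EXCEPT (row_num)
FROM ranked
WHERE row_num = 1
"""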

adchia commented 2 years ago

@vas28r13 note: this is probably the better approach and mirrors what we discussed.

lokeshrangineni commented 8 months ago

I'm new to the Feast codebase and want to contribute to the project. If no one has any objection, I would like to start analyzing this task and implementing it, if it is a good one for a newbie like me.