Is your feature request related to a problem? Please describe.
In our current setup, we use the following pipelines to construct the different sources and train a model on transaction payment data:
- Feature engineering ETL: `transaction` table -> query and save feature values to the source for `company_stats:total_transactions_last_7d`
- Feature engineering ETL: `transaction` table -> query and save feature values to the source for `payment_facts:amount`
- Model training: `transaction` table -> query `entity_df` -> Feast join to get the training dataset for features `["company_stats:total_transactions_last_7d", "payment_facts:amount"]`.
What we now see is that the "factual feature" that we have saved in the `payment_facts` source table is oftentimes also available in the same row when we query the same source for `entity_df`. This means that currently we explicitly drop these factual columns (`amount`) from the query that returns our `entity_df`, only to query them back using Feast during model training.
This works, but the pain points are that:
- We have to perform additional, unnecessary joins for such facts (and data gets duplicated).
- It is important that the event timestamps in `entity_df` and `payment_facts` rows match exactly; otherwise we miss the fact during the Feast join (see the sketch below). We solve this by making sure that we always query the same source table, `transaction`.
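To illustrate the second pain point, here is a minimal pandas sketch of how a timestamp mismatch makes a point-in-time join miss the fact (the timestamps are made up, and `merge_asof` stands in for the Feast join):

```python
import pandas as pd

# The entity_df row carries a timestamp just before the payment_facts row.
entity_df = pd.DataFrame({
    "payment": [1],
    "event_timestamp": [pd.Timestamp("2021-05-05 09:59")],
})
payment_facts = pd.DataFrame({
    "payment": [1],
    "amount": [10.0],
    "event_timestamp": [pd.Timestamp("2021-05-05 10:00")],
})

# A point-in-time join only takes feature rows at or before the entity
# timestamp, so the 09:59 entity row misses the 10:00 fact: amount is NaN.
joined = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    payment_facts.sort_values("event_timestamp"),
    on="event_timestamp",
    by="payment",
)
print(joined)  # amount -> NaN
```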
Benefits of the current approach (saving features explicitly with their own sources) include:
- A single query interface for all types of features.
- `entity_df` is not strictly defined; not defining a feature view for a factual feature means those features are not defined in the feature store anywhere.
Describe the solution you'd like
I would like to discuss what we can do to avoid this pattern where we already have the data on the `entity_df`, but then drop those columns from the `entity_df`, only to query them back from the offline feature store later.
Some options I see include:
1. Do not define these "factual" features in the feature store, and only define the "aggregate" features in the store.
   - Downside is that there is no definition anywhere of what those features are. This means the feature store will no longer have a complete view of what features exist in the platform.
2. Query the `entity_df` directly from the source defined for the `FeatureView` (sketched briefly after this list).
   - Currently this does not seem to fit very well with the existing objects, as it would mean that you have "special" `FeatureView` sources which are also used as a source of truth for entities during training. I would prefer to use the actual source of truth (the table the source originates from during ETL, `transaction` in this case) to construct the `entity_df`.
3. Extend the `FeatureService`, or create a new object, to capture the information that some features are "provided" with the query to the store.
   - This means that they are at least documented somewhere in the feature store.
   - These features are also often the ones that we will not look up from the online store during serving (since they are provided directly on the user request). This means that during serving too, they are in a way "provided" on demand, similar to how they might be "provided" as part of the `entity_df`.
   - Putting a reference to the term "OnDemandFeature" from the FeatureService RFC here, since I believe it might be used to describe the same concept as what I mean with "provided".
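For contrast, option 2) would look roughly as follows (hypothetical; it assumes the `payment_facts` batch source is directly queryable as a table, here called `payment_facts_source`):

```python
# Option 2 sketch: build entity_df from the FeatureView's own source instead
# of from the upstream transaction table. The FeatureView source then doubles
# as the source of truth for entities, which is what feels off about it.
entity_df = ...  # SELECT payment, amount, timestamp FROM payment_facts_source;
```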
It seems to me that 3) could be a viable approach. We could change the interface (see the `FeatureService` constructor call below) to look something like the following:
"""
transaction table:
| payment | company | amount | timestamp |
|--------:|---------|--------|------------------|
| 1 | A | 10 | 2021-05-05 10:00 |
| 2 | B | 20 | 2021-05-05 11:00 |
| 3 | C | 10 | 2021-05-05 12:00 |
| 4 | C | 30 | 2021-05-05 13:00 |
"""
company_stats_fv = FeatureView(
    name="company_stats",
    entities=["company"],
    features=[
        Feature(name="total_transactions_last_7d", dtype=ValueType.FLOAT),
    ],
)
"""
company_stats_fv source after ETL:
| company | total_transactions_last_7d | timestamp |
|---------|----------------------------|------------------|
| A | 1 | 2021-05-05 15:00 |
| B | 1 | 2021-05-05 15:00 |
| C | 2 | 2021-05-05 15:00 |
| C | 2 | 2021-05-05 15:00 |
"""
# Query the transaction table to get the relevant entities and the on demand features for the model
entity_df = ...  # SELECT payment, company, amount AS payment__amount, timestamp FROM transaction;
"""
entity_df result, contains a feature as well:
| payment | company | payment__amount | timestamp |
|--------:|---------|-----------------|------------------|
| 1 | A | 10 | 2021-05-05 10:00 |
| 2 | B | 20 | 2021-05-05 11:00 |
| 3 | C | 10 | 2021-05-05 12:00 |
| 4 | C | 30 | 2021-05-05 13:00 |
"""
# Query Feast with the FeatureService
feature_service = FeatureService(
    name="team_model",
    features=[
        OnDemandFeature("payment__amount"),
        Feature("company__total_transactions_last_7d"),
    ],
)
training_df = store.get_historical_features(
    feature_service=feature_service,
    entity_df=entity_df,
)
"""
training dataset:
| payment | company | payment__amount | company__total_transactions_last_7d | timestamp |
|--------:|---------|-----------------|-------------------------------------|------------------|
| 1 | A | 10 | 1 | 2021-05-05 10:00 |
| 2 | B | 20 | 1 | 2021-05-05 11:00 |
| 3 | C | 10 | 2 | 2021-05-05 12:00 |
| 4 | C | 30 | 2 | 2021-05-05 13:00 |
"""
What this changes is that we:
- Do not duplicate the data, and instead query it together with the `entity_df`.
- Do not register the `OnDemandFeature` with a `FeatureView`, but still register it somewhere, namely with the `FeatureService`.
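To make the serving-side analogy concrete, here is a rough sketch of what serving could look like with a "provided" feature (the request shape and the merge logic are assumptions on my side; only the aggregate feature is fetched from the online store, and the exact `get_online_features` signature is illustrative):

```python
# Hypothetical serving flow: the "provided" fact arrives on the request,
# only the aggregate feature is looked up from the online store.
request = {"payment": 4, "company": "C", "payment__amount": 30.0}

online_features = store.get_online_features(
    features=["company_stats:total_transactions_last_7d"],
    entity_rows=[{"company": request["company"]}],
    full_feature_names=True,
).to_dict()

model_input = {
    "payment__amount": request["payment__amount"],  # provided, not looked up
    "company__total_transactions_last_7d": online_features[
        "company_stats__total_transactions_last_7d"
    ][0],
}
```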
Describe alternatives you've considered
See the proposed options 1) and 2) above.
Additional context
An example of the existing flow follows, saving data and then retrieving it again. The difference from the proposed flow is in how we query for the `entity_df`.
"""
transaction table:
| payment | company | amount | timestamp |
|--------:|---------|--------|------------------|
| 1 | A | 10 | 2021-05-05 10:00 |
| 2 | B | 20 | 2021-05-05 11:00 |
| 3 | C | 10 | 2021-05-05 12:00 |
| 4 | C | 30 | 2021-05-05 13:00 |
"""
payment_facts_fv = FeatureView(
    name="payment_facts",
    entities=["payment"],
    features=[
        Feature(name="amount", dtype=ValueType.FLOAT),
    ],
)
"""
payment_facts source after running ETL:
| payment | amount | timestamp |
|--------:|--------|------------------|
| 1 | 10 | 2021-05-05 10:00 |
| 2 | 20 | 2021-05-05 11:00 |
| 3 | 10 | 2021-05-05 12:00 |
| 4 | 30 | 2021-05-05 13:00 |
"""
company_stats_fv = FeatureView(
    name="company_stats",
    entities=["company"],
    features=[
        Feature(name="total_transactions_last_7d", dtype=ValueType.FLOAT),
    ],
)
"""
company_stats source after running ETL:
| company | total_transactions_last_7d | timestamp |
|---------|----------------------------|------------------|
| A | 1 | 2021-05-05 15:00 |
| B | 1 | 2021-05-05 15:00 |
| C | 2 | 2021-05-05 15:00 |
| C | 2 | 2021-05-05 15:00 |
"""
# Query the transaction table to get the relevant entities for the model
entity_df = ...  # SELECT payment, company, timestamp FROM transaction;
"""
entity_df:
| payment | company | timestamp |
|--------:|---------|------------------|
| 1 | A | 2021-05-05 10:00 |
| 2 | B | 2021-05-05 11:00 |
| 3 | C | 2021-05-05 12:00 |
| 4 | C | 2021-05-05 13:00 |
"""
# Query Feast
training_df = store.get_historical_features(
    entity_df=entity_df,
    feature_refs=[
        "payment_facts:amount",
        "company_stats:total_transactions_last_7d",
    ],
)
"""
training dataset:
| payment | company | payment__amount | company__total_transactions_last_7d | timestamp |
|--------:|---------|-----------------|-------------------------------------|------------------|
| 1 | A | 10 | 1 | 2021-05-05 10:00 |
| 2 | B | 20 | 1 | 2021-05-05 11:00 |
| 3 | C | 10 | 2 | 2021-05-05 12:00 |
| 4 | C | 30 | 2 | 2021-05-05 13:00 |
"""
Curious to hear any thoughts on this.