feast-dev / feast

The Open Source Feature Store for Machine Learning
https://feast.dev
Apache License 2.0

Features already present on entity_df queried back from FeatureView source #1773

Open Joostrothweiler opened 3 years ago

Joostrothweiler commented 3 years ago

Is your feature request related to a problem? Please describe.

In our current setup, we use the following pipelines to construct the different sources and train a model on transaction payment data:

  1. Feature engineering ETL: "transaction" table -> query and save feature values to source for company_statistics:transaction_volume_last_7d
  2. Feature engineering ETL: "transaction" table -> query and save feature values to source for payment_facts:amount
  3. Model training: "transaction" table -> query entity_df -> Feast join to get training dataset for features: ["company_statistics:transaction_volume_last_7d", "payment_facts:amount"].

What we now see is that the "factual feature" we saved in the payment_facts source table is often already present in the same row when we query that same source to build the entity_df. Currently we explicitly drop these factual columns (amount) from the query that returns our entity_df, only to query them back through Feast during model training.

This works, but the pain points are:

  1. We have to perform additional, unnecessary joins for such facts (and the data gets duplicated).
  2. The event timestamps in the entity_df and payment_facts rows must match exactly; otherwise we miss the fact during the Feast join (we work around this by always querying the same source table, transaction).

Benefits of the current approach (saving features explicitly with their own sources) include:

  1. Single query interface for all types of features
  2. entity_df is not strictly defined, so if we did not define a feature view for a factual feature, that feature would not be registered anywhere in the feature store.

Describe the solution you'd like

I would like to discuss how we can avoid this pattern, where we already have the data on the entity_df, but drop those columns from it only to query them back from the offline feature store later.

Some options I see include:

  1. Do not define these "factual" features in the feature store at all, and only define the "aggregate" features there.
    • Downside: there is then no definition anywhere of what those features are, so the feature store no longer has a complete view of the features that exist in the platform.
  2. Query the entity_df directly from the source defined for the FeatureView.
    • Currently this does not fit well with the existing objects, since it would mean that some "special" FeatureView sources also serve as the source of truth for entities during training. I would prefer to construct the entity_df from the actual source of truth (the table the source was derived from during ETL; transaction in this case).
  3. Extend the FeatureService, or create a new object, to capture the fact that some features are "provided" along with the query to the store.
    • This way they are at least documented somewhere in the feature store.
    • These are also often the features we will not look up from the online store during serving, since they are supplied directly on the user request. So during serving, too, they are in a way "provided" on demand, similar to how they might be "provided" as part of the entity_df.
    • I am referencing the term "OnDemandFeature" from the FeatureService RFC here, since I believe it describes the same concept as what I mean by "provided".

It seems to me that 3) could be a viable approach. We could change the interface (see the FeatureService constructor call below) to look something like the following:

"""
transaction table:
| payment | company | amount | timestamp        |
|--------:|---------|--------|------------------|
|       1 |       A |     10 | 2021-05-05 10:00 |
|       2 |       B |     20 | 2021-05-05 11:00 |
|       3 |       C |     10 | 2021-05-05 12:00 |
|       4 |       C |     30 | 2021-05-05 13:00 |
"""

company_stats_fv = FeatureView(
    name="company_stats",
    entities=["company"],
    features=[
        Feature(name="total_transactions_last_7d", dtype=ValueType.FLOAT),
    ]
)
"""
company_stats_fv source after ETL:
| company | total_transactions_last_7d | timestamp        |
|---------|----------------------------|------------------|
|       A | 1                          | 2021-05-05 15:00 |
|       B | 1                          | 2021-05-05 15:00 |
|       C | 2                          | 2021-05-05 15:00 |
|       C | 2                          | 2021-05-05 15:00 |
"""

# Query transaction table to get the relevant entities and on demand features for the model
entity_df = ...  # SELECT payment, company, amount AS payment__amount FROM transaction;
"""
entity_df result, contains a feature as well:
| payment | company | payment__amount | timestamp        |
|--------:|---------|-----------------|------------------|
|       1 |       A |              10 | 2021-05-05 10:00 |
|       2 |       B |              20 | 2021-05-05 11:00 |
|       3 |       C |              10 | 2021-05-05 12:00 |
|       4 |       C |              30 | 2021-05-05 13:00 |
"""

# Query feast with FeatureService
feature_service = FeatureService(
    name="team_model",
    features=[
        OnDemandFeature("payment__amount"),
        Feature("company__total_transactions_last_7d"),
    ],
)

training_df = store.get_historical_features(
    feature_service=feature_service, 
    entity_df=entity_df
)
"""
training dataset:
| payment | company | payment__amount | company__total_transactions_last_7d | timestamp        |
|--------:|---------|-----------------|-------------------------------------|------------------|
|       1 |       A |              10 | 1                                   | 2021-05-05 10:00 |
|       2 |       B |              20 | 1                                   | 2021-05-05 11:00 |
|       3 |       C |              10 | 2                                   | 2021-05-05 12:00 |
|       4 |       C |              30 | 2                                   | 2021-05-05 13:00 |
"""

What this changes is that we:

  1. Do not duplicate the data; it is queried together with the entity_df.
  2. Do not register the OnDemandFeature with a FeatureView, but still register it somewhere, namely with the FeatureService.
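A minimal pandas sketch of the proposed semantics (all names here are hypothetical; this is not existing Feast behaviour): "provided" features must already be columns on the entity_df and pass through untouched, while regular feature views are point-in-time joined as usual.

```python
import pandas as pd

def get_historical_features_sketch(entity_df, provided, feature_views):
    """Hypothetical sketch of the proposal: columns listed in `provided`
    must already exist on entity_df and are passed through untouched;
    each (join_key, dataframe) pair in `feature_views` is point-in-time
    joined as usual."""
    missing = [c for c in provided if c not in entity_df.columns]
    if missing:
        raise ValueError(f"provided features absent from entity_df: {missing}")
    result = entity_df.sort_values("event_timestamp")
    for join_key, fv_df in feature_views:
        result = pd.merge_asof(
            result,
            fv_df.sort_values("event_timestamp"),
            on="event_timestamp", by=join_key, direction="backward",
        )
    return result
```

The point of the sketch is that a "provided" feature needs no join at all; the store only validates and documents it.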

Describe alternatives you've considered

See proposed options 1) and 2) above.

Additional context

Example of the existing flow: saving the data and then retrieving it again. Note where the proposed flow above differs, namely in how we query for the entity_df.

"""
transaction table:
| payment | company | amount | timestamp        |
|--------:|---------|--------|------------------|
|       1 |       A |     10 | 2021-05-05 10:00 |
|       2 |       B |     20 | 2021-05-05 11:00 |
|       3 |       C |     10 | 2021-05-05 12:00 |
|       4 |       C |     30 | 2021-05-05 13:00 |
"""

payment_facts_fv = FeatureView(
    name="payment_facts",
    entities=["payment"],
    features=[
        Feature(name="amount", dtype=ValueType.FLOAT),
    ]
)
"""
payment_facts source after running ETL:
| payment | amount | timestamp        |
|--------:|--------|------------------|
|       1 |     10 | 2021-05-05 10:00 |
|       2 |     20 | 2021-05-05 11:00 |
|       3 |     10 | 2021-05-05 12:00 |
|       4 |     30 | 2021-05-05 13:00 |
"""

company_stats_fv = FeatureView(
    name="company_stats",
    entities=["company"],
    features=[
        Feature(name="total_transactions_last_7d", dtype=ValueType.FLOAT),
    ]
)
"""
company_stats source after running ETL:
| company | total_transactions_last_7d | timestamp        |
|---------|----------------------------|------------------|
|       A | 1                          | 2021-05-05 15:00 |
|       B | 1                          | 2021-05-05 15:00 |
|       C | 2                          | 2021-05-05 15:00 |
|       C | 2                          | 2021-05-05 15:00 |
"""

# Query transaction table to get the relevant entities for the model
entity_df = ...  # SELECT payment, company FROM transaction;
"""
entity_df:
| payment | company | timestamp        |
|--------:|---------|------------------|
|       1 |       A | 2021-05-05 10:00 |
|       2 |       B | 2021-05-05 11:00 |
|       3 |       C | 2021-05-05 12:00 |
|       4 |       C | 2021-05-05 13:00 |
"""

# Query feast
training_df = store.get_historical_features(
    entity_df=entity_df,
    feature_refs=[
        "payment_facts__amount",
        "company_statistics__total_transactions_last_7d",
    ],
)
"""
training dataset:
| payment | company | payment__amount | company__total_transactions_last_7d | timestamp        |
|--------:|---------|-----------------|-------------------------------------|------------------|
|       1 |       A |              10 | 1                                   | 2021-05-05 10:00 |
|       2 |       B |              20 | 1                                   | 2021-05-05 11:00 |
|       3 |       C |              10 | 2                                   | 2021-05-05 12:00 |
|       4 |       C |              30 | 2                                   | 2021-05-05 13:00 |
"""

Curious to hear any thoughts on this.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.