feast-dev / feast

The Open Source Feature Store for Machine Learning
https://feast.dev
Apache License 2.0

Feature Tables vs Feature Views #1583

Closed woop closed 2 years ago

woop commented 3 years ago

Creating this issue to discuss three concepts in Feast: Feature Sets, Feature Tables, and Feature Views.

Feature Sets

Prior to Feast 0.8, Feast had a concept of Feature Sets (not to be confused with the new Feature Set RFC). Feature sets were logical groups of features that occurred together. The features in a group shared an entity (which could be composite), and in the offline case they also shared timestamps. For example, a feature set could be used to store a log of events, or it could be used to store the results of an aggregation. The idea was that different processes (stream or batch ETLs) would output data into their own tables, and Feast would join these different tables during retrieval. Feature sets therefore avoided a sparse table problem.

Importantly, Feature Sets did NOT have a source. Users were always asked to push data to the feature store. For batch ingestion, users did the following:
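
To illustrate the "join these different tables during retrieval" idea, here is a minimal sketch of a point-in-time lookup across two separately written tables. It is not Feast's implementation; the table layout, the data, and the names `latest_value_as_of` and `join_feature_sets` are all assumptions made for this example.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical data: each "feature set" lives in its own table, keyed by
# entity, with (event_timestamp, value) rows sorted by timestamp.
driver_trips = {
    "driver_1": [(datetime(2021, 5, 1), 10), (datetime(2021, 5, 3), 14)],
}
driver_rating = {
    "driver_1": [(datetime(2021, 5, 2), 4.7)],
}


def latest_value_as_of(table, entity_key, as_of):
    """Return the most recent value at or before `as_of`, or None."""
    rows = table.get(entity_key, [])
    timestamps = [ts for ts, _ in rows]
    idx = bisect_right(timestamps, as_of)
    return rows[idx - 1][1] if idx else None


def join_feature_sets(tables, entity_key, as_of):
    """Join several per-process tables into one feature row at retrieval time."""
    return {
        name: latest_value_as_of(table, entity_key, as_of)
        for name, table in tables.items()
    }


tables = {"trips": driver_trips, "rating": driver_rating}
row = join_feature_sets(tables, "driver_1", datetime(2021, 5, 2, 12))
```

Because each process writes only to its own table, no table needs null-filled columns for features it does not produce. The join happens only when a row is requested.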

client.ingest("my_feature_set", my_pandas_dataframe)

For stream ingestion, teams would push to a specific topic for a feature set.

The feature store would provide both offline and online storage of user data, and allowed users to imperatively load features into the feature store. Feature sets made the feature store into the source of truth for feature data. Users would ingest from both their notebooks as well as their batch or streaming ETL pipelines.

Feature Tables

In Feast 0.8, we replaced Feature Set with Feature Table. The main reason was scoping. Many teams already have data stored in specific locations like data warehouses and lakes. A feature table points at such an external source, which allows Feast to materialize (load) data from outside the feature store into the feature store for storage and serving. It also means that Feast doesn't have to become the source of truth for feature data (it lives externally). Feast would not create or manage the offline store in this case, unlike in Feast 0.7 and before.

The idea was not that Feast would never provide an offline store. The primary reason we did not start with managing an offline store was that the source-centric approach scoped down the project while still allowing us to address most use cases.

Because we had ingest() for feature sets in Feast 0.7, we had to provide backward compatibility for teams that wanted to ingest data from ETL pipelines. In order to do so, we still provided the ingest() functionality. However, this pushed directly to the source location, not into the offline store. The point of this ingest() was only to provide a migration path to the new Feast (0.8, 0.9), not to be a long-term API that exists alongside sources. In fact, pushing directly to a source is an anti-pattern, since it's often the case that teams do not have write access to sources.

Feature Views

Feature Views were introduced in Feast 0.10. Feature views can be thought of as

Feature views in Feast 0.10 function the same as feature tables in 0.9, except that we do not allow direct ingestion to a feature view's source. The feature view can be "materialized", which pulls from the source and loads the data into the feature store. Right now we only materialize into the online store, since we are able to query the batch source directly in order to build training datasets.
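
As a rough illustration of what "materialize" means here, the sketch below copies the latest row per entity from a batch source into an online store for a given time window. It is a toy model, not Feast's code: the row layout, the dict-based online store, and the function `materialize` are assumptions for this example only.

```python
from datetime import datetime

# Hypothetical batch source rows: (entity_key, event_timestamp, feature_values).
source_rows = [
    ("driver_1", datetime(2021, 5, 1), {"trips_today": 10}),
    ("driver_1", datetime(2021, 5, 3), {"trips_today": 14}),
    ("driver_2", datetime(2021, 5, 2), {"trips_today": 7}),
]


def materialize(rows, start, end, online_store):
    """Load the newest row per entity within [start, end] into online_store.

    The online store keeps a single (timestamp, values) pair per entity and
    is only overwritten by strictly newer data.
    """
    for entity_key, ts, values in rows:
        if not (start <= ts <= end):
            continue
        current = online_store.get(entity_key)
        if current is None or ts > current[0]:
            online_store[entity_key] = (ts, values)
    return online_store


online_store = {}
materialize(source_rows, datetime(2021, 5, 1), datetime(2021, 5, 2), online_store)
```

In this model the source is only ever read from. That matches the point above that a feature view does not allow direct ingestion to its source.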

Note: Feast is not only concerned with loading data into an online store. Feature views only dictate that the source of data lives externally to the feature store, but there is a case for materialization into both an online and offline store in theory. The use case for the offline store is

Feature Tables (potential reintroduction)

Now that feature views have a clear purpose, we are considering introducing feature tables to address the previously removed ingestion functionality. The use case is the same as feature sets in Feast 0.7.

Users have data in their ETL pipelines or Jupyter notebooks, and they need a structured location to store that data for consumption in models. Feature tables would allow them to load and store their data in the feature store, thereby becoming the source of truth for this feature data. This solves the following problems.

Pseudo code

# load dataframe
df = pd.read_csv("my_data.csv")

# create feature table
ft = FeatureTable.infer_from_df(df)

# register feature table
fs.apply(ft)

# ingest/load data
ft.ingest(df)

Alternatives

An alternative proposed by @animeshsingh is to only use feature views and to ask users to always bring their own sources. Users would be responsible for uploading their data to a source location. The benefit of this approach is that we introduce fewer concepts to Feast and keep our APIs simpler.

animeshsingh commented 3 years ago

Thanks @woop - great writeup. The main point I had was that all the data movement, including transformation and engineering, can be done using the concept of Feature Views. So e.g.

  1. From external source -> offline store and/or online store
  2. From offline store -> online store

So Feature Definitions consisting of Feature Views become the DSL/metadata contract to define

  1. From where to move Data (e.g. from external to feast internal, from feast internal (offline store as source))
  2. What subset (view) of Data to move
  3. What to use for ETL (feast default with materialize or user provided pipeline)
  4. Where to move data (Offline or Online, or both)

So FeatureSets still make sense, though I think we can live with just one of View or Table.
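
The four-part contract listed in this comment could be sketched as a small data structure. This is a hypothetical illustration of the proposal, not an existing Feast class: `ViewContract`, `Destination`, and all field names are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Destination(Enum):
    OFFLINE = "offline"
    ONLINE = "online"
    BOTH = "both"


@dataclass(frozen=True)
class ViewContract:
    """Hypothetical metadata contract for moving feature data."""

    source: str                                    # 1. where to move data from
    features: tuple                                # 2. which subset (view) to move
    etl: str = "materialize"                       # 3. default ETL or a user pipeline
    destination: Destination = Destination.ONLINE  # 4. where to move data to

    def targets(self):
        """Expand the destination into the concrete stores to write to."""
        if self.destination is Destination.BOTH:
            return ("offline", "online")
        return (self.destination.value,)


# Case 1 in the comment: external source -> offline and online store.
external_to_both = ViewContract(
    source="warehouse.driver_stats",
    features=("trips_today",),
    destination=Destination.BOTH,
)

# Case 2 in the comment: offline store -> online store.
offline_to_online = ViewContract(
    source="offline_store.driver_stats",
    features=("trips_today",),
)
```

Under this reading, a single concept covers both kinds of movement, and only the `source` and `destination` fields differ.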

woop commented 3 years ago

Next steps here are to create a proposal for both approaches

FeatureServices/Sets are out of scope.

rakshithvsk commented 3 years ago

Hey @woop,

Great info. Thanks for the detailed explanation of concepts in different versions. I have recently migrated from "feast 0.9.3" to "feast 0.10.8" and have a few questions after using FeatureViews:

  1. All I see is that FeatureView revolves around the FeatureRepo directory. With the introduction of FeatureViews, are we planning to remove the dependency on Feast Core, Serving and Feast Spark?
  2. I don't see start_stream_to_online_ingestion in FeatureView, which was available in Client. Should I still depend on FeatureTable for online ingestion?
  3. With Redis coming into the picture as an online store, do we have plans to provide Redis as an option in the feature_store.yaml file?

After migrating to feast 0.10.8, these were some of the questions I couldn't find answers to. Let me know your thoughts @woop

woop commented 3 years ago
  • All I see is that FeatureView revolves around the FeatureRepo directory hence with the introduction of FeatureViews, are we planning to remove dependency on Feast Core, Serving and Feast Spark?
  • Also I don't see start_stream_to_online_ingestion in FeatureView, which was available in Client. So should I still depend on FeatureTable for online ingestion?

@rakshithvsk these are all very valid questions, but they are off topic for this specific thread. Can you please move the discussion to #1527?

rightx2 commented 3 years ago

@woop Question. Will the FeatureTable be deprecated in the future?

rakshithvsk commented 3 years ago

Thanks for the reply @woop. I've moved the discussion over here: https://github.com/feast-dev/feast/issues/1527#issuecomment-874488877

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

judahrand commented 2 years ago

I think this is still worth talking about - should FeatureTables be removed from the codebase? They're currently untested anywhere, so it seems to be asking for trouble to keep them around. It also makes it harder to rework the data types, because both FeatureTables and FeatureViews need to be updated. What do you think @woop?

They can always be added back in a well-considered way later for ingestion if required.