feast-dev / feast

The Open Source Feature Store for Machine Learning
https://feast.dev
Apache License 2.0

Separate transformation and materialization #4365

Closed. franciscojavierarceo closed this issue 4 weeks ago.

franciscojavierarceo commented 1 month ago

Is your feature request related to a problem? Please describe. As briefly mentioned in https://github.com/feast-dev/feast/issues/4277, our current structure of feature view decorators, whose naming convention references both the ingestion and the transformation pattern, is confusing.

Transformation and Materialization are two separate constructs that should be decoupled.

Feature Views are simply schema definitions that can be used online and offline and historically did not support transformation. We should change this.

As a concrete, simple example, suppose a user had a Spark offline store and a MySQL online store, using the Python feature server.

Suppose further that the user of the Feast service had 3 sets of data that required 3 different write patterns:

  1. Batch data from a scheduled Spark job whose output is a large Parquet file, with an entity key and some features, that is to be materialized to the online store.
  2. Streaming data sent to Feast by an asynchronous Kinesis/Kafka event, to be pushed to the online store.
  3. Online data sent through a synchronous API call to the online store (e.g., to the write-to-online-store endpoint).

Cases (1) and (2) are asynchronous and offer no guarantees about the consistency of the data when a client requests those features, but (3), if explicitly chosen to be a synchronous write, would have much stronger guarantees about data consistency.
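For concreteness, a minimal sketch of what these three write paths look like with today's Python SDK, assuming a repo configured as above and reusing quickstart-style names; the push source and feature view names are placeholders, not something this proposal introduces.

import pandas as pd
from feast import FeatureStore
from feast.data_source import PushMode

store = FeatureStore(repo_path=".")

row = pd.DataFrame(
    {
        "driver_id": [1001],
        "event_timestamp": [pd.Timestamp.utcnow()],
        "created": [pd.Timestamp.utcnow()],
        "conv_rate": [0.85],
        "acc_rate": [0.91],
        "avg_daily_trips": [10],
    }
)

# (1) Batch: the scheduled Spark job's output is picked up asynchronously
#     by the materialization engine.
store.materialize_incremental(end_date=pd.Timestamp.utcnow().to_pydatetime())

# (2) Streaming: the Kafka/Kinesis consumer pushes rows through a push source.
store.push("driver_stats_push_source", row, to=PushMode.ONLINE)

# (3) Online: a synchronous write that blocks until the row is in the online store.
store.write_to_online_store("driver_hourly_stats", row)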

If Feature Views allowed for feature transformations before writes, then the current view of Feature Views as representing batch features alone breaks down. This poor clarity is rooted in the mixture of transformations and materializations. Transformations can happen as part of a batch job, a streaming pipeline, or an API call, executed by different computation engines (Spark, a Flink application, or a simple Python microservice). Materialization can technically be done independently of the computation engine (e.g., the output of a Spark job can be materialized to the online store using something else).

If we want to enable Feature Views to allow for transformations, a Feature View no longer represents only a batch feature view, so adding a decorator (as proposed in https://github.com/feast-dev/feast/issues/4277) to represent that would be confusing.

Describe the solution you'd like We should update Feast to use a transform decorator and the write patterns should be more tightly coupled with the Feature View type. For example, Stream, On Demand, Batch, and regular Feature Views could all use the same transformation code but offer different guarantees about how the data will be written (Stream: asynchronously, On Demand: not at all, Batch: asynchronously, and Feature View: synchronously).
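To make the proposal a bit more concrete, here is a purely hypothetical sketch; the transform decorator and the udf= wiring below do not exist in Feast today and are only meant to illustrate "one UDF, several write guarantees".

import pandas as pd

# @transform(mode="pandas")                        # hypothetical decorator
def conv_rate_plus_100(inputs: pd.DataFrame) -> pd.DataFrame:
    # A plain pandas UDF, registered once, independent of any view type.
    out = pd.DataFrame()
    out["conv_rate_plus_100"] = inputs["conv_rate"] + 100.0
    return out

# The same UDF could then be attached to different view types, which alone
# decide when the output is written (all of this wiring is hypothetical):
#   StreamFeatureView(..., udf=conv_rate_plus_100)    -> written asynchronously
#   BatchFeatureView(..., udf=conv_rate_plus_100)     -> written asynchronously
#   OnDemandFeatureView(..., udf=conv_rate_plus_100)  -> never written
#   FeatureView(..., udf=conv_rate_plus_100)          -> written synchronously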

Describe alternatives you've considered N/A

Additional context @tokoko @HaoXuAI @shuchu what do you think here? My thoughts aren't perfectly fleshed out, but it's something I've been thinking about and trying to find a way to articulate well.

tokoko commented 1 month ago

@franciscojavierarceo Hey, thanks for starting a discussion on this. Two initial questions from me so far:

This poor clarity is rooted in the mixture of transformations and materializations.

I don't really get why you think this is the case. I guess they are sort of mixed in the streaming engine, is that what you're referring to? As far as batch materialization goes, materialization right now doesn't do any transforms whatsoever (it will after we introduce BatchFeatureViews), and they are imho properly decoupled. get_historical_features does transforms but no materialization, for example.

We should update Feast to use a transform decorator and the write patterns should be more tightly coupled with the Feature View type.

Again, I don't understand why we're talking about write patterns here. The way I think about it, transformations don't have anything to do with materialization directly. Different types of Feature Views specify at what stage in the Feast workflow transformations should be applied (and by what component):

  1. Transformations in the proposed BatchFeatureView are applied exclusively by the offline store. This can be during a get_historical_features call, or when the materialization engine calls on the offline store to prepare the dataset to be written to the online store. This also means that you have all the data present in the sources at your disposal and can write transformations as complicated as you want.
  2. Transformations in OnDemandFeatureView are applied by both the offline store and the online store, but only as part of get_historical_features and get_online_features calls, respectively. They are more versatile in that sense, but this also means that odfv transformations can only be row-level, as the online store has access to the latest features only.
  3. Transformations in StreamFeatureView are applied only as part of a streaming engine.

Lastly, I don't understand how a generic transform decorator will result in tighter coupling between transformations and feature view types than the current approach of each having a separate decorator. Can you provide some example usage of transform decorator here?

franciscojavierarceo commented 1 month ago

As far as batch materialization goes, materialization right now doesn't do any transforms whatsoever (it will after we introduce BatchFeatureViews), and they are imho properly decoupled. get_historical_features does transforms but no materialization, for example.

Correct, but if we follow the existing on_demand_feature_view and stream_feature_view, then we would have batch_feature_view, and I don't know if that actually makes sense when we want to support a transformation "on demand" that persists the output. The difference is that we want the "on demand" transformation to happen when the data producer pushes to Feast, not when the client requests the feature.

What I'm really after is a good representation of that use case; i.e., "Compute features on demand and write them for efficient retrieval." At the moment that can be done via an ODFV + FV (where you write to the FV using the output of the ODFV), but really this should just be a transformation (like what is done in the ODFV code) applied before the FV is written to the online store.
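For reference, a rough sketch of that ODFV + FV workaround as it can be done today, assuming the quickstart-style transformed_conv_rate ODFV; the precomputed_conv_rate FeatureView name and the event_timestamp handling are placeholders for the example.

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# 1. Run the on-demand transformation at push time, on the producer side.
computed = store.get_online_features(
    features=["transformed_conv_rate:conv_rate_plus_val1"],
    entity_rows=[{"driver_id": 1001, "val_to_add": 1, "val_to_add_2": 2}],
).to_df()

# 2. Persist the transformed output into a plain FeatureView so that later
#    reads are pure retrieval, with no recomputation.
computed["event_timestamp"] = pd.Timestamp.utcnow()
store.write_to_online_store("precomputed_conv_rate", computed)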

franciscojavierarceo commented 1 month ago

Again, I don't understand why we're talking about write patterns here. The way I think about it, transformations don't have anything to do with materialization directly. Different types of Feature Views specify at what stage in the Feast workflow transformations should be applied (and by what component):

  1. Transformations in the proposed BatchFeatureView are applied exclusively by the offline store. This can be during a get_historical_features call, or when the materialization engine calls on the offline store to prepare the dataset to be written to the online store. This also means that you have all the data present in the sources at your disposal and can write transformations as complicated as you want.
  2. Transformations in OnDemandFeatureView are applied by both the offline store and the online store, but only as part of get_historical_features and get_online_features calls, respectively. They are more versatile in that sense, but this also means that odfv transformations can only be row-level, as the online store has access to the latest features only.
  3. Transformations in StreamFeatureView are applied only as part of a streaming engine.

Let's look at the docs:

From the feature view page:

A feature view is an object that represents a logical group of time-series feature data as it is found in a data source. Depending on the kind of feature view, it may contain some lightweight (experimental) feature transformations (see [Alpha] On demand feature views). ... Feature views are used during:

The generation of training datasets by querying the data source of feature views in order to find historical feature values. A single training dataset may consist of features from multiple feature views.

Loading of feature values into an online store. Feature views determine the storage schema in the online store. Feature values can be loaded from batch sources or from stream sources.

Retrieval of features from the online store. Feature views provide the schema definition to Feast in order to look up features from the online store.

From the stream feature view page:

A stream feature view is an extension of a normal feature view. The primary difference is that stream feature views have both stream and batch data sources, whereas a normal feature view only has a batch data source.

Stream feature views should be used instead of normal feature views when there are stream data sources (e.g. Kafka and Kinesis) available to provide fresh features in an online setting.

And the data source is described as:

A data source in Feast refers to raw underlying data that users own (e.g. in a table in BigQuery). Feast does not manage any of the raw underlying data but instead, is in charge of loading this data and performing different operations on the data to retrieve or serve features. ...

  1. Batch data sources: ideally, these live in data warehouses (BigQuery, Snowflake, Redshift), but can be in data lakes (S3, GCS, etc). Feast supports ingesting and querying data across both.

  2. Stream data sources: Feast does not have native streaming integrations. It does however facilitate making streaming features available in different environments. There are two kinds of sources:

    • Push sources allow users to push features into Feast, and make it available for training / batch scoring ("offline"), for realtime feature serving ("online") or both.

    • [Alpha] Stream sources allow users to register metadata from Kafka or Kinesis sources. The onus is on the user to ingest from these sources, though Feast provides some limited helper methods to ingest directly from Kafka / Kinesis topics.

  3. (Experimental) Request data sources: This is data that is only available at request time (e.g. from a user action that needs an immediate model prediction response). This is primarily relevant as an input into on-demand feature views, which allow light-weight feature engineering and combining features across sources.

Going back to my goal of finding a way to represent "Compute features on demand and write them for efficient retrieval", I don't think the existing conventions make this obvious.

What do you think? Any thoughts on a naming convention? Maybe Batch Feature View is fine, and maybe we add a new FV construct to represent what we want, like Persistent On Demand or something else?

franciscojavierarceo commented 1 month ago

Lastly, I don't understand how a generic transform decorator will result in tighter coupling between transformations and feature view types than the current approach of each having a separate decorator. Can you provide some example usage of transform decorator here?

Because transformations are independent of materialization. For example, with a Spark offline store and a MySQL online store, you could use the same UDF to create your historical features and execute it in the use case I outlined.

tokoko commented 1 month ago

What do you think? Any thoughts on a naming convention? Maybe Batch Feature View is fine, and maybe we add a new FV construct to represent what we want, like Persistent On Demand or something else?

Sure, this probably comes down to naming. I'm not a big fan of BatchFeatureView either, although Tecton uses the term, so at least for some people it might be more familiar. I dislike Persistent On Demand too because (depending on how we implement it) sometimes transformations in BatchFeatureView will run during a get_historical_features call w/o persistence; I think "persistent" only makes sense for online store queries. Maybe something like DerivedFeatureView or VirtualFeatureView is better(?), but to be fair even odfvs are in a way virtual, so that might end up being even more confusing.

franciscojavierarceo commented 1 month ago

What about PrecomputedFeatureView? 🤔

franciscojavierarceo commented 1 month ago

Here are 17 suggestions from ChatGPT:

  1. Cached Transform Feature View
  2. Computed Feature View
  3. Computed Storage View
  4. Computed Transform Feature View
  5. Persisted Computation Feature View
  6. Persisted Transform Feature View
  7. Persistent Feature View
  8. Persistent Transform Feature View
  9. Precomputed Feature View
  10. Precomputed Persistent Feature View
  11. Precomputed Storage Feature View
  12. Stored Computed Feature View
  13. Stored Transform Feature View
  14. Stored Transformation Feature View
  15. Transform and Store Feature View
  16. Transformation Feature Feature View
  17. Transforming Feature View

franciscojavierarceo commented 1 month ago

Here are the ones I like:

  1. Computed Feature View
  2. Transformed Feature View
  3. Precomputed Feature View
  4. Persistent Transform Feature View
  5. Transformation Feature View

HaoXuAI commented 1 month ago

From my perspective, materialization shouldn't do the transformation, but just load the data from offline to online.

The transformation should be handled before materialization and run on the user's compute. That way, the transformation UDF of the feature view is only a registry entry for the user to use during their transformation.

It's probably clearer if we introduce transformation engines (or something like a processor) into Feast. Then BatchFeatureView and StreamFeatureView are not confusing anymore: they are transformation logic run on an engine to load data from a stream (StreamFeatureView) or batch (BatchFeatureView) source into the offline/online store.

OnDemandFeatureView is special; I would think of it as an "online" feature view, while stream/batch are offline feature views. The transformation of an OnDemandFeatureView is pretty much up to the user's runtime environment.

franciscojavierarceo commented 1 month ago

From my perspective, materialization shouldn't do the transformation, but just load the data from offline to online.

Agreed.

The transformation should be handled before materialization and run on the user's compute. That way, the transformation UDF of the feature view is only a registry entry for the user to use during their transformation.

It's probably clearer if we introduce transformation engines (or something like a processor) into Feast. Then BatchFeatureView and StreamFeatureView are not confusing anymore: they are transformation logic run on an engine to load data from a stream (StreamFeatureView) or batch (BatchFeatureView) source into the offline/online store.

Agreed. I like the idea of creating a "Transformation Engine" construct, though I think for the OnDemandFeatureView case it gets confusing because it is basically the Feast Feature Server.

OnDemandFeatureView is special; I would think of it as an "online" feature view, while stream/batch are offline feature views. The transformation of an OnDemandFeatureView is pretty much up to the user's runtime environment.

Currently, "On Demand" means "do this transformation at request time," and that language is fairly consistent for StreamFeatureViews, since a StreamFeatureView does the transformation when the stream is consumed.

For a BatchFeatureView the language is analogous, in the sense that the transformation would happen while the batch data is processed.

For this new use case, the language gets a little more opaque in my opinion. I suppose that's why I like PrecomputedFeatureView.

franciscojavierarceo commented 1 month ago

If we think of it in terms of what happens at run time:

flowchart LR
D[On Demand] --> |`get_online_features` | E(Dynamic Computation + Retrieval)
A[Batch] --> |`get_online_features` | B(Pure Retrieval)
C[Stream] --> |`get_online_features` | B(Pure Retrieval)
F[Precomputed] --> |`get_online_features` | B(Pure Retrieval)

And so I think the clarity is worthwhile and valuable.

HaoXuAI commented 1 month ago

Personally I'm not that into "PrecomputedFeatureView"; it's not a standard or widely used term in the industry... Also, the Batch and Feature Views have nothing to do with get_online_features, do they?

franciscojavierarceo commented 1 month ago

Personally I'm not that into "PrecomputedFeatureView"; it's not a standard or widely used term in the industry... Also, the Batch and Feature Views have nothing to do with get_online_features, do they?

I think we have to steer the industry here, as this use case is needed but not well understood.

Other feature stores settle on "On-Demand" or "Streaming"; see Feature Form as another example, as well as Tecton and Databricks.

The precomputed case relies on a synchronous API call from the client to write to the feature store, so that's really the only difference; otherwise it would behave like a StreamFeatureView. I understand why everyone may have settled on "just use streaming," but it turns out that's not super clear. This was something we dealt with a lot at Affirm, fwiw: we needed strictly synchronous writes and wanted to block the client until the features were computed.

franciscojavierarceo commented 1 month ago

After discussion and feedback from @HaoXuAI and @tokoko, the agreed-upon solution is to make the write option configurable on the On Demand Feature View and declare it in the decorator.

dandawg commented 1 month ago

I just read through the notes--fascinating discussion here. I have some thoughts on transformations and UX.

I like the idea of a transform method implementation on a FeatureView; that makes chaining (or more complex DAG operations) possible and more intuitive to code.

my_feature1 = FeatureView().transform(<udf-here>)
my_feature2 = my_feature1.transform(<another-udf>)

The transform function would output another FeatureView and have the option of persisting the transformation to a store (or possibly locally in memory, and possibly lazily).
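To illustrate the chaining idea only (nothing below is Feast API; DerivedView and its methods are made up for this sketch), each transform call could return a new view object that remembers its parent and UDF, so a later materialization step can walk the resulting DAG:

from dataclasses import dataclass
from typing import Callable, Optional
import pandas as pd

@dataclass
class DerivedView:  # hypothetical stand-in for a FeatureView-like object
    name: str
    parent: Optional["DerivedView"] = None
    udf: Optional[Callable[[pd.DataFrame], pd.DataFrame]] = None

    def transform(self, udf: Callable[[pd.DataFrame], pd.DataFrame]) -> "DerivedView":
        # Return a new node in the DAG rather than mutating this one.
        return DerivedView(name=f"{self.name}__derived", parent=self, udf=udf)

    def run(self, df: pd.DataFrame) -> pd.DataFrame:
        # Apply ancestor transformations first, then this view's own UDF (if any).
        if self.parent is not None:
            df = self.parent.run(df)
        return self.udf(df) if self.udf is not None else df

my_feature1 = DerivedView("base").transform(lambda df: df.assign(x2=df["x"] * 2))
my_feature2 = my_feature1.transform(lambda df: df.assign(x4=df["x2"] * 2))
print(my_feature2.run(pd.DataFrame({"x": [1, 2]})))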

This would make it easier to support "feature pipelines" later.

It would also be more intuitive to keep transformation functionality centered in the transformation method (rather than in other parts of the FeatureView class). (The idea of a FeatureView as a "noun", and transform as a "verb").

Last thought would be considering different types of transforms (Spark, Python func, Ray, Dask, etc.) and supported options (do we have several transform methods, or just one generic?).

Curious on your thoughts.

franciscojavierarceo commented 1 month ago

This would make it easier to support "feature pipelines" later.

Cool idea!

It would also be more intuitive to keep transformation functionality centered in the transformation method (rather than in other parts of the FeatureView class). (The idea of a FeatureView as a "noun", and transform as a "verb").

This was my thinking with the decorator approach.

Last thought would be considering different types of transforms (Spark, Python func, Ray, Dask, etc.) and supported options (do we have several transform methods, or just one generic?).

We would be, and we would declare this in the decorator via the mode parameter. See this PR. We could make this very extensible so that chaining could behave per mode.
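For reference, this is roughly how mode is declared on the existing on-demand decorator today (pandas mode shown; the request source and field names here are made up for the example):

import pandas as pd
from feast import Field, RequestSource
from feast.on_demand_feature_view import on_demand_feature_view
from feast.types import Float64

# A request-time input, only available when the call is made.
input_request = RequestSource(
    name="vals_to_add",
    schema=[Field(name="val_to_add", dtype=Float64)],
)

@on_demand_feature_view(
    sources=[input_request],
    schema=[Field(name="val_plus_one", dtype=Float64)],
    mode="pandas",  # "python" and "substrait" are the other currently supported modes
)
def val_plus_one(inputs: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame()
    out["val_plus_one"] = inputs["val_to_add"] + 1.0
    return out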

tokoko commented 1 month ago

@dandawg you should probably check #4277 out as well. The syntax should probably follow the odfv pattern instead, but sure, the idea is the same.

Last thought would be considering different types of transforms (Spark, Python func, Ray, Dask, etc.) and supported options (do we have several transform methods, or just one generic?).

As Francisco pointed out, we have a concept of mode in odfvs for this (currently pandas, python, and substrait). I'm personally against the idea of introducing more specific modes (for spark, ray, dask, and so on) because that would imply that a transformation would become usable for a single offline store engine only (you can't run dask in a spark engine). Instead I'm hoping to use the substrait mode (which actually uses ibis) for those scenarios: a user describes transformations in ibis, and each offline store engine uses the ibis library to execute those transformations on its respective backend. The upside is that you could in theory switch out the offline store engine w/o rewriting transformations.
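For illustration, a rough sketch of what a substrait-mode transformation might look like, assuming the UDF receives and returns an ibis Table and that a request source can feed it (names are made up for the example; check the current docs for the exact signature):

from ibis.expr.types import Table
from feast import Field, RequestSource
from feast.on_demand_feature_view import on_demand_feature_view
from feast.types import Float64

conv_acc_request = RequestSource(
    name="conv_acc_request",
    schema=[
        Field(name="conv_rate", dtype=Float64),
        Field(name="acc_rate", dtype=Float64),
    ],
)

@on_demand_feature_view(
    sources=[conv_acc_request],
    schema=[Field(name="conv_plus_acc", dtype=Float64)],
    mode="substrait",
)
def conv_plus_acc(inputs: Table) -> Table:
    # Expressed once in ibis; each offline store engine can execute it on its own backend.
    return inputs.select(
        (inputs["conv_rate"] + inputs["acc_rate"]).name("conv_plus_acc")
    )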

franciscojavierarceo commented 1 month ago

I'm personally against the idea of introducing more specific modes (for spark, ray, dask, and so on) because that would imply that a transformation would become usable for a single offline store engine only (you can't run dask in a spark engine). Instead I'm hoping to use the substrait mode (which actually uses ibis) for those scenarios: a user describes transformations in ibis, and each offline store engine uses the ibis library to execute those transformations on its respective backend.

I hear you; I think there are pros and cons to this approach. Since you implemented ibis already, I think we should share this with folks, but if someone wants to contribute a mode in Ray, we shouldn't turn them away imo.

dandawg commented 1 month ago

I love support for ibis, as long as we don't force users to have to learn ibis (I know, it's easy, but there are lots of reasons developers may want/need to use specific tools).

Also, I absolutely love the odfv functionality. I'm glad we have it, but the current pattern makes it hard to understand where/how/when transformations are happening. I'd love for it to be more explicitly ordered and composable. In this example (from the docs), the transform and the feature itself are conceptually conflated, which is somewhat confusing.

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
        "transformed_conv_rate:conv_rate_plus_val1",
        "transformed_conv_rate:conv_rate_plus_val2",
    ],
).to_df()

Above, we have "transformed_conv_rate". To know that this is an odfv, we have to traverse the code. In most IDEs, if it were a Python method, we could right-click and view the definition and all references where the transform is used (but I can't if it's a string). I'm also not as confident about stack traces being able to reference the right line of code in the event of an error.

I realize we may not want to address all of these issues here, but I wanted to comment on them for visibility.

tokoko commented 1 month ago

I love support for ibis, as long as we don't force users to have to learn ibis (I know, it's easy, but there are lots of reasons developers may want/need to use specific tools).

Sure, unless ibis itself becomes a lot more popular in the future, it's unlikely we can get away with just ibis. Another easy-to-maintain alternative is to have a generic sql mode, with users being responsible for ensuring that the SQL queries they provide match the dialect that the engine expects. Or we can have more engine-specific modes as well, of course.

Above, we have "transformed_conv_rate". To know that this is an odfv, we have to traverse the code. In most IDEs, if it were a Python method, we could right-click and view the definition and all references where the transform is used (but I can't if it's a string). I'm also not as confident about stack traces being able to reference the right line of code in the event of an error.

I hate to use the "it's a feature, not a bug" argument here, but there are good reasons for this separation (and for the API expecting strings as arguments) due to Feast's architecture. Even though Feast has a "definitions as Python code" approach, object definition and actual execution are completely decoupled. In other words, when an "administrator" runs feast apply, all relevant information is stored in the registry, and Feast calls (get_online_features, get_historical_features) only use the registry during execution, not the Python files from which the registry metadata was generated.

In a realistic production setting, those two environments may in fact have nothing in common. Python functions won't even be directly accessible most of the time; instead, they are serialized with dill during feast apply, stored in the registry, and retrieved/deserialized from the registry during execution. The alternative would be for users to be responsible for keeping the codebase on the "definition" side and the "execution" side exactly the same. Just to compare and contrast... not sure how familiar you are with Airflow, but it has exactly this restriction, meaning that the Python code structure present in the dag-processor (DAG definitions) needs to match up exactly with the one in the worker/executor (DAG execution), and it's up to the user to manage that (by filesystem mounting, using the same Docker image for both, or some other method).
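A tiny sketch of that decoupling, using nothing beyond dill itself: the "definition" side serializes the UDF into bytes that end up in the registry at feast apply time, and the "execution" side only ever loads those bytes, never the original module (the UDF here is an arbitrary example):

import dill
import pandas as pd

def conv_rate_plus_one(df: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame()
    out["conv_rate_plus_one"] = df["conv_rate"] + 1.0
    return out

# Definition side: bytes like these are what end up stored in the registry.
udf_bytes = dill.dumps(conv_rate_plus_one)

# Execution side (potentially a different process, image, or machine):
restored = dill.loads(udf_bytes)
print(restored(pd.DataFrame({"conv_rate": [0.5]})))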

franciscojavierarceo commented 1 month ago

I hate to use the "it's a feature, not a bug" argument here, but there are good reasons for this separation (and for the API expecting strings as arguments) due to Feast's architecture. Even though Feast has a "definitions as Python code" approach, object definition and actual execution are completely decoupled. In other words, when an "administrator" runs feast apply, all relevant information is stored in the registry, and Feast calls (get_online_features, get_historical_features) only use the registry during execution, not the Python files from which the registry metadata was generated.

+1

There are ways to treat the UDFs as actual code, but that has important consequences for the feature server's behavior.

franciscojavierarceo commented 1 month ago

In a realistic production setting, those two environments may in fact have nothing in common. Python functions won't even be directly accessible most of the time; instead, they are serialized with dill during feast apply, stored in the registry, and retrieved/deserialized from the registry during execution.

Correct. In my previous role we actually did couple them, and the big consequence was that new features required a full deployment, which was costly in time.

franciscojavierarceo commented 1 month ago

https://github.com/feast-dev/feast/issues/4376 will solve this.