feast-dev / feast

The Open Source Feature Store for Machine Learning
https://feast.dev
Apache License 2.0

Identify State of Feature Views (Current use case is during materialization and stream ingestion) #4440

Open EXPEbdodla opened 3 weeks ago

EXPEbdodla commented 3 weeks ago

Is your feature request related to a problem? Please describe. We are using Feast at a large scale. Multiple users share a single registry with multiple projects, and each project is associated with a team. As the registry scales up, we would like to understand how users are using the feature store and what state each feature view is in. In particular, we would like to know whether any materializations are currently active, what materialization window each one is using, and how many records it is trying to upsert to the online store. I see the materialization_interval field on the feature view, but because we run materialization daily, that field keeps growing and would soon become a bottleneck during serialization and deserialization. We need a proper way to know the status and to log materialization history.

This feature may be needed for other types of feature views like Stream Feature View.

Describe the solution you'd like Solutions:

  1. Introduce a materialization state field to Feature View to know the current status of Feature View
  2. A getter method to retrieve the active materializations in a project, and also across all projects
  3. A log table in the SQL registry to record materialization events: the materialization job ID associated with the project and feature view, the start and end time of the materialization interval, the start and end time of the job, and the number of records written to the online store during the interval
  4. Use the materialization_interval field to show only the last N materialization intervals
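To make items 2 and 3 more concrete, here is a minimal sketch of what such a log table and an "active materializations" lookup could look like. It uses the standard-library sqlite3 module purely for illustration; the table name, column names, and helper functions are all hypothetical and are not part of Feast's SQL registry.

```python
import sqlite3
from datetime import datetime, timezone
from typing import List, Optional, Tuple

# Hypothetical schema for illustration only; Feast does not define this table.
SCHEMA = """
CREATE TABLE IF NOT EXISTS materialization_log (
    job_id          TEXT PRIMARY KEY,
    project         TEXT NOT NULL,
    feature_view    TEXT NOT NULL,
    interval_start  TEXT NOT NULL,
    interval_end    TEXT NOT NULL,
    job_start       TEXT NOT NULL,
    job_end         TEXT,     -- NULL while the job is still running
    records_written INTEGER   -- NULL until the job finishes
)
"""


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the log database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def start_job(conn: sqlite3.Connection, job_id: str, project: str,
              feature_view: str, interval_start: datetime,
              interval_end: datetime) -> None:
    """Record that a materialization job has started."""
    conn.execute(
        "INSERT INTO materialization_log VALUES (?, ?, ?, ?, ?, ?, NULL, NULL)",
        (job_id, project, feature_view, interval_start.isoformat(),
         interval_end.isoformat(), datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def finish_job(conn: sqlite3.Connection, job_id: str,
               records_written: int) -> None:
    """Record the job end time and the number of records written."""
    conn.execute(
        "UPDATE materialization_log SET job_end = ?, records_written = ? "
        "WHERE job_id = ?",
        (datetime.now(timezone.utc).isoformat(), records_written, job_id),
    )
    conn.commit()


def active_materializations(
    conn: sqlite3.Connection, project: Optional[str] = None
) -> List[Tuple[str, str, str]]:
    """Return (job_id, project, feature_view) for jobs that have not ended.

    With project=None the lookup spans all projects.
    """
    query = ("SELECT job_id, project, feature_view FROM materialization_log "
             "WHERE job_end IS NULL")
    params: Tuple[str, ...] = ()
    if project is not None:
        query += " AND project = ?"
        params = (project,)
    return conn.execute(query + " ORDER BY job_id", params).fetchall()
```

Because the log lives in its own table, the feature view object itself would not grow with every daily run.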

Describe alternatives you've considered No alternative at this point. We currently reach out to users to find out whether any materialization jobs are running. As the user base grows, it becomes hard to get hold of each one to understand what's happening.

Additional context NA

EXPEbdodla commented 3 weeks ago

@franciscojavierarceo FYI

franciscojavierarceo commented 3 weeks ago

This is a good idea. @tokoko

tokoko commented 2 weeks ago

Introduce a materialization state field to Feature View to know the current status of Feature View

What do you consider state to be here? Is it always the last upper bound of the materialization window, or maybe some arbitrary user-defined data?

Log table in SQL Registry to understand the materialization events. Materialization Job ID and associate with project, feature view, Start time, end time, Number of records written to online store during the interval

I feel this is the most problematic one here because of file-based registries, where it will be considerably harder to accomplish the same.

Another alternative I've mentioned before is to support this sort of materialization log, but move the APIs for it to the online store instead of the registry. The major benefit of online store-managed materializations is that supporting multiple online stores at once (a much-requested feature) will become a lot easier, plus we won't run the risk of accidentally bloating the registry. wdyt?

franciscojavierarceo commented 2 weeks ago

This can get complicated either way we do it.

From a first-principles perspective, this is metadata, and metadata belongs in the registry because that's what's intuitive to users...but that can overload the registry and result in OOMs when caching the registry.

We could support both and make it configurable by the user. We could also store only the most recent materialization metadata for each feature view by default, and warn about memory issues if someone configures a file-based registry with full materialization metadata history.
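As a rough illustration of the "most recent only by default" idea, the sketch below keeps a bounded history of materialization intervals per feature view. The class name and the max_intervals parameter are hypothetical, not existing Feast configuration; the point is only that a fixed cap keeps the serialized size constant however many daily runs occur.

```python
from collections import defaultdict, deque
from datetime import datetime
from typing import Deque, Dict, List, Tuple

Interval = Tuple[datetime, datetime]


class BoundedIntervalHistory:
    """Keeps at most `max_intervals` recent intervals per feature view.

    Hypothetical helper for illustration; not a Feast class.
    """

    def __init__(self, max_intervals: int = 1) -> None:
        if max_intervals < 1:
            raise ValueError("max_intervals must be at least 1")
        # A deque with maxlen silently drops the oldest entry when full.
        self._history: Dict[str, Deque[Interval]] = defaultdict(
            lambda: deque(maxlen=max_intervals)
        )

    def record(self, feature_view: str, start: datetime, end: datetime) -> None:
        """Append a completed materialization interval."""
        self._history[feature_view].append((start, end))

    def intervals(self, feature_view: str) -> List[Interval]:
        """Return the retained intervals, oldest first."""
        return list(self._history.get(feature_view, ()))
```

With the default of 1, only the latest interval survives; a user who wants more history would raise the cap and accept the extra memory.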

EXPEbdodla commented 2 weeks ago

For me, state is mainly whether a materialization is currently in progress or not. This can help avoid running materializations in parallel while one is already active. The log table will hold the additional details of each materialization.
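A minimal sketch of that kind of state, assuming a simple IDLE/RUNNING flag per (project, feature view) pair. Everything here (the enum, the tracker class, the exception) is hypothetical and in-process only; a real registry-backed version would need an atomic check-and-set in the backing store so that separate processes see the same state.

```python
import threading
from contextlib import contextmanager
from enum import Enum
from typing import Dict, Iterator, Tuple


class MaterializationState(Enum):
    IDLE = "IDLE"
    RUNNING = "RUNNING"


class MaterializationAlreadyRunning(RuntimeError):
    """Raised when a second materialization is started for the same view."""


class StateTracker:
    """In-memory tracker of per-feature-view materialization state."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._states: Dict[Tuple[str, str], MaterializationState] = {}

    def state(self, project: str, feature_view: str) -> MaterializationState:
        with self._lock:
            return self._states.get(
                (project, feature_view), MaterializationState.IDLE
            )

    @contextmanager
    def running(self, project: str, feature_view: str) -> Iterator[None]:
        """Mark the view RUNNING for the duration of the with-block."""
        key = (project, feature_view)
        with self._lock:
            # Check and set under one lock so two callers cannot both pass.
            if self._states.get(key) is MaterializationState.RUNNING:
                raise MaterializationAlreadyRunning(
                    f"{project}/{feature_view} is already materializing"
                )
            self._states[key] = MaterializationState.RUNNING
        try:
            yield
        finally:
            # Reset to IDLE even if the job body raised.
            with self._lock:
                self._states[key] = MaterializationState.IDLE
```

A job would wrap its work in `with tracker.running(project, feature_view):`, and a second attempt on the same feature view would fail fast instead of running in parallel.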

I agree with what @franciscojavierarceo mentioned: this is primarily metadata, which is best stored in the registry, and materialization_interval should hold only the latest materialization information rather than every interval.

It can be an optional feature that only some registry types support.