Feast API: Feature references, concept hierarchy, and data model

feast-dev / feast

The Open Source Feature Store for Machine Learning

https://feast.dev

Apache License 2.0

Feast API: Feature references, concept hierarchy, and data model #479

Closed woop closed 4 years ago

woop commented 4 years ago

This issue is meant to be a discussion of the current Feast API as it relates to feature references, a key component of the user facing API. Additionally, it will also discuss the current data model and our concept hierarchy.

1. Background

The Feast user facing API and data model changed dramatically from 0.1 to 0.2+. The original intention was to simplify the API as much as possible and gradually evolve it as new user requirements became available.

Two important reference documents on this topic are

2. Problem statement

The Feast API is evolving as more and more teams adopt the software and share their requirements with us. In most cases this means an expansion of the API, but in some cases it means a reversal.

With the introduction of projects into Feast (Feast Projects RFC), our API has evolved again. This change has affected feature references, the data model, and concept hierarchy.

The most critical feedback on this change has been that it introduces unnecessary complexity to address problems (isolation, namespacing, security), that could be solved in a different way.

3. Objective

The point of this GitHub issue is to settle our API for feature references, our concept hierarchy, and data model in such a way that we

Meet all our known requirements for future development
Minimize user facing changes and migration requirements
Maintain flexibility in accepting new user requirements and evolving our API

Put simply, we want to make sure that we are on the right path and make the necessary changes now, when it's least disruptive.

4. What are feature references?

Feature references (previously Feature Ids) are strings/objects within Feast that allow Feast and its users to reference specific features. Feature references are primarily used as a means of indicating to Feast which features a user would like to retrieve.

Originally, feature references were defined as follows: <feature-set>:<feature-name>:<feature-version>. All parts of the above reference were required at the time.

Feature references have recently been updated (as part of the Projects RFC)

The move towards project namespaces places feature sets and features/entities into the following hierarchy: [screenshot of the concept hierarchy, 2020-02-18]

Feature references are now defined as: <project>/<feature-name>:<feature-version>

The following constraints apply

Versions are optional. If no version is provided then the latest version of a feature is used.
Feature names must be unique within a project (even across feature sets within that project).
Entity names must be unique within a project (but can be reused across feature sets).
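
As a rough illustration of the format and constraints above, a reference string could be split as follows. This is only a sketch: FeatureRef and parse_feature_ref are hypothetical names, not part of the Feast SDK, and a missing version is represented as None to mean "use the latest".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FeatureRef:
    """Hypothetical container for a parsed feature reference."""
    name: str
    project: Optional[str] = None   # may be set externally instead
    version: Optional[int] = None   # None means "use the latest version"


def parse_feature_ref(ref: str) -> FeatureRef:
    """Parse '<project>/<feature-name>:<feature-version>' (project and version optional)."""
    project, sep, rest = ref.partition("/")
    if not sep:
        # No '/' present: the whole string is '<feature-name>[:<version>]'.
        project, rest = None, ref
    name, sep, version = rest.partition(":")
    if not name:
        raise ValueError(f"feature name missing in reference: {ref!r}")
    return FeatureRef(name=name, project=project, version=int(version) if sep else None)
```

With this sketch, a bare name such as daily_transactions parses to a reference with no project and no version, which is the case the constraints above are meant to make unambiguous.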

One of our primary motivations was to allow users to reference features directly by name. With versions becoming optional and the project being set externally, this is now possible. Users can provide features as a list of feature names.

An example of feature references being used below (from the Python SDK):

# Features are referenced by name only; the project is set externally.
online_features = client.get_online_features(
    feature_refs=[
        "daily_transactions",
        "total_transactions",
    ],
    entity_rows=entity_rows,
)

5. How are feature references used?

5.1 During online serving

During online serving the user will provide two sets of information to Feast during feature retrieval.

A list of feature references
A list of entities

Feast wants to construct a response object with all of the data from these features on all of these entities.

For example, if a user sends a request with a single feature reference as daily_transactions, Feast will attempt to add the missing information. It will add the project id (which currently must be provided by the user), it will then determine the feature set that contains that feature name, and then finally it will determine the latest version of the feature set in which the feature occurs.

Internally, Feast is left with something that resembles the following: my_customer_project/my_customer_feature_set:daily_transactions:3

Since features are stored based on feature sets, Feast first converts the above into what we can informally define as a feature set reference, resembling the following: <project>/<feature-set-name>:<feature-set-version>, or concretely my_customer_project/my_customer_feature_set:3

In the case of Redis, Feast will use the above feature set reference, along with the entities the user has provided, to construct a list of keys to look up. The responses from the database are then used to build a response object that is returned to the user.
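
The resolution steps described in this section can be sketched roughly as below. REGISTRY, resolve_feature_set_ref, and build_lookup_keys are hypothetical names, and the key layout is purely illustrative; it is not the key format Feast actually writes to Redis.

```python
from typing import Dict, List, Tuple

# Hypothetical registry: (project, feature name) -> (feature set name, latest version).
REGISTRY: Dict[Tuple[str, str], Tuple[str, int]] = {
    ("my_customer_project", "daily_transactions"): ("my_customer_feature_set", 3),
}


def resolve_feature_set_ref(project: str, feature_name: str) -> str:
    """Expand a bare feature name into '<project>/<feature-set-name>:<version>'."""
    feature_set, version = REGISTRY[(project, feature_name)]
    return f"{project}/{feature_set}:{version}"


def build_lookup_keys(feature_set_ref: str, entity_rows: List[Dict[str, object]]) -> List[str]:
    """Build one store key per entity row. The key layout is illustrative only."""
    keys = []
    for row in entity_rows:
        # Sort entity names so the same entities always yield the same key.
        entity_part = ",".join(f"{k}={row[k]}" for k in sorted(row))
        keys.append(f"{feature_set_ref}|{entity_part}")
    return keys
```

The point of the sketch is that the lookup is keyed on the feature set reference plus the entities, not on the individual feature name the user supplied.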

5.2 During batch serving

The batch serving case is very similar to the online serving case, but with more complexity on queries and joins.

The user provides the following during batch retrieval

A list of feature references
A list of entities paired with timestamps

Feature references are converted into their full form and used to create feature set references (as in online serving). In the case of BigQuery, the feature set reference maps directly to a table. For each feature set table that Feast needs to query, Feast runs a point-in-time correct query over the requested feature columns using the entities and timestamps. This produces a result table containing the user's requested feature data for those entities and timestamps, but for one specific feature set only.

Feast then uses the entity columns in each feature set table as a means of joining the results of these sub-queries into a single resultant dataframe.
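
To make "point-in-time correct" concrete, here is a small in-memory sketch. It is a simplification under stated assumptions: each table holds a single feature, tables are plain dictionaries rather than BigQuery tables, and the names FeatureTable, point_in_time_value, and batch_retrieve are made up for illustration.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

# Hypothetical feature table: entity id -> [(event timestamp, feature value)],
# sorted by timestamp.
FeatureTable = Dict[str, List[Tuple[int, float]]]


def point_in_time_value(table: FeatureTable, entity_id: str, ts: int) -> Optional[float]:
    """Return the latest value recorded at or before `ts`, or None if there is none."""
    history = table.get(entity_id, [])
    timestamps = [event_ts for event_ts, _ in history]
    idx = bisect_right(timestamps, ts)
    return history[idx - 1][1] if idx else None


def batch_retrieve(tables: Dict[str, FeatureTable], entity_rows: List[Tuple[str, int]]) -> List[dict]:
    """Run one point-in-time lookup per table, then join results on (entity, timestamp)."""
    result = []
    for entity_id, ts in entity_rows:
        row = {"entity_id": entity_id, "timestamp": ts}
        for feature_name, table in tables.items():
            row[feature_name] = point_in_time_value(table, entity_id, ts)
        result.append(row)
    return result
```

A value is only ever taken from at or before the requested timestamp, so a training row never sees feature data from its own future.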

5.3 During ingestion of data into stores

When loading data into Feast, data first needs to be converted into FeatureRow format and then pushed into a Kafka stream.

During this conversion to feature row form, it is necessary to set a field called feature_set with the feature set reference. To reiterate, the feature set reference looks something like: <project>/<feature-set-name>:<feature-set-version>

Ingestion jobs that pick up these rows are then able to easily identify the row as belonging to a specific project and feature set. The jobs then write all of these rows to all of the stores that subscribe to these feature sets.
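
A minimal sketch of this fan-out, assuming a simplified stand-in for the FeatureRow message and a hypothetical subscriptions mapping from store name to the feature set references it subscribes to:

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class FeatureRow:
    """Simplified stand-in for Feast's FeatureRow message."""
    feature_set: str                      # '<project>/<feature-set-name>:<version>'
    fields: Dict[str, object] = field(default_factory=dict)


def route_rows(rows: List[FeatureRow], subscriptions: Dict[str, Set[str]]) -> Dict[str, List[FeatureRow]]:
    """Fan each row out to every store subscribed to the row's feature set."""
    written: Dict[str, List[FeatureRow]] = defaultdict(list)
    for row in rows:
        for store, feature_sets in subscriptions.items():
            if row.feature_set in feature_sets:
                written[store].append(row)
    return dict(written)
```

Note that the routing decision depends entirely on the feature_set field carried inside each row.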

6. Problems with the current implementation

6.1 Feature set versions are unnecessary:

The concept of feature set versions was introduced in order to allow users to reuse feature set names. However, versions add complexity at both ingestion time and retrieval time. Users need to know the correct version of the feature set to ingest data to and to retrieve data from. If they don't pin their retrieval to a specific version, they risk having their system go down at a version increment.

6.2 Projects could be unnecessary at the top of the concept hierarchy:

Projects as a concept were introduced to provide a means of

Isolation between users: Users can register the same feature sets and features within their own project namespace without conflicts arising between users.
Access control: Projects provide a top level hierarchy that makes access control more convenient to implement.
Ease of feature retrieval: By introducing naming constraints at the project level, it is easier to logically group and reference features by name. Thus, projects provide a way of grouping based on retrieval, whereas feature sets provide a means of grouping based on ingestion.

The problem with projects is that they introduce a layer into the concept hierarchy that makes Feast harder to understand and could be adding unnecessary complexity. It's possible that all of the above requirements for introducing projects could be addressed while still maintaining feature sets as the top level concept.

6.3 Projects are a cause for code smell in the data model:

There are currently three locations where projects occur.

Ingestion (FeatureRows)
Stores (tables and keys)
Serving/retrieval (incoming queries)

The current approach has code smell in the fact that FeatureRows have to know their own identity. Today, having each FeatureRow know its own identity allows Feast to consume from topics that contain mixed feature sets (versions and names). Feast is able to differentiate FeatureRows from each other and knows how to interpret their contents based on a feature set reference contained within the row.

However, in the case that Feast were to consume features from an external stream that it had no control over (not even the data model), Feast would not have the feature set reference conveniently available inside the event payload.

The second occurrence of projects is in the store. Tables are currently named according to projectName_featureSet_version. Projects are a necessity here since feature set names can be duplicated across projects. However, projects are not essential complexity in the same way a feature set is, and they don't seem natural to encode into the data model itself.

6.4 Feature sets are a leaky abstraction:

Feature sets are a core part of the existing data model. Feature data is stored on a feature set within a feast store like Redis or BigQuery. In order to find the features a user is looking for, it is still necessary to determine the feature set they need from their feature reference. This seems to work at retrieval time since Feast Serving can maintain a cache of available feature sets (albeit introducing a new inefficiency during lookup). Two problems exist here:

There is a disconnect between how users are producing data (feature set references) and how users are consuming data (feature references). Users are loading in FeatureRows into feature sets, but they are querying out features from projects. Ideally these two concepts wouldn't be so distinct.
Currently, feature references are defined as follows: <project>/<feature-name>:<feature-version>. However, the concept of a feature-version doesn't exist. Features currently inherit their version from their feature set, so feature references still contain trace information about the parent feature set.

woop commented 4 years ago

7 Proposed changes

This section will contain proposed changes to Feast. These changes can serve as "straw men" to further the discussion.

7.1 Remove versions and migrate user data on changes

This change attempts to address problem (6.1 Feature set versions are unnecessary)

The removal of feature set version has been proposed and discussed in #386. After the introduction of mutable feature sets in Feast 0.6, there will no longer be a value to keeping versions. Users will be able to make changes to their existing feature sets and reuse the feature set name. This would be a major quality of life upgrade for our users.

7.2 Projects as a property of feature sets

This change attempts to address problems (6.2 Projects could be unnecessary at the top of the concept hierarchy, 6.3 Projects are a cause for code smell in the data model, 6.4 Feature sets are a leaky abstraction)

Any changes to the data model or user facing API would require migrations, so the following proposal should not be taken lightly. That being said, if we believe the current API needs to change then the change should be incorporated as soon as possible. With the removal of feature set versions in 0.6, teams will have to migrate their data in any case, which would be an opportune time to cement these changes.

7.2.1 Change 1: Make projects an attribute/property of features

The idea here is to remove projects at the top of the concept hierarchy

Feature sets would have to be globally unique by name
If no project is set, all features that are created enter a global project
If a project is set, each feature created will have its project set.
Retrieval based on feature set is allowed again.

Pros

The retrieval API is simplified and the leaky abstraction from (6.4) is removed.
Users don't need to think about or use projects if they don't want to.
Projects are still available as a retrieval level grouping.
No changes to data model required
Access control can still be allowed at the project level
It would allow us to either commit to projects or remove projects from the retrieval API, with less disruption to users and administrators.

Cons

More restrictive in terms of available feature set names that users can register since they are unique globally. This could also be a bigger pain if users eventually cannot see all feature sets due to access control.
Seems like a half-step towards a final API (project vs non-project).
Adds complexity in the amount of constraints that a user needs to think about (fields unique in feature set, feature set unique globally, fields unique in a project).

Feature reference changes: Feature references could become <feature-set>:<feature>, <project>/<feature>, <project>/<feature-set>:<feature>, or, if a project is set, <feature>.
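
For discussion purposes, the four forms listed here could be accepted by one parser along these lines. ParsedRef and parse_ref are hypothetical names, versions are assumed to be gone (as proposed in 7.1), and default_project stands in for a project set outside the reference.

```python
from typing import NamedTuple, Optional


class ParsedRef(NamedTuple):
    project: Optional[str]
    feature_set: Optional[str]
    feature: str


def parse_ref(ref: str, default_project: Optional[str] = None) -> ParsedRef:
    """Accept '<feature>', '<feature-set>:<feature>', '<project>/<feature>'
    or '<project>/<feature-set>:<feature>'."""
    project, slash, rest = ref.rpartition("/")
    feature_set, colon, feature = rest.rpartition(":")
    if not feature:
        raise ValueError(f"feature name missing in reference: {ref!r}")
    return ParsedRef(
        project=project if slash else default_project,
        feature_set=feature_set if colon else None,
        feature=feature,
    )
```

The sketch also shows the cost noted in the cons: with four accepted shapes, the caller has to reason about which parts were given and which were inferred.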

Data model changes: A global project would be used to prefix feature sets that aren't within a specific project: global__feature_set_1 or my_project_1__feature_set_2

Concept hierarchy changes: Feature sets would be the top of the concept hierarchy. Users would not need to think about projects.

7.2.2 Change 2: Remove projects from data model

This is an extension of (Change 1).

This proposed change is to make projects purely a retrieval abstraction.

Projects would only be used for retrieval from serving. It would be used to identify feature sets and feature names.
Tables are stored based on feature set names (as in 0.3), without a project prefix.
Rows are ingested without project identifiers (as in 0.3)

Pros

Simplifies the data model and makes it easier for users to ingest data without having to know the project a feature set is associated with.
Will eventually allow a feature set and its features to be available in multiple projects

Cons

This would require a data migration and a change in the way features are ingested into Feast deployments.

Feature reference changes: Same as (Change 1)

Data model changes:

Feature sets will be stored by name only in tables: feature_set_1 or feature_set_2
FeatureRows will be ingested with only a feature set name (assuming versions are removed)

Concept hierarchy changes: Same as (Change 1)

7.2.3 Change 3: Allow feature sharing between projects

This is an extension of (Change 2).

In (Change 1) projects would become an attribute of a feature. The goal is still to provide a convenient way for access control, isolation, and referencing features.

If features are still only unique up to a project or feature set level, then referencing features in different contexts (projects or domains) or directly by name will still be difficult. References would become that_other_teams_project/feature_name.

Instead of having a one-to-many mapping from project to features, this proposal is to make the mapping a many to many relationship. The same features can be found in multiple projects. This would allow a user to set their project and reference these features directly by name.
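
One way to picture the many-to-many relationship is a catalog that records which features are visible in which projects. This is only a sketch; ProjectCatalog and its methods are hypothetical and say nothing about how Feast Core would store the mapping.

```python
from collections import defaultdict
from typing import Dict, Set


class ProjectCatalog:
    """Hypothetical many-to-many mapping between projects and features."""

    def __init__(self) -> None:
        self._features_by_project: Dict[str, Set[str]] = defaultdict(set)

    def share(self, feature: str, project: str) -> None:
        """Make `feature` visible in `project`; a feature may be shared with many projects."""
        self._features_by_project[project].add(feature)

    def resolve(self, feature: str, project: str) -> str:
        """Resolve a bare feature name within the caller's current project."""
        if feature not in self._features_by_project[project]:
            raise KeyError(f"{feature!r} is not available in project {project!r}")
        return feature
```

Once a feature has been shared with two projects, users of either project can refer to it by its bare name without an external project prefix.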

Pros

Allows a feature to be shared across projects, without having to reference the feature name with the prefix of the external project. This allows various contexts to be supported for each feature.

Cons

Feature reference changes: Extension of (Change 2), but allow a single feature to be referenced directly as feature_name from various projects.

Data model changes: None, same as (Change 2)

Concept hierarchy changes: None, same as (Change 2)

7.2.4 Change 4: Consider renaming projects

This is an extension of (Change 3).

Given that features would occur in multiple projects, these projects would probably be logically grouped according to various contexts that are not necessarily related to user projects, instead they could be grouped arbitrarily. A potentially more intuitive name might be applicable for referencing features, such as feature groups, domains, or repositories.

This proposal would require more thought, and is probably safe to ignore for the time being.

7.2.5 Change 5: Remove feature references from feature rows

This change would attempt to address (6.4 Feature sets are a leaky abstraction)

Instead of having FeatureRows contain a "feature_set" field in which producers set the identity of a feature set, FeatureRows should be identified by their source location (table, topic). The means of identifying the source data should be contained within the feature set specification.
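
A rough sketch of what source-based identification could look like, assuming each feature set specification names exactly one source topic. FeatureSetSpec, index_by_topic, and identify are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FeatureSetSpec:
    """Hypothetical feature set specification that names its own source."""
    name: str
    source_topic: str


def index_by_topic(specs: List[FeatureSetSpec]) -> Dict[str, str]:
    """Map each source topic to exactly one feature set; shared topics are rejected."""
    topics: Dict[str, str] = {}
    for spec in specs:
        if spec.source_topic in topics:
            owner = topics[spec.source_topic]
            raise ValueError(f"topic {spec.source_topic!r} is already used by {owner!r}")
        topics[spec.source_topic] = spec.name
    return topics


def identify(topic: str, topics: Dict[str, str]) -> str:
    """Infer the feature set of an incoming row from the topic it arrived on."""
    return topics[topic]
```

The uniqueness check in index_by_topic is where the "topic explosion" concern comes from: rows can no longer share a topic across feature sets.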

Pros

Allows FeatureRows to be ingested without producers having to provide the identity of that feature row. This makes publishing FeatureRows easier for producers.
More flexibility in changing feature sets and consuming the same sources of data from multiple Feast deployments.
Allows for eventually supporting non-FeatureRow data sources.

Cons

Topic explosion for batch ingestion. This change would probably require a different means of dynamic ingestion of batch data that doesn't require shared topics.

Feature reference changes: None

Data model changes: None

Concept hierarchy changes: None

Wirick commented 4 years ago

So just to share a possibility from my experience and wheelhouse, the plan for feast in my org is to have a features repo that defines avro schemas for feature sets. The feature set schemas (similarly to all of our event schemas) are generated programmatically, along with python and go glue code, annotated with version for evolution purposes, and then applied on master merge to the feast clusters.

When a user wants to ingest features, they use the generated schema object to ingest, validate, and publish a dataframe (note as an organization we use "ts" instead of datetime, so this also abstracts this difference):

from pmfeatures.buyer import CustomerFeatures

def generate_features():
    ...
    customer_features = CustomerFeatures(
        customer_uuid=features["customer_uuid"],
        ts=features["ts"],
        features=features,
    )
    customer_features.publish()
    ...

We automatically annotate schema changes with an updated version, and enforce schema evolution rules (it seems we would want similar rules for updating feature sets in bigquery if you want to use the same table) to make sure schemas are forward compatible. If feast had an ability to specify the version, this is the one I would use. However, when ingesting features the version doesn't usually appear, and by enforcing schema evolution rules we can be sure that any serving code will work with updated schemas, since the only allowed operations are adding new nullable fields and relaxing the type of a field.
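
To illustrate the kind of rule being described (only new nullable fields and relaxed types are allowed), a check could look roughly like this. It is a sketch, not our actual tooling: Field, WIDENINGS, and evolution_errors are hypothetical, the widening pairs are assumed examples, and treating a removed field as an error is an assumption implied by "only allowed operations".

```python
from typing import Dict, List, NamedTuple


class Field(NamedTuple):
    type: str
    nullable: bool


# Hypothetical widening rules; a real registry would define its own.
WIDENINGS = {("int", "long"), ("float", "double")}


def evolution_errors(old: Dict[str, Field], new: Dict[str, Field]) -> List[str]:
    """List every change that is neither a new nullable field nor a relaxed type."""
    errors = []
    for name, old_field in old.items():
        if name not in new:
            errors.append(f"field removed: {name}")
            continue
        new_field = new[name]
        same_type = new_field.type == old_field.type
        if not same_type and (old_field.type, new_field.type) not in WIDENINGS:
            errors.append(f"incompatible type change: {name}")
        if old_field.nullable and not new_field.nullable:
            errors.append(f"field made non-nullable: {name}")
    # Sorted so that the error order is deterministic.
    for name in sorted(new.keys() - old.keys()):
        if not new[name].nullable:
            errors.append(f"new field must be nullable: {name}")
    return errors
```

An empty list means the new schema only adds nullable fields or relaxes existing ones, so serving code written against the old schema keeps working.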

I mention this because we are adopting the confluent schema registry in our general kafka strategy so that we don't have to have schema information encoded in the body of the message, so it seems like it could be used to help solve the outlined issues about an event knowing about its feature set (6.3).

Additionally, we have a concept of namespace in our schemas, and we use that in the feature set name, and I've found that most users want the latest version of a feature set. It's for this reason that project and version seem safe to remove, perhaps by incorporating them into the 7.2.1 Change 1. The first piece of utility code that I wrote for my feature set objects was a method that takes a list of features and annotates them with the latest version (it calls feast core, I believe).

woop commented 4 years ago

avro schemas for feature sets

We've run into this pain point before. Essentially the users have data in various formats and they need to map it to a schema in Feast. Feast now introduces its own format and supported data types, which is in some part driven by protos. But we could in theory move towards a model where the protos are limited to defining the Feast API, while the data still conforms to a standard like Avro or Arrow. I think that would make data handling somewhat easier. @ches, not sure if you have opinions here.

If feast had an ability to specify the version, this is the one I would use

We are removing versions as they stand right now. After that we won't have a way to set metadata on a feature set, but we will have a way to set it on a feature. I think this would be a strong use case for being able to capture metadata on a feature set.

Additionally, we have a concept of namespace in our schemas, and we use that in the feature set name, and I've found that most want the latest version of a feature set.

Is this simply a prefix to the name? Who defines these names?

it's for this reason that project and version seem safe to remove, perhaps by incorporating into the 7.2.1 Change 1

The million dollar question is whether projects are still valuable as a means of isolation. We could implement 7.2.1, but how would we deal with users wanting their own workspaces where they could create their own feature sets and features? Name conflicts would be mysterious, because you won't be able to see the feature sets in another project that you have a conflict with.

I just want to be 100% sure that this change in the way we use projects is actually an improvement.

woop commented 4 years ago

Just an update on this issue. I'd like to delay a decision on this until we have higher adoption. The most conservative approach, and the one I think we will try to enforce for 0.5, is as follows:

Changes to Feast:

Feature references cannot contain a project when it is in string form.
Projects must be defined once per incoming request. If no project is defined, the default project is used. The default project is simply called "default". This can be parametrized in the future.
Subscriptions can only depend on feature sets, not projects.
Feature names can again be duplicated in different feature sets.
Feature names do not have to be unique in a project.
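
A rough sketch of how retrieval could behave under these rules, assuming a hypothetical in-memory registry keyed by project and feature set. find_feature_set and the Registry shape are illustrative only; the sketch shows that once names may repeat across feature sets, a bare feature name can become ambiguous.

```python
from typing import Dict, List, Optional

DEFAULT_PROJECT = "default"

# Hypothetical registry: project -> feature set name -> feature names.
Registry = Dict[str, Dict[str, List[str]]]


def find_feature_set(registry: Registry, feature: str, project: Optional[str] = None) -> str:
    """Infer the feature set for a bare feature name within one project."""
    feature_sets = registry.get(project or DEFAULT_PROJECT, {})
    matches = sorted(fs for fs, features in feature_sets.items() if feature in features)
    if not matches:
        raise KeyError(f"feature {feature!r} not found")
    if len(matches) > 1:
        # Names may repeat across feature sets, so a bare name can be ambiguous.
        raise ValueError(f"feature {feature!r} is ambiguous across feature sets: {matches}")
    return matches[0]
```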

Changes to policies around feature creation:

Only default project to be used for production systems. This ensures uniqueness of important feature sets and allows us to move towards either the "tag" based model in the future, or double down on the isolation model.
Non-default projects will be allowed for development purposes, but no dependencies should exist for production pipelines or online systems.

What does this buy us?

The option to either commit to the tag based approach or the project isolation based approach.
For most functional parts, everything stays the same. Users can still prefix their feature sets with project names if they want isolation.
Authorization functionality will have to be updated in order to support feature set level access control, but I am happy delaying that until we have a better understanding of what users want.

woop commented 4 years ago

I've spoken to quite a few folks over the last couple of weeks on this topic. It seems everyone wants something different so I would appreciate some input in order to get everyone aligned and to commit to an approach.

Background

Feast 0.3

Feature sets add complexity to feature references.
Users just want to work with feature names.

Feast 0.4

Removed feature sets from feature references, but added projects. Now projects add complexity to feature references.
Improvement because it's possible to select features from multiple feature sets.
Features must be unique within a project. Downside is that it's unintuitive to have name conflicts on feature names across feature sets.

Feast 0.5

Projects are taken out of feature references. Only one project can be set per request.
Projects are optional (if not provided, uses default project)
Feature sets optional (if not provided infers feature set from feature name)
Downside is that feature conflicts are possible.

What do we want?

Feature references should be as simple as possible (just the feature name)
Have isolation that prevents name conflicts

Proposal

Both proposals

Projects are sent outside of feature reference but still in the request
Feature sets do not have to be provided. They become unnecessary in retrieval.

Proposal 1: Projects as selectors/tags

Feature set names are globally unique (not just in a project)
Feature names only have to be unique within a feature set
In this approach a list of features can be added to a project. Essentially a project is a view of a subset of features (names). Projects are collections of features that are relevant to the retriever. Projects might not be the right term to use. The term project could also be renamed if it's unintuitive.

Proposal 2: Projects as folders

Keep projects out of feature references, enable global uniqueness in feature names again in Feast 0.5.

Isolation still granted by projects
Features can always be uniquely referenced directly by name
Downside is that feature name conflicts within a project will be a lot more common than in Proposal 1, but feature set name conflicts will be rare.

I am ignoring entities as part of this discussion because adding them to a feature reference (in the way that Uber does it) would not be a breaking change.

Feedback would be highly appreciated now in order to avoid breaking changes in the future. If I don't hear back from anyone then we will probably proceed with Proposal 2 since it is more conservative and maintains project functionality.

cc @khorshuheng @mrzzy @ches @idahoakl @Yanson

mrzzy commented 4 years ago

What Currently Exists

Feature Sets

v0.3: Feature sets add complexity to feature references. Users just want to work with feature names. v0.4: Improvement because it's possible to select features from multiple feature sets.

Feature Sets were introduced as a way to group data sources with a schema of the Features that are ingested into Feast: they store configuration about the data source and the schema of the sourced features. As a purely ingestion concept, a Feature Set typically bears no relation to how users retrieve their features. Hence users will think that it is a hassle to have to deal with Feature Sets.

v0.4: Features must be unique within a project. Downside is that it's unintuitive to have name conflicts on feature names across feature sets.

I think this happens when we are trying to stretch the Feature Set, a purely ingestion concept, to also serve as a way to logically group Features (i.e. driver features, customer features). Feature Set schemas are coupled to how the data is stored in the data sources, so it's not possible to serve the other aim of logically grouping features with feature sets alone.

v0.4: Downside is that it's unintuitive to have name conflicts on feature names across feature sets.

Here the intuitive need for logical grouping and namespacing of Features is present. Users work around this by prefixing their feature names: <entity>_<feature name> etc. I think it would still be valuable to have some way of grouping related features together for better organisation, without being tied down by the coupling with the data source that is present in FeatureSets.

Projects

Removed feature sets from feature references, but added projects. Now projects add complexity to feature references.

Projects were introduced to attempt to solve the shortcomings of Feature Sets and as a stepping stone for Authentication. Projects also gave users the ability to namespace Features. However, how projects should be used is not clearly defined, nor do they fit cleanly into the data model. Should each project be meant for an entire team/company, or should a project be created for each model? Just like Feature Sets, projects add another layer of complexity that users would rather not have to worry about.

Projects present an isolated view for users of Feast, providing a bubble for users to work in. On the surface, this is beneficial as users will never step on the toes of users in another project. However, this is antithetical to one of the objectives of Feast: Feature Reuse. Users would not be able to discover new features on Feast, as we present only the features in this artificial project bubble by default.

Projects, as they currently stand, do not seem to find a proper place in the Feast data model. Within Gojek, we are trying to move away from projects by moving all Features/Feature Sets into one mono project.

What We Want

Feature references should be as simple as possible (just the feature name)

In my opinion, this is a misnomer, as users will engineer complex feature names to manually organize and namespace their Features (i.e. <team>_<model>_<name>). This pushes complexity into the Feature names instead of removing it as implied.

Proposals

Proposal 2: Projects as folders

Keep projects out of feature references, enable global uniqueness in feature names again in Feast 0.5.

Globally unique feature names push users to namespace their Feature names manually. This wild west of naming with no conventions enforced might result in Feast becoming a sea of cryptic Feature names. Some users might start naming features <model>_<feature> while others might prefer <team>_<entity>_<feature>.

Isolation still granted by projects. Downside is that feature name conflicts within a project will be a lot more common than in Option 1, but feature set name conflicts will be rare.

Removing the namespacing effect of Projects is effectively akin to what we did to Feature Sets in v0.4: turning them into another leaky abstraction.

Proposal 1: Projects as selectors/tags

In this approach a list of features can be added to a project. Essentially a project is a view of a subset of features (names). Projects are collections of features that are relevant to the retriever. Projects might not be the right term to use. The term project could also be renamed if it's unintuitive.

As outlined above, there is still a need for a way to create a logical grouping of Features that is not tied to the data source the way a Feature Set is. This "reincarnation" of projects as a view of Features exposed by Feature Sets could serve as that missing piece. However, I think trying to anticipate what could be relevant to the retriever in a "view" should be a non-goal, as it's hard to anticipate.

Firming up this "view" into something more concrete: a Feature Entity that logically groups features based on a specific concrete entity. For example, let's say that I would like to track features for a driver entity: their average rating and vehicle model. I can create a Feature Entity driver and expose the features driver.avg_rating and driver.vehicle_model. It references its features from the Feature Sets and exposes them for retrieval to the user (akin to what a Service is to a Deployment in Kubernetes). This decouples the ingestion/data sourcing of the features from the logical grouping, allowing the logical grouping to vary independently of how data is sourced or of the source data schema. Features in a Feature Entity can reference different Feature Sets, and thus also be populated by different data sources.

Each Feature Entity should be an authoritative view of the Features for that entity. A driver Feature Entity should contain all the Features associated with driver. This simplifies feature discovery by allowing users to browse features by entity. Like Feature Sets, it also namespaces the Features it exposes. During retrieval, users retrieve from Feature Entities instead of Feature Sets. Retrieval across Feature Entities is allowed. Feature References take the following format: <feature entity>.<name>. The Feature Entity part isn't optional.

Consideration

Adding a new concept would add even more complexity to an already bloated data model

Currently, we regularly see Feature names that are some combination of entity name and actual feature name (e.g. median_price_car or driver_average_rating). As we have seen, having a single unique feature name does not mean complexity is removed, rather that it moves into the name itself. Retrieval from Feature Entities could look like car.median_price or driver.average_rating.
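
To make the Feature Entity idea a bit more tangible, here is a small sketch of how a reference could be resolved to the feature set that backs it. Everything here is hypothetical, including the feature set names and the FEATURE_ENTITIES structure; it only illustrates the decoupling of logical grouping from data sourcing.

```python
from typing import Dict, Tuple

# Hypothetical Feature Entity definitions: each exposed name points at the
# (feature set, feature) pair that actually stores the data.
FEATURE_ENTITIES: Dict[str, Dict[str, Tuple[str, str]]] = {
    "driver": {
        "avg_rating": ("driver_ratings_fs", "avg_rating"),
        "vehicle_model": ("vehicle_registry_fs", "model"),
    },
}


def resolve_entity_ref(ref: str) -> Tuple[str, str]:
    """Resolve '<feature entity>.<name>' to its backing (feature set, feature)."""
    entity, dot, name = ref.partition(".")
    if not dot or not entity or not name:
        raise ValueError(f"expected '<feature entity>.<name>', got {ref!r}")
    return FEATURE_ENTITIES[entity][name]
```

In this sketch driver.avg_rating and driver.vehicle_model come from two different feature sets, yet the user only ever sees the driver namespace.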

woop commented 4 years ago

Thanks for this post @mrzzy, you've outdone yourself as usual.

Hence users will think that it is a hassle to have to deal with Feature Sets.

Yip, agreed.

v0.4: Features must be unique within a project. Downside is that it's unintuitive to have name conflicts on feature names across feature sets.

I think this happens when we are trying to stretch the Feature Set, a purely ingestion concept, to also serve as a way to logically group Features (i.e. driver features, customer features). Feature Set schemas are coupled to resemble how the data is stored in the data sources, so it's not possible to serve the other aim of logically grouping features with just feature sets alone.

I agree with the point you are making, but I am not sure how this relates to the point about projects that you are replying to. If anything, projects allow for referencing features in a way that doesn't require a feature set and gets past that grouping. These conflicts would still exist if we had features defined one by one in a project.

v0.4: Downside is that it's unintuitive to have name conflicts on feature names across feature sets.

Here the intuitive need for logical grouping and namespacing of Features is present. Users work around this by prefixing their feature names: _ etc. I think it would still be valuable to have some way of grouping related features together for better organisation, without being tied down by the coupling with the data source that is present in FeatureSets.

Yes, agreed. I don't see a major downside to more verbose feature names for the time being though, especially if it buys us time to make more informed decisions.

Removed feature sets from feature references, but added projects. Now projects add complexity to feature references.

Projects were introduced to attempt to solve the shortcomings of Feature Sets and as a stepping stone for Authentication. Projects also gave users the ability to namespace Features. However, how projects should be used is not clearly defined, nor is it clear how they fit into the data model. Should each project be meant for an entire team/company, or should a project be created for each model? Just like Feature Sets, this adds another layer of complexity that users would rather not have to worry about.

As @ches noted, auth can happen at other layers, it doesn't have to be at the project layer. Projects were meant to abstract away feature sets (and ingestion level grouping) and to allow for direct referencing of features within your project (or across). I think the design mistake here, looking back, was incorporating it into the feature reference. Projects as an isolation system still add value in my opinion, but should not add complexity to the workflow of the user, as you have rightly pointed out.

Projects present an isolated view for users of Feast, providing a bubble for users to work in. On the surface, this is beneficial as users will never step on another's toes in another project. However, this is the antithesis of one of the objectives of Feast: Feature Reuse. Users would not be able to discover new features on Feast as we present only the features in this artificial project bubble by default.

I wouldn't say that this is true. The original design of projects was to allow for sharing of features (and later possibly entities) across projects. So you should be able to retrieve features from multiple projects, not just one, in a single query. That is why projects was included in the feature reference.

Projects, as they stand currently, do not seem to find a proper place in the Feast data model. Within Gojek, we are trying to move away from projects by moving all Features/Feature Sets into one mono project.

This isn't exactly true. We aren't moving away from projects, we are collocating our production feature sets in a single project, which happens to be the default project, in order to finish the very discussion we are having right now and decide on the future direction of the data/concept model. The reason for this approach is to ensure forward compatibility.

If we go the tag/label based approach then we might need to have unique feature set names, so collocating all feature sets in a project ensures that. If we go the folder based approach then we have also ensured that because all features and feature sets are in one project.

Finally, using one project, especially the default project, also ensures that all feature references on clients only have the feature names. So it's less likely that we will have a breaking change in the future than having multi-component feature references client side.

Feature references should be as simple as possible (just the feature name)

In my opinion, this is a misnomer, as users will engineer complex feature names to manually organize and namespace their Features. (ie ) etc. This pushes complexity into the Feature names instead of removing it as implied.

At the modeling stage the feature reference will be collapsed into a single string, so whatever concepts we come up with will just fall away. The question is just how prescriptive we want to be and how many concepts we want to introduce. So we might not be able to get away with just feature names, but feature names are the essential complexity we have right now. Projects, ~versions~, feature sets, these are all incidental or optional.

Globally unique feature names push users to manually namespace their Feature names of their own accord. This wild west of naming with no conventions enforced might result in Feast becoming a sea of cryptic Feature names. Some users might start naming features while others might prefer _.

I don't think anybody would disagree with you, but from an API design perspective it is easier to go from my_model_my_feature to my_model.my_feature if Feast introduces a concept, compared to the other way around. An assumption here is that we will have a better understanding of the domain in the future than we have now, so we want to delay the introduction of new concepts until we are absolutely sure that it makes sense. So globally unique names simply prevent breaking changes.

Firming up this "view" into something more concrete: a Feature Entity that logically groups features based on a specific concrete entity. For example, let's say that I would like to track features for a driver entity. I would like to track his average rating and vehicle model. I can create a Feature Entity driver and expose the features driver.avg_rating & driver.vehicle_model. It references its features from the Feature Sets and exposes the features for retrieval to the user (akin to what a Service is to a Deployment in Kubernetes). This decouples the ingestion/data sourcing of the features from the logical grouping, allowing the logical grouping to vary independently of how data is sourced and of the source data schema. Features in a Feature Entity can reference different Feature Sets, and thus also be populated by different data sources.

Each Feature Entity should be an authoritative view of the Features for that entity. A driver Feature Entity should contain all the Features associated with driver. This simplifies feature discovery by allowing users to browse features by entity. Like Feature Sets, it also namespaces the Features it exposes. During retrieval, users retrieve from Feature Entities instead of Feature Sets. Retrieval across Feature Entities is allowed. Feature References take the following format: <feature entity>.<name>. The Feature Entity part isn't optional.

I resisted talking about entities because I thought the conversation might be orthogonal, but now might be a good time to bring it up.

My first question to you is: How is a feature entity different from an entity? Do we need a new concept if it already exists? All of our existing features only occur on a fixed entity (which might be composited) so it can in theory be grouped and exposed based on that.

The reason I thought this was an orthogonal discussion is that unique features could be a good first step. For example, let's say you have an account balance feature on drivers. The feature ref is just account_balance. Tomorrow another team says they want to use account_balance for customers. At that point you can introduce entities into the feature reference to have customer.account_balance or driver.account_balance.

The way that Uber does feature references is something like

What I was expecting this view to morph into was the feature group, which was a grouping mechanism around features in a large namespace (around an entity). However, I think this might be an unnecessary complexity to introduce right now, given that there is a workaround in naming features. This is also why I am leaning towards Proposal 2 because it retains projects as a means of isolation (so that users can develop safely on one side, while prod systems run on the other side).

Your suggestion seems to be more focused on the entity component, which I think is the more natural part to introduce next. #405 is relevant here, since we would need to introduce entities as a top level concept for users to define. Then feature sets should probably select these entities during creation, instead of creating them. This would also help at retrieval time to dedupe incoming requests.

Am I correct in saying that entities and feature entities would be the same thing?

khorshuheng commented 4 years ago

Between proposals 1 and 2, I would choose 2, simply because of the limited scope of changes and the smaller amount of work needed for a migration. However, as @mrzzy mentioned, going down this path will only make feature names more complicated.

If we are allowed to propose non backward compatible changes:

If we use relational database as an analogy:

A project should be like a database. Users can choose not to create their own database and use a single, default database instead. If the client doesn't specify a database explicitly, either via the query or by setting the database beforehand, the default database is assumed. Queries across databases, however, are still possible by referencing the columns / features using the full path: [database].[table].[column].
Feature sets / tables are unique within a project / database, but not necessarily globally. Features / columns are unique within a feature set / table, but not across different tables. Table names are used as a way to namespace columns during query.
What @mrzzy proposed would be similar to a table view, in which columns in different tables / databases can be grouped into one logical view and queried as if it were a single table / feature set.
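
As a rough illustration of the database analogy, the sketch below resolves a dotted reference the way a SQL engine resolves `[database].[table].[column]`, falling back to a default project when none is given. The `resolve` function, the `ResolvedRef` type, and the project name "default" are assumptions made for this example, not existing Feast behaviour.

```python
from typing import NamedTuple, Optional

DEFAULT_PROJECT = "default"  # assumed name of the implicit project


class ResolvedRef(NamedTuple):
    project: str
    feature_set: str
    feature: str


def resolve(ref: str, current_project: Optional[str] = None) -> ResolvedRef:
    """Resolve '[project.]feature_set.feature' like '[database.]table.column'."""
    parts = ref.split(".")
    if any(not part for part in parts):
        raise ValueError(f"empty component in feature reference {ref!r}")
    if len(parts) == 3:
        # Fully qualified: works across projects, like a cross-database query.
        return ResolvedRef(*parts)
    if len(parts) == 2:
        # No project given: use the session's project, else the default one.
        return ResolvedRef(current_project or DEFAULT_PROJECT, *parts)
    raise ValueError(f"expected '[project.]feature_set.feature', got {ref!r}")


same_project = resolve("driver_stats.avg_rating")
cross_project = resolve("staging.driver_stats.avg_rating")
```

Note that in this sketch the feature set part is always required, which is the "very big change" relative to references that consist of the feature name alone.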

woop commented 4 years ago

Table names are used as a way to namespace columns during query.

So by implication you are suggesting feature_set:feature_name should be a possible feature reference?

khorshuheng commented 4 years ago

So by implication you are suggesting feature_set:feature_name should be a possible feature reference?

Yes, just like how it is not possible to run an SQL query without referencing the table names. I am aware, however, that this is a very big change from the previous versions of Feast.

Using familiar database concepts would address one of the pain points about Feast, namely its abstract concepts. Databases, tables, and views are concepts that data scientists are already familiar with.

woop commented 4 years ago

So by implication you are suggesting feature_set:feature_name should be a possible feature reference?

Yes, just like how it is not possible to run an SQL query without referencing the table names. I am aware, however, that this is a very big change from the previous versions of Feast.

It's not 100% clear what you mean. Tables / columns function quite similarly to feature sets and features right now. They are inferred when unambiguous, and otherwise conflict.

If you have

SELECT Col1 FROM table1

Then you don't specify the table as part of the column reference. If you have two tables then you have something like

SELECT a.Col1, b.Col1 FROM table1 a JOIN table2 b ON a.key=b.key

If you don't provide a table alias then you get a conflict.

The only real difference is that the table is provided per query/request.
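
The inference-versus-conflict behaviour described above can be reproduced with an in-memory SQLite database. This is only a sketch of the SQL side of the analogy; the table contents are made up, and the join column is named entity_id here rather than key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table1 (entity_id INTEGER, Col1 TEXT);
    CREATE TABLE table2 (entity_id INTEGER, Col1 TEXT);
    INSERT INTO table1 VALUES (1, 'from_table1');
    INSERT INTO table2 VALUES (1, 'from_table2');
    """
)

# One table: the column reference needs no table qualifier.
single = conn.execute("SELECT Col1 FROM table1").fetchall()

# Two tables, unqualified column: the engine cannot infer which Col1 is meant.
try:
    conn.execute(
        "SELECT Col1 FROM table1 a JOIN table2 b ON a.entity_id = b.entity_id"
    )
    ambiguous_error = None
except sqlite3.OperationalError as exc:
    ambiguous_error = str(exc)

# Two tables, qualified columns: the alias removes the conflict.
joined = conn.execute(
    "SELECT a.Col1, b.Col1 FROM table1 a JOIN table2 b ON a.entity_id = b.entity_id"
).fetchall()
```

The parallel to feature references is that a bare feature name works while it is unambiguous, and a feature set qualifier is only needed once two feature sets in scope define the same name.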

Using familiar database concepts would address one of the pain points about Feast, namely its abstract concepts. Databases, tables, and views are concepts that data scientists are already familiar with.

Yea I agree with this principle, but not necessarily the recommendation.

Wirick commented 4 years ago

Hi friends, just going to chime in here because I've been thinking about this from the perspective of Bigtable keys, and also the expressed desire for teams to collaborate and reuse feature references. We have a few different feature sets that have the same name, but they don't collide because we auto-prefix all of our feature set applies with the namespace of the feature set. We have also taken the route right now of only using one project, but we slip a namespace into our keys. We end up with something like

ml_project/risk_place:1:place_uuid=SOMEUUID
...
<project>/<namespace>_<feature_set_name>:<version>:<entity1>=VALUE

as our Cassandra keys, with the feature column having the feature name. While imagining how to construct our Bigtable keys, we want to make reads performant when looking up features from the same feature set, thus they must be ordered lexicographically. And teams may want to use features from the same feature set name across namespaces (the risk team and the buyer team both have features associated with places, for example), so the idea would be something like

<feature_set_name>#<entity1>=VALUE#<namespace/project>#<feature>

This would allow for performant reads of prefixes, where you only need to read one interval with every call, and you don't have to read other namespaces' features if you don't want to, but if you do it will be performant, since every entity's features are adjacent lexicographically. This requires coordination from the ML org, since you need to name your feature sets corresponding to the entities that represent their keys. I'm pretty agnostic as to what we ultimately do, since I can fit the implementation to be performant even if I need to do some fork trickery, but it does seem like project (I think in terms of namespaces) is a useful piece of information if you want to colocate yet partition features.
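
A small sketch of why this key layout keeps reads to a single interval: a sorted Python list stands in for Bigtable's lexicographically ordered row keys, and a prefix scan returns one contiguous run. The feature names and the helper functions `row_key` and `prefix_scan` are hypothetical, and the sketch assumes none of the key components contain the '#' delimiter.

```python
from bisect import bisect_left
from typing import List


def row_key(feature_set: str, entity: str, value: str, namespace: str, feature: str) -> str:
    # Layout from the comment above:
    # <feature_set_name>#<entity1>=VALUE#<namespace/project>#<feature>
    return f"{feature_set}#{entity}={value}#{namespace}#{feature}"


def prefix_scan(sorted_keys: List[str], prefix: str) -> List[str]:
    """Return the contiguous run of keys that start with the given prefix."""
    start = bisect_left(sorted_keys, prefix)
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]


keys = sorted([
    row_key("place", "place_uuid", "A", "risk", "fraud_score"),
    row_key("place", "place_uuid", "A", "buyer", "visit_count"),
    row_key("place", "place_uuid", "B", "risk", "fraud_score"),
    row_key("place", "place_uuid", "A", "risk", "chargeback_rate"),
])

# All namespaces for one entity: a single interval.
all_for_a = prefix_scan(keys, "place#place_uuid=A#")
# Only the risk namespace for the same entity: a narrower single interval.
risk_for_a = prefix_scan(keys, "place#place_uuid=A#risk#")
```

Because the namespace comes after the entity value, one entity's features from every team sit next to each other, while a longer prefix still isolates a single team's features.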

tfurmston commented 4 years ago

Feature sets are a core part of the existing data model. Feature data is stored on a feature set within a feast store like Redis or BigQuery. In order to find the features a user is looking for, it is still necessary to determine the feature set they need from their feature reference. This seems to work at retrieval time since Feast Serving can maintain a cache of available feature sets (albeit introducing a new inefficiency during lookup).

Can I ask why features are stored with this table structure? For example, why are the features not split across tables by feature instead of by feature set? Is it for performance reasons at query time?

Maybe it is tangential to this issue, but I am just trying to understand the reasoning for the design choice.

tfurmston commented 4 years ago

Globally unique feature names push users to manually namespace their Feature names of their own accord. This wild west of naming with no conventions enforced might result in Feast becoming a sea of cryptic Feature names. Some users might start naming features while others might prefer _.

If there is no mechanism for feature name consistency, then I believe that people will have poorly constructed and inconsistent names regardless of whether or not there are unique feature names.

For example, the same happens with column names in databases, with inconsistencies with the use of lower cases and upper cases. This happens regardless of whether people can use the same name in different tables. If you want some form of consistency in naming, then I think you should have a mechanism to do that and it is not clear to me that projects is that.

I agree that consistent naming is important. I am just not sure having the ability to use the same name in different projects is going to solve that issue.

tfurmston commented 4 years ago

The million dollar question is whether projects are still valuable as a means of isolation.

To my mind, the main proposition of projects would be to have more granular control of how features are organised across a company's feature store. This comes not just in terms of individuals wanting to develop in isolation, but also in terms of adding logical structure to the features such that they are easier to use and maintain across the business.

To use the example of data engineering, in the company where I work our BigQuery data warehouse is split across different BigQuery projects. For example, we have core data models that are made available across the business and come with a high level of maturity in terms of things such as SLAs on the corresponding tables. Meanwhile, we also have extension data models that are used by individual teams or groups of teams in a particular department. The barrier for putting ETLs into production in these extension models is lower, but the SLAs are sufficient for the needs of these teams.

Not only does this type of separation make logical sense for the business, but it also allows for easy control of user access settings. For example, if a user needs access to tables that are relevant to a given department, then it is only necessary to give them access to the few appropriate projects.

Naively I could see projects in FEAST playing an analogous role to this type of separation, e.g., one project could be the core features used across the company for a particular entity of interest, such as a customer.

woop commented 4 years ago

For example, why are the features not split across tables by feature instead of by feature set? Is it for performance reasons at query time?

Performance. We used to do this for Feast 0.1, but the performance was much worse. It requires a lookup for each feature, and this is especially costly for large joins. Not to mention that you would have 75% of the storage being used for the keys and 25% for the values.

If there is no mechanism for feature name consistency, then I believe that people will have poorly constructed and inconsistent names regardless of whether or not there are unique feature names.

Agreed.

For example, the same happens with column names in databases, with inconsistencies with the use of lower cases and upper cases. This happens regardless of whether people can use the same name in different tables. If you want some form of consistency in naming, then I think you should have a mechanism to do that and it is not clear to me that projects is that.

I agree that consistent naming is important. I am just not sure having the ability to use the same name in different projects is going to solve that issue.

Correct, projects just try to solve namespacing and isolation. They won't solve the naming issue.

Not only does this type of separation make logical sense for the business, but it also allows for easy control of user access settings. For example, if a user needs access to tables that are relevant to a given department, then it is only necessary to give them access to the few appropriate projects.

This was one of the reasons for bringing in projects in the first place. We wanted to have isolated namespaces with some form of access control. I still think that is valuable, and I agree with everything you said. Projects are an intuitive concept that people just "get".

And in fact our auth PR adds more controls there #554 which I hope we can expand upon.

All this being said, projects within feature references add complexity that isn't necessarily warranted, which is why I would like to phase that out. Meaning a project is not a means for sharing, it is a means for isolation, otherwise we move away from our goal of users referencing features by entity/feature name.

@ches has also called out that it would be just as simple to allow access control on feature sets as it is on projects.

We are structuring our projects internally as follows.

All production feature sets are version controlled as YAML in a single repository
We don't allow custom projects to be used. Only the default project is used for production
We have validation in CI for feature sets that will be applied
Once CI passes, the feature sets are updated in Feast Core
We don't allow duplicate feature names for the time being.
For feature retrieval, all of our users just use the feature name (inferred feature set, default project).
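
As a sketch of what the duplicate-name check and feature set inference in this workflow might look like (this is not the actual CI pipeline, which is not shown in this thread), the function below builds a feature name to feature set index and fails when a name is defined twice. `build_feature_index` and the sample feature sets are hypothetical.

```python
from typing import Dict, List


def build_feature_index(feature_sets: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each feature name to the single feature set that defines it.

    Raises ValueError if any feature name appears in more than one feature
    set, since a bare feature name could then no longer be resolved.
    """
    index: Dict[str, str] = {}
    conflicts: Dict[str, List[str]] = {}
    for fs_name, features in feature_sets.items():
        for feature in features:
            if feature in index:
                conflicts.setdefault(feature, [index[feature]]).append(fs_name)
            else:
                index[feature] = fs_name
    if conflicts:
        raise ValueError(f"duplicate feature names across feature sets: {conflicts}")
    return index


# With unique names, retrieval by feature name alone can infer the feature set.
index = build_feature_index({
    "driver_stats": ["driver_average_rating"],
    "car_stats": ["median_price_car"],
})
```

Running such a check before applying feature sets is what would make references that consist only of the feature name safe to resolve at retrieval time.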

Once we roll out access control we will also allow teams to start using custom project namespaces for their development and iteration, with different SLAs.

Naively I could see projects in FEAST playing an analogous role to this type of separation, e.g., one project could be the core features used across the company for a particular entity of interest, such as a customer.

Division of projects along entities seems like it could lead to users wanting to request features from multiple projects in one go, which leads to more complex feature references. One approach that has been mentioned that I think can achieve the same thing is namespacing through entities, so basically <customer>:<feature1> and eventually introduce a grouping concept <customer>:<feature-group>:<feature1>

But we would still have all features relevant to a consumer in a single project.

Yanson commented 4 years ago

All production feature sets are version controlled as YAML in a single repository

We don't allow custom projects to be used. Only the default project is used for production

We have validation in CI for feature sets that will be applied

Once CI passes, the feature sets are updated in Feast Core

We don't allow duplicate feature names for the time being.

Any chance you will open-source this CI pipeline? We want to do much the same thing.

woop commented 4 years ago

All production feature sets are version controlled as YAML in a single repository

We don't allow custom projects to be used. Only the default project is used for production

We have validation in CI for feature sets that will be applied

Once CI passes, the feature sets are updated in Feast Core

We don't allow duplicate feature names for the time being.

Any chance you will open-source this CI pipeline? We want to do much the same thing.

Don't mind open sourcing it, but not sure if we will have an opportunity any time soon. I'll set a reminder for myself.

tfurmston commented 4 years ago

Thanks for the response @woop

Performance. We used to do this for Feast 0.1, but the performance was much worse. It requires a lookup for each feature, and this is especially costly for large joins. Not to mention that you would have 75% of the storage being used for the keys and 25% for the values.

OK, I see. Thanks for explaining. I assumed it must be something like this.

This pattern still ties consumption patterns to the way in which features are ingested though, right? This is something that still feels a bit odd to me.

It feels like a lot of these issues have a data engineering/data modelling type feel, e.g., how to structure your underlying data so that queries are performant.

I just wonder whether there is a one-size-fits-all solution to this that will work for everyone, or whether it would be possible to provide users with more flexibility in how they structure the underlying data. For example, allowing people to make intermediary tables themselves that contain the features they want and that will be synced to the underlying features. I think several people have suggested something similar above.

Again, I know this is probably tangential to this issue, so feel free to ignore me. I was just wondering if this is something you have considered.

woop commented 4 years ago

I just wonder whether there is a one-size-fits-all solution to this that will work for everyone, or whether it would be possible to provide users with more flexibility in how they structure the underlying data.

@tfurmston for the record I 100% agree that there is room for optimizing the data model here and to provide users with more flexibility. One way to do this is allow feature sets to be defined as materialized views.

name: features_that_I_will_retrieve
features:
- name: f1
  ref: fs1:f1
- name: f2 
  ref: fs2:f1

During ingestion we could write to these materialized views as well as the original feature set tables. However, I don't see too much value of doing this for historical data. I do think it would be useful for online serving. In the case of online serving though, it would probably require a read + write since data will be coming in separate events. Alternatively we could maintain state in the ingestion jobs to support this.
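
To illustrate the "read + write" point for online serving, here is a toy sketch in which a plain dict stands in for the online store. Each ingested event is written to its feature set table and then merged into the materialized view row for the same entity. The store layout, the `ingest` function, and the sample values are assumptions for illustration only.

```python
from typing import Dict, Tuple

# View definition mirroring the YAML above:
# view feature name -> (source feature set, source feature).
VIEW_NAME = "features_that_I_will_retrieve"
VIEW = {"f1": ("fs1", "f1"), "f2": ("fs2", "f1")}

# Toy online store: (table name, entity key) -> row of feature values.
store: Dict[Tuple[str, str], Dict[str, object]] = {}


def ingest(feature_set: str, entity_key: str, row: Dict[str, object]) -> None:
    # Write to the original feature set table.
    store[(feature_set, entity_key)] = dict(row)

    # Read the current view row, merge in the columns sourced from this
    # feature set, and write it back. Events for fs1 and fs2 arrive
    # separately, so the view row is assembled incrementally.
    view_row = dict(store.get((VIEW_NAME, entity_key), {}))
    for view_feature, (src_set, src_feature) in VIEW.items():
        if src_set == feature_set and src_feature in row:
            view_row[view_feature] = row[src_feature]
    store[(VIEW_NAME, entity_key)] = view_row


ingest("fs1", "driver_1", {"f1": 4.8})
ingest("fs2", "driver_1", {"f1": "sedan"})
view_row = store[(VIEW_NAME, "driver_1")]
```

At retrieval time the view would then need a single lookup per entity, at the cost of the extra read and write on every ingested event. A real store would also have to make the read-modify-write safe under concurrent writers, which this sketch ignores.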

tfurmston commented 4 years ago

I just wonder whether there is a one-size-fits-all solution to this that will work for everyone, or whether it would be possible to provide users with more flexibility in how they structure the underlying data.

@tfurmston for the record I 100% agree that there is room for optimizing the data model here and to provide users with more flexibility. One way to do this is allow feature sets to be defined as materialized views.

name: features_that_I_will_retrieve
features:
- name: f1
  ref: fs1:f1
- name: f2
  ref: fs2:f1

During ingestion we could write to these materialized views as well as the original feature set tables. However, I don't see too much value of doing this for historical data. I do think it would be useful for online serving. In the case of online serving though, it would probably require a read + write since data will be coming in separate events. Alternatively we could maintain state in the ingestion jobs to support this.

Perhaps that would work. To be honest, I still don't have enough usage of feast from the user perspective to know either way. Just from reading this thread, it does seem that there are issues with the data model. Hence my comments.

Re-reading @mrzzy's proposal from the 25th, i.e., grouping by entity, I think this makes a lot of sense. Maybe this would make a good default, and if it transpires that people need more flexibility, that can be addressed then.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

lucasbmiguel commented 3 years ago

hey folks, is there a decision here? I think the discussion is super relevant, any reason for the issue to be closed?