Entity types as a higher-level concept

woop commented 4 years ago

Introduction

Currently an entity, or more formally an entity type, is treated as a special type of field within a feature set. There has been an attempt to simplify the creation and management of entities and to keep them consistent with features; however, some challenges exist with our current approach.

Note: The terms entity and entity type will be used interchangeably in this issue.

How are entities created?

Users define an entity as part of a feature set. An entity in this case is a field like any other within the feature set. More than one entity can exist within a feature set.
An entity's name must be unique within a feature set.
There are no constraints on entities outside of a feature set, either at the project or global level. This means that multiple feature sets can define the same entities again.
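A minimal sketch of the current model described above, assuming hypothetical Field and FeatureSet classes (these are illustrative only and are not Feast's API). It shows that uniqueness is only checked inside a single feature set, so two feature sets can define the same entity with different types:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Field:
    name: str
    dtype: str
    is_entity: bool = False  # an entity is just a flagged field


@dataclass
class FeatureSet:
    name: str
    fields: List[Field] = field(default_factory=list)

    def add(self, new_field: Field) -> None:
        # Entity names must be unique within one feature set.
        if new_field.is_entity and any(
            f.is_entity and f.name == new_field.name for f in self.fields
        ):
            raise ValueError(f"duplicate entity {new_field.name!r} in {self.name!r}")
        self.fields.append(new_field)

    def entities(self) -> List[Field]:
        return [f for f in self.fields if f.is_entity]


# Nothing stops two feature sets from redefining the same entity differently.
rides = FeatureSet("rides")
rides.add(Field("customer_id", "int64", is_entity=True))
payments = FeatureSet("payments")
payments.add(Field("customer_id", "string", is_entity=True))
```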

How are entities used?

Retrieving feature values: Entities are used as a key for retrieving features. In order to retrieve feature values within a feature set, all entities must be provided as part of the lookup.
Joining feature sets: In the event that feature values are being retrieved from multiple feature sets, entities are used to look up these feature values. Entities are also used to join across these feature sets to construct a single result set.
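The two uses above can be sketched as follows. This assumes a hypothetical in-memory store keyed by a tuple of entity values. The names lookup and join are made up for illustration and do not reflect how Feast serving is implemented:

```python
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]
Store = Dict[Tuple[Any, ...], Row]  # key: tuple of entity values


def lookup(store: Store, entity_names: Sequence[str], entity_row: Row) -> Row:
    # All entities of the feature set must be provided for the lookup.
    missing = [n for n in entity_names if n not in entity_row]
    if missing:
        raise KeyError(f"missing entity values: {missing}")
    key = tuple(entity_row[n] for n in entity_names)
    return store.get(key, {})


def join(feature_sets: List[Tuple[Sequence[str], Store]], entity_row: Row) -> Row:
    # Entities act as the join keys that stitch the per-feature-set
    # results into a single result row.
    result = dict(entity_row)
    for entity_names, store in feature_sets:
        result.update(lookup(store, entity_names, entity_row))
    return result


customer_fs = (("customer_id",), {(1,): {"total_orders": 5}})
trip_fs = (("customer_id", "driver_id"), {(1, 7): {"trips_together": 2}})
row = join([customer_fs, trip_fs], {"customer_id": 1, "driver_id": 7})
```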

What is the problem?

Discovery: It seems intuitive that users would start their discovery experience from the point of view of an entity type, since their business problem is generally framed around one or more entities. By nesting entities within feature sets and within projects, without providing a means of discovering them, we make discovery harder.
Consistency: Entities are typically consistent across all projects and systems in most organizations. This consistency is not enforced in Feast at the moment. Users are bound to redefine entities in their local projects if no consistency is enforced at an organizational level. Failure would occur when lookups happen or when joins happen across feature sets, especially when joins need to happen across projects.
Key building: If entities and features maintain mutual compatibility in terms of supported data types, then support must be maintained for building keys from all feature value types. This adds a lot of complexity to key building since support must be maintained to serialize complex composite data structures in order to build these keys.
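To illustrate the key building point, here is a rough sketch of what key construction could look like if entities were limited to a small set of scalar types. The key layout, the build_key name, and the allowed types are all assumptions made for this example. It is not Feast's actual key encoding, and it ignores escaping of separator characters:

```python
from typing import Any, Dict

# Assumption: entities are restricted to simple scalar types.
ALLOWED_KEY_TYPES = (int, str)


def build_key(project: str, feature_set: str, entity_row: Dict[str, Any]) -> str:
    parts = [project, feature_set]
    # Sort by entity name so the key does not depend on dict ordering.
    for name in sorted(entity_row):
        value = entity_row[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, ALLOWED_KEY_TYPES):
            raise TypeError(
                f"entity {name!r} has unsupported key type {type(value).__name__}"
            )
        parts.append(f"{name}={value}")
    return ":".join(parts)


key = build_key("gojek", "rides", {"driver_id": 7, "customer_id": 1})
```

With only scalars allowed, serialization stays trivial. Supporting lists, maps, or nested structures as entity values would require a canonical encoding for each of them.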

Proposals

1. Project-level entities

Functionality

Entities are created outside of feature sets, but they still reside in a specific project namespace.
Entities have their own distinct API and supported data types (which may be more limited than features).
Entities must be unique within a project namespace, but can be duplicated across an organization. Uniqueness is ensured through a full entity reference (gojek/customer).
Entities are still defined as part of a feature set, but this is a selection process instead of creation.
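A minimal sketch of proposal 1, assuming a hypothetical ProjectEntityRegistry class (not an existing Feast API). Entities are unique per project and addressed through a full reference such as gojek/customer:

```python
from typing import Dict, Tuple


class ProjectEntityRegistry:
    def __init__(self) -> None:
        # (project, entity name) -> data type
        self._entities: Dict[Tuple[str, str], str] = {}

    def register(self, project: str, name: str, dtype: str) -> None:
        key = (project, name)
        # Uniqueness is only enforced inside a project namespace.
        if key in self._entities:
            raise ValueError(f"entity {project}/{name} already exists")
        self._entities[key] = dtype

    def resolve(self, ref: str) -> str:
        # A full entity reference has the form "<project>/<name>".
        project, name = ref.split("/", 1)
        return self._entities[(project, name)]


registry = ProjectEntityRegistry()
registry.register("gojek", "customer", "int64")
registry.register("gopay", "customer", "string")  # allowed: different project
```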

Advantages

Entities receive all the sharing and isolation benefits of "projects". Entities would not have to be treated separately from a logical and/or development standpoint. There would also be no explosion of a global entity namespace.
Users are free to experiment and develop within their projects without affecting other users, since duplication is allowed across projects.
No need for a central team to gate-keep the creation of entities.

Disadvantages

By not elevating entities to the global level, end users would be required to know which projects contain the entities they should be referencing. This means an organizational process must exist in order to select these entities.
Most projects would have to reference entities from another more authoritative project. In fact, it's likely that an organization will have a central project which contains only entities. This could be a little counter-intuitive if a feature set contains fields that are referencing an external project.

2. Global-level entities

Functionality

Entities are defined globally for a Feast deployment.
Entities have their own distinct API and supported data types (which may be more limited than features).
Entities must be globally unique.
Entities are still defined as part of a feature set, but this is a selection process instead of creation.
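A rough sketch of proposal 2, assuming hypothetical create_entity and select_entities helpers (illustrative names, not Feast functions). Entities live in one deployment-wide namespace, and a feature set can only select entities that already exist:

```python
from typing import Dict, Iterable

# One namespace for the whole deployment: entity name -> data type.
GLOBAL_ENTITIES: Dict[str, str] = {}


def create_entity(name: str, dtype: str) -> None:
    # Names must be globally unique, so a second definition is rejected.
    if name in GLOBAL_ENTITIES:
        raise ValueError(f"entity {name!r} already exists globally")
    GLOBAL_ENTITIES[name] = dtype


def select_entities(names: Iterable[str]) -> Dict[str, str]:
    # Feature sets select existing entities instead of creating them.
    names = list(names)
    unknown = [n for n in names if n not in GLOBAL_ENTITIES]
    if unknown:
        raise KeyError(f"unknown entities: {unknown}")
    return {n: GLOBAL_ENTITIES[n] for n in names}


create_entity("customer", "int64")
create_entity("driver", "int64")
ride_entities = select_entities(["customer", "driver"])
```

A team that wants customer as a string cannot redefine it here. It would have to create a second, differently named entity.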

Advantages

Central authoritative listing of entities within an organization.
Easier to discover which entities should be used, without needing an organizational policy.
Easy to reason about and easier to understand when referencing an entity within a feature set.

Disadvantages

Requires development of separate logic from projects, feature sets, and features.
Requires a team and process to manage the creation of entities.
No way to isolate conflicts. If one team wants to use a float and another wants to use a string for an entity data type, then it would likely result in two entities being created. This would still be the case in the Project-level entity proposal, but at least in that proposal the unorthodox approach (maybe string) could be isolated to a specific project.

3. Default project entities

Functionality

If a user does not specify a project, then they are automatically located inside of the default project. This would be similar to how Kubernetes does namespacing.
All other functionality would be the same as in the project-level entities proposal, except that users don't actually have to create an entity inside of a named project.
Feature references could be created that allow users to reference entities without a project. So instead of having my_company/customer, it would be possible to refer to "global" entities by either using customer or default/customer.
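The reference resolution described above could look roughly like this. The parse_entity_ref function and the literal project name "default" are assumptions for the sake of the example:

```python
from typing import Tuple

DEFAULT_PROJECT = "default"  # assumed name of the fallback project


def parse_entity_ref(ref: str) -> Tuple[str, str]:
    """Split an entity reference into (project, entity name)."""
    project, sep, name = ref.rpartition("/")
    # Reject empty names, a leading "/", and more than one "/" separator.
    if not name or (sep and not project) or "/" in project:
        raise ValueError(f"malformed entity reference: {ref!r}")
    # A bare name such as "customer" falls back to the default project.
    return (project or DEFAULT_PROJECT, name)


bare = parse_entity_ref("customer")
explicit = parse_entity_ref("default/customer")
scoped = parse_entity_ref("my_company/customer")
```

Here customer and default/customer resolve to the same entity, while my_company/customer stays scoped to its own project.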

Advantages

All of the advantages of project-level entities.
Most of the advantages of global-level entities, except that this default project would still not be a true global namespace. There would still need to be an organizational process that informs users to use the entities in this project.
Simplifies development since project-level sharing and isolation can be reused.

Disadvantages

Still requires access control on the default namespace.

khorshuheng commented 4 years ago

> Most projects would have to reference entities from another more authoritative project

What would be an example scenario where this approach is the most sensible? For Gojek at least, I would imagine that project-based entities make more sense: one project per service type (food, ride, gopay), each having entities which might share the same name (customer id, driver id).

woop commented 4 years ago

> > Most projects would have to reference entities from another more authoritative project
>
> What would be an example scenario where this approach is the most sensible? For Gojek at least, I would imagine that project-based entities make more sense: one project per service type (food, ride, gopay), each having entities which might share the same name (customer id, driver id).

The example you are referring to would be for project-level entities, meaning an organization could have authoritative projects like:

gojek/customer
gopay/customer

It seems to provide a cleaner isolation, but it is also the case that "users" would have to define their own projects and feature sets from which they would reference these authoritative entities.

So I am only seeing one option here, not two. The disadvantage comes from having to know which of these two projects to use.

woop commented 4 years ago

Another possible solution would be a hybrid model between global and project-level entities. I have added this as (3) in the comment above, titled "3. Default project entities".

khorshuheng commented 4 years ago

I am in favour of 3. Option 2 (unique global entity names) may lead to complicated entity management in some cases. For example, let's say we have drivers in different countries. Option 2 dictates that we cannot have the same entity for all countries (e.g. driver), but instead need multiple different entities (e.g. driver_vn, driver_th, driver_sg). It is likely that in an end-to-end machine learning workflow, the code involving drivers will be similar regardless of country (e.g. extracting the driver entity value from a JSON request during the prediction step). So, for option 2, the pipeline will need to know that driver_vn, driver_sg and driver_th all belong to the same group and should be handled the same way, which leads to extra configuration on the user side.

khorshuheng commented 4 years ago

Though, if we go for option 3, we might want to explore if the concept of default project should be extended to feature retrieval as well, for consistency. For example, if no project / default project has been set and project is not explicitly specified in feature ref, then the fallback would be the 'default' project.

woop commented 4 years ago

> I am in favour of 3. Option 2 (unique global entity names) may lead to complicated entity management in some cases. For example, let's say we have drivers in different countries. Option 2 dictates that we cannot have the same entity for all countries (e.g. driver), but instead need multiple different entities (e.g. driver_vn, driver_th, driver_sg). It is likely that in an end-to-end machine learning workflow, the code involving drivers will be similar regardless of country (e.g. extracting the driver entity value from a JSON request during the prediction step). So, for option 2, the pipeline will need to know that driver_vn, driver_sg and driver_th all belong to the same group and should be handled the same way, which leads to extra configuration on the user side.

It's not clear what you mean here. What prevents you from simply having driver as a global entity?

woop commented 4 years ago

> Though, if we go for option 3, we might want to explore if the concept of default project should be extended to feature retrieval as well, for consistency. For example, if no project / default project has been set and project is not explicitly specified in feature ref, then the fallback would be the 'default' project.

Absolutely, that was my hope as well!

khorshuheng commented 4 years ago

> > I am in favour of 3. Option 2 (unique global entity names) may lead to complicated entity management in some cases. For example, let's say we have drivers in different countries. Option 2 dictates that we cannot have the same entity for all countries (e.g. driver), but instead need multiple different entities (e.g. driver_vn, driver_th, driver_sg). It is likely that in an end-to-end machine learning workflow, the code involving drivers will be similar regardless of country (e.g. extracting the driver entity value from a JSON request during the prediction step). So, for option 2, the pipeline will need to know that driver_vn, driver_sg and driver_th all belong to the same group and should be handled the same way, which leads to extra configuration on the user side.
>
> It's not clear what you mean here. What prevents you from simply having driver as a global entity?

Actually, yeah, you are correct. I can just have driver in a global project instead of having the entity defined in each regional project. I was too entrenched in the code base that I am currently working on and didn't consider this possibility.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

woop commented 4 years ago

Moving this out of the 0.6 milestone because I think we can live without it for the time being.

dr3s commented 4 years ago

Isn't 3. the same as 1. with just a special project called default? The fact that there is a special default project doesn't change the fact that all entities are scoped to a project, i.e. option 1, right?

woop commented 4 years ago

> Isn't 3. the same as 1. with just a special project called default? The fact that there is a special default project doesn't change the fact that all entities are scoped to a project, i.e. option 1, right?

Correct.

KshitizLohia commented 2 years ago

I believe that entity as a construct is increasing complexity in the system. What I fail to understand is how the notion of an entity helps in grouping semantically related features together (as per the definition of entity in the documentation). It also introduces more problems, since joins happen at a later point in time while the entity is defined at the start of the user experience.

A few questions:

Shouldn't an entity just be a logical container specifying join keys? In that case, how can we specify join keys before the join operation? For instance, a join on entity A and entity B could use one join key, while a join on entity A and entity C could use another.
How can we chain join operations and perform complex joins, for example ((A left join B) right join C)?
How can we handle shadow mapping using entities? For example, the customer id of entity customer is linked to the user id of entity user.

I would like to hear others' suggestions on this!
