feast-dev / feast

The Open Source Feature Store for Machine Learning
https://feast.dev
Apache License 2.0

Entity types as a higher-level concept #405

Closed: woop closed this issue 4 years ago

woop commented 4 years ago

Introduction

Currently an entity, or more formally an entity type, is treated as a special type of field within a feature set. This was an attempt to simplify the creation and management of entities and to keep them consistent with features. However, our current approach has some challenges.

Note: The terms entity and entity type will be used interchangeably in this issue.

How are entities created?

How are entities used?

What is the problem?

  1. Discovery: It seems intuitive that users would start their discovery experience from the point of view of an entity type, since their business problem is generally framed around one or more entities. Nesting entities within feature sets and projects, without providing a way to discover them, makes discovery harder.
  2. Consistency: Entities are typically consistent across all projects and systems in most organizations. This consistency is not enforced in Feast at the moment. Users are bound to redefine entities in their local projects if no consistency is enforced at an organizational level. Failures would then occur during lookups or during joins across feature sets, especially joins across projects.
  3. Key building: If entities and features maintain mutual compatibility in terms of supported data types, then support must be maintained for building keys from all feature value types. This adds a lot of complexity to key building since support must be maintained to serialize complex composite data structures in order to build these keys.
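To illustrate point 3, here is a minimal sketch of how key building stays simple when entity values are restricted to scalar types. The function name `build_key`, the `project:name=value|...` layout, and the choice of `str` and `int` as the only allowed types are assumptions made for this example, not Feast's actual key format.

```python
from typing import Mapping, Union

# Assumption for this sketch: entity values are limited to simple scalars.
EntityValue = Union[str, int]


def build_key(project: str, entities: Mapping[str, EntityValue]) -> str:
    """Build a deterministic lookup key from a project and its entity values."""
    parts = []
    # Sort by entity name so the key does not depend on insertion order.
    for name in sorted(entities):
        value = entities[name]
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, (str, int)):
            raise TypeError(
                f"unsupported entity value type for {name!r}: {type(value).__name__}"
            )
        parts.append(f"{name}={value}")
    return f"{project}:" + "|".join(parts)
```

Supporting lists, maps, or nested structures here would require a canonical serialization for each of them, which is the complexity described above. The sketch also does not escape `|` or `=` inside string values, which a real implementation would need to handle.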

Proposals

1. Project-level entities

Functionality

Advantages

Disadvantages

2. Global-level entities

Functionality

Advantages

Disadvantages

3. Default project entities

Functionality

Advantages

Disadvantages

khorshuheng commented 4 years ago

Most projects would have to reference entities from another more authoritative project

What would be an example scenario where this approach is the most sensible? For Gojek at least, I would imagine that project-based entities make more sense: one project per service type (food, ride, gopay), each having entities which might share the same name (customer id, driver id).

woop commented 4 years ago

Most projects would have to reference entities from another more authoritative project

What would be an example scenario where this approach is the most sensible? For Gojek at least, I would imagine that project-based entities make more sense: one project per service type (food, ride, gopay), each having entities which might share the same name (customer id, driver id).

The example you are referring to would be for project-level entities. Meaning an organization could have authoritative projects like:

It seems to provide a cleaner isolation, but it is also the case that "users" would have to define their own projects and feature sets from which they would reference these authoritative entities.

So I am only seeing one option here, not two. The disadvantage comes from having to know which of these two projects to use.

woop commented 4 years ago

Another possible solution would be a hybrid model between global and project-level entities. I have added this as (3) in the comment above, titled "3. Default project entities".

khorshuheng commented 4 years ago

I am in favour of 3. Option 2 (unique global entity name) may lead to complicated entity management in some cases. For example, let's say we have drivers for different countries. Option 2 dictates that we cannot have the same entity for all countries (e.g. driver), but instead need multiple different entities (e.g. driver_vn, driver_th, driver_sg). It is likely that in an end-to-end machine learning workflow, the code involving drivers will be similar regardless of country (e.g. extracting the driver entity value from a JSON request during the prediction step). So, for option 2, the pipeline will need to know that driver_vn, driver_sg and driver_th all belong to the same group and should be handled the same way, which leads to extra configuration on the user side.

khorshuheng commented 4 years ago

Though, if we go for option 3, we might want to explore if the concept of default project should be extended to feature retrieval as well, for consistency. For example, if no project / default project has been set and project is not explicitly specified in feature ref, then the fallback would be the 'default' project.
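The fallback order described here could be sketched roughly as follows. The `project/feature` reference format, the `DEFAULT_PROJECT` constant, and the `resolve_feature_ref` helper are all hypothetical names for illustration and are not taken from the Feast SDK.

```python
from typing import Optional, Tuple

# Hypothetical name of the special fallback project.
DEFAULT_PROJECT = "default"


def resolve_feature_ref(
    feature_ref: str, client_project: Optional[str] = None
) -> Tuple[str, str]:
    """Return (project, feature_name) for a feature reference.

    Resolution order assumed in this sketch:
      1. a project given explicitly in the reference ("project/feature"),
      2. the project configured on the client, if any,
      3. the 'default' project.
    """
    if "/" in feature_ref:
        project, name = feature_ref.split("/", 1)
        return project, name
    return (client_project or DEFAULT_PROJECT), feature_ref
```

With this ordering, `resolve_feature_ref("driver_rating")` would resolve against the default project, while `resolve_feature_ref("ride/driver_rating")` would ignore both the client setting and the default.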

woop commented 4 years ago

I am in favour of 3. Option 2 (unique global entity name) may lead to complicated entity management in some cases. For example, let's say we have drivers for different countries. Option 2 dictates that we cannot have the same entity for all countries (e.g. driver), but instead need multiple different entities (e.g. driver_vn, driver_th, driver_sg). It is likely that in an end-to-end machine learning workflow, the code involving drivers will be similar regardless of country (e.g. extracting the driver entity value from a JSON request during the prediction step). So, for option 2, the pipeline will need to know that driver_vn, driver_sg and driver_th all belong to the same group and should be handled the same way, which leads to extra configuration on the user side.

It's not clear what you mean here. What prevents you from simply having driver as a global entity?

woop commented 4 years ago

Though, if we go for option 3, we might want to explore if the concept of default project should be extended to feature retrieval as well, for consistency. For example, if no project / default project has been set and project is not explicitly specified in feature ref, then the fallback would be the 'default' project.

Absolutely, that was my hope as well!

khorshuheng commented 4 years ago

I am in favour of 3. Option 2 (unique global entity name) may lead to complicated entity management in some cases. For example, let's say we have drivers for different countries. Option 2 dictates that we cannot have the same entity for all countries (e.g. driver), but instead need multiple different entities (e.g. driver_vn, driver_th, driver_sg). It is likely that in an end-to-end machine learning workflow, the code involving drivers will be similar regardless of country (e.g. extracting the driver entity value from a JSON request during the prediction step). So, for option 2, the pipeline will need to know that driver_vn, driver_sg and driver_th all belong to the same group and should be handled the same way, which leads to extra configuration on the user side.

It's not clear what you mean here. What prevents you from simply having driver as a global entity?

Actually, yeah, you are correct. I can just have driver in a global project instead of having the entity defined in each regional project. I was too entrenched in the code base that I am currently working on and didn't consider this possibility.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

woop commented 4 years ago

Moving this out of the 0.6 milestone because I think we can live without it for the time being.

dr3s commented 4 years ago

Isn't 3. the same as 1., just with a special project called default? The fact that there is a special default project doesn't change the fact that all entities are scoped to a project, i.e. 1. Right?

woop commented 4 years ago

Isn't 3. the same as 1., just with a special project called default? The fact that there is a special default project doesn't change the fact that all entities are scoped to a project, i.e. 1. Right?

Correct.

KshitizLohia commented 2 years ago

I believe that entity as a construct is increasing complexity in the system. What I fail to understand is how the notion of an entity helps in grouping semantically related features together (as per the definition of entity in the documentation). It also introduces more problems, since joins happen at a later point in time while the entity is defined at the start of the user experience.

A few questions:

  1. Shouldn't an entity just be a logical container specifying join keys? In which case, how can we specify join keys before the join operation? For instance, a join on entity A and entity B could use one join key, and a join on entity A and entity C could use another.
  2. How can we chain join operations and perform complex joins, for example ((A left join B) right join C)?
  3. How can we handle shadow mapping using entities? For example, the customer id of entity customer is linked to the user id of entity user.
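To make questions 1 and 3 concrete, here is a minimal sketch in plain Python of a join where the key is chosen per join rather than fixed by the entity definition. It does not show how Feast behaves; `left_join` and the sample rows are hypothetical, and right-hand keys are assumed to be unique.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def left_join(left: List[Row], right: List[Row], left_key: str, right_key: str) -> List[Row]:
    """Left join rows, matching left[left_key] against right[right_key]."""
    # Index the right-hand rows by their join key (assumed unique).
    index = {row[right_key]: row for row in right}
    joined = []
    for row in left:
        match = index.get(row[left_key], {})
        merged = dict(row)
        # Add right-hand columns, dropping the redundant right-hand key.
        merged.update({k: v for k, v in match.items() if k != right_key})
        joined.append(merged)
    return joined


# Shadow mapping: customer_id on one side is linked to user_id on the other.
customers = [{"customer_id": 1, "orders": 3}, {"customer_id": 2, "orders": 0}]
users = [{"user_id": 1, "age": 30}]
result = left_join(customers, users, left_key="customer_id", right_key="user_id")
```

Because the key pair is an argument of each join, A could be joined to B on one key and to C on another, and the output of one join can be fed into the next to chain operations as in question 2.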

I would like to hear others' suggestions on this!