feast-dev / feast-java-old

Feast Java Components
Apache License 2.0

[Discuss] Feast Enhancements #6

Open aaraujo-sfdc opened 3 years ago

aaraujo-sfdc commented 3 years ago

My team has been evaluating Feast for adoption and has identified a few improvements we'd like to contribute for our use cases. @woop suggested starting a thread for discussion to make sure they are a good fit.

Support for SPI extensions

We'd like to add extension points for our existing infrastructure for things like:

  • FeatureTable change detection (notifying existing data pipeline and catalog services when a feature table is created or modified)
  • Implementing online storage operations for our store (HBase)

Multi-tenancy support

Our infrastructure is entirely multi-tenant. We authorize API calls and store data on a per-tenant basis. On the surface, we'd need something like the following in Feast:
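For illustration, a rough sketch of the shape this could take (the interface and type names below are placeholders, not an existing Feast API):

```java
import java.util.List;
import java.util.Map;

// Placeholder sketch only -- none of these names exist in Feast today.
// The idea: every online read/write carries a tenant id alongside the project,
// so a single deployment can keep per-tenant data (and authorization) separate.
public interface TenantAwareOnlineStore {

  /** Read feature values for entity rows that belong to a single tenant. */
  List<Map<String, Object>> getOnlineFeatures(
      String tenantId,
      String project,
      String featureTable,
      List<Map<String, Object>> entityRows,
      List<String> featureNames);

  /** Write feature values for entity rows that belong to a single tenant. */
  void writeOnlineFeatures(
      String tenantId,
      String project,
      String featureTable,
      List<Map<String, Object>> rows);
}
```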

Direct online write support

We have use cases that would like to write to the feature store directly. Adding an ApplyOnlineFeatures API (+SDK support) would satisfy these use cases. The API would semantically resemble the GetOnlineFeaturesV2 API (write feature table values for specific entity rows).
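To make the shape concrete, a hypothetical sketch of the write call (names invented for illustration; this is not an existing API):

```java
import java.time.Instant;
import java.util.Map;

// Hypothetical sketch -- ApplyOnlineFeatures does not exist yet; the names here
// are invented to show the shape: write feature values for specific entity rows.
public interface OnlineWriteClient {

  /** Write one row of feature values to a feature table, keyed by entity values. */
  void applyOnlineFeatures(
      String project,
      String featureTable,
      Map<String, Object> entityKeys,     // e.g. {"driver_id": 123}
      Map<String, Object> featureValues,  // e.g. {"trips_today": 7}
      Instant eventTimestamp);
}
```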

Delete online feature support

We maintain GDPR compliance by propagating delete record signals from upstream data systems. Essentially we'd need a DeleteOnlineFeatures API (+SDK support) that resembles GetOnlineFeaturesV2 and ApplyOnlineFeatures (delete feature table values for specific entity rows).
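Again purely as an illustration (invented names), the delete call would mirror the write sketch above but remove values instead of setting them:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch -- mirrors the write API above, but removes all stored
// feature values for the given entity rows (e.g. for GDPR delete propagation).
public interface OnlineDeleteClient {

  /** Delete stored feature values for the given entity rows in a feature table. */
  void deleteOnlineFeatures(
      String project,
      String featureTable,
      List<Map<String, Object>> entityRows); // e.g. [{"driver_id": 123}]
}
```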

If these seem like reasonable enhancements to Feast we can file individual issues and PRs to contribute these incrementally.

woop commented 3 years ago

Thanks for kicking this issue off @aaraujo-sfdc!

Support for SPI extensions We'd like to add extension points for our existing infrastructure for things like:

FeatureTable change detection (notifying existing data pipeline and catalog services when a feature table is created or modified)

In an attempt to play devil's advocate: Would it be possible (even simpler) to emit events (maybe using open tracing) from Feast Core whenever there is a state change?

Implementing online storage operations for our store (HBase)

To be clear, you want both a reader (serving) and a writer (spark), right? I can definitely see the case for SPI here.

Multi-tenancy support Our infrastructure is entirely multi-tenant. We authorize API calls and store data on a per-tenant basis.

I've not thought deeply about this, but would this be specific to your use case? I don't really see other folks needing this functionality. I guess the question just becomes how this can be implemented to serve your use case only.

Direct online write support We have use cases that would like to write to the feature store directly. Adding an ApplyOnlineFeatures API (+SDK support) would satisfy these use cases. The API would semantically resemble the GetOnlineFeaturesV2 API (write feature table values for specific entity rows).

We call this a PushAPI at Tecton. ApplyOnlineFeatures seems quite overlapping with our existing Apply functionality which is purely for schemas/specs. We could do Set/Write/Update/Push as the name.

We are trying to be more specific about our data contracts, which should help make it easy to add new components.

Delete online feature support We maintain GDPR compliance by propagating delete record signals from upstream data systems. Essentially we'd need a DeleteOnlineFeatures API (+SDK support) that resembles GetOnlineFeaturesV2 and ApplyOnlineFeatures (delete feature table values for specific entity rows).

Seems reasonable.

FYI: Long term we are planning to slowly move towards Go for Serving. We probably won't pick that work up in the next couple of months, but we think we can build a simpler and easier-to-maintain serving API. I raise this because we have two options here. The pragmatic approach is to extend the Serving code base and reuse storage connectors there (or SPI implementations). Another approach is to build the Delete/Push API as a Go service and phase functionality over to it over time.

Happy to hear your thoughts.

aaraujo-sfdc commented 3 years ago

Thanks for looking over this @woop, it's quite a lot to digest.

Would it be possible (even simpler) to emit events (maybe using open tracing) from Feast Core whenever there is a state change?

Likely possible, but not sure if this simplifies things for us. A hook we could implement within Feast Core would allow us to synchronously notify existing systems using their APIs without introducing additional components or re-working those systems. Since it would be synchronous, we also don't have to implement state management to make sure external systems successfully processed these events. End users would immediately know that the overall system could not process their request.

Implementing online storage operations for our store (HBase)

To be clear, you want both a reader (serving) and a writer (spark), right? I can definitely see the case for SPI here.

Reader and writer, yes. Initially we'd do both through serving (via the Get/Push APIs). Writing through serving initially would allow us to support pipeline ingestion and direct user writes with minimal work. Down the road we'd likely look into bypassing serving for pipeline ingestion by writing directly from Spark -> HBase. Phrased differently, initially we only need SPIs for the storage operations in Feast Serving. Later we'd likely want to do similar work for Feast ingestion (historical and streaming).

Multi-tenancy support Our infrastructure is entirely multi-tenant. We authorize API calls and store data on a per-tenant basis.

I've not thought deeply about this, but would this be specific to your use case? I don't really see other folks needing this functionality. I guess the question just becomes how this can be implemented to serve your use case only.

I would think storing data for multiple tenants is fairly common. An entity in Feast could belong to different tenants, and a TenantID would disambiguate between one tenant's EntityID=123 vs another's. Implementing multi-tenancy would require support for "multi-tenant" feature tables (specifying a TenantSpec when creating a feature table) as well as specifying a tenant for read/write operations. I don't see how we could implement this just for our use case since it needs API support -> storage support. However, it would be optional and backwards compatible (since it's an addition).

ApplyOnlineFeatures seems quite overlapping with our existing Apply functionality which is purely for schemas/specs. We could do Set/Write/Update/Push as the name.

Push/Write/etc. sounds good. I was using the existing convention, which as you pointed out, makes more sense for schemas/specs.

We are trying to be more specific about our data contracts, which should help make it easy to add new components.

Is that a hard data contract or more of a storage specification for the Redis implementation? Our storage implementation would take the API proto types and map them to SQL types (we use Apache Phoenix on top of HBase). This would allow us to view/audit data with standard SQL tools. We would also have one HBase table for each (project, table_name) tuple as opposed to one table with a (project, entity_name, table_name, feature_name) hierarchy.
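As a rough illustration of that layout (table and column names invented, and assuming the Phoenix JDBC client is on the classpath), each feature table would get its own SQL table:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Illustration only: one Phoenix table per (project, table_name) tuple, with the
// entity key columns as the primary key and one SQL-typed column per feature,
// so the data can be inspected and audited with standard SQL tools.
public final class PhoenixLayoutSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181");
         Statement stmt = conn.createStatement()) {
      stmt.execute(
          "CREATE TABLE IF NOT EXISTS MYPROJECT.DRIVER_HOURLY_STATS ("
              + " DRIVER_ID BIGINT NOT NULL,"      // entity key
              + " EVENT_TS TIMESTAMP NOT NULL,"    // event timestamp
              + " TRIPS_TODAY INTEGER,"            // INT32 feature value
              + " AVG_RATING DOUBLE,"              // DOUBLE feature value
              + " CONSTRAINT PK PRIMARY KEY (DRIVER_ID, EVENT_TS))");
    }
  }
}
```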

Long term we are planning to slowly move towards Go for Serving.

Interesting. Same API delivered as a Go service instead of Java? Our users are strictly against doing any sort of service migration for their apps, so it would have to be a simple drop-in replacement for us.

The pragmatic approach is to extend the Serving code base and reuse storage connectors there (or SPI implementations).

How would you reuse Java storage connectors (or SPI implementations) in Go? It seems we'd need to reimplement them.

Another approach is to build the Delete/Push API as a Go service and phase functionality over to it over time.

Wouldn't that mean running two Serving services + routing calls they each support until the Go service has everything?

woop commented 3 years ago

Thanks for looking over this @woop, it's quite a lot to digest.

Would it be possible (even simpler) to emit events (maybe using open tracing) from Feast Core whenever there is a state change?

Likely possible, but not sure if this simplifies things for us. A hook we could implement within Feast Core would allow us to synchronously notify existing systems using their APIs without introducing additional components or re-working those systems. Since it would be synchronous, we also don't have to implement state management to make sure external systems successfully processed these events. End users would immediately know that the overall system could not process their request.

I think the system can still be synchronous. What I am trying to optimize for is purely maintainability. We have limited experience with SPI so we'd be relying on your experience in implementing it.

Implementing online storage operations for our store (HBase)

To be clear, you want both a reader (serving) and a writer (spark), right? I can definitely see the case for SPI here.

Reader and writer, yes. Initially we'd do both through serving (via the Get/Push APIs). Writing through serving initially would allow us to support pipeline ingestion and direct user writes with minimal work. Down the road we'd likely look into bypassing serving for pipeline ingestion by writing directly from Spark -> HBase. Phrased differently, initially we only need SPIs for the storage operations in Feast Serving. Later we'd likely want to do similar work for Feast ingestion (historical and streaming).

Multi-tenancy support Our infrastructure is entirely multi-tenant. We authorize API calls and store data on a per-tenant basis.

I've not thought deeply about this, but would this be specific to your use case? I don't really see other folks needing this functionality. I guess the question just becomes how this can be implemented to serve your use case only.

I would think storing data for multiple tenants is fairly common. An entity in Feast could belong to different tenants, and a TenantID would disambiguate between one tenant's EntityID=123 vs another's. Implementing multi-tenancy would require support for "multi-tenant" feature tables (specifying a TenantSpec when creating a feature table) as well as specifying a tenant for read/write operations. I don't see how we could implement this just for our use case since it needs API support -> storage support. However, it would be optional and backwards compatible (since it's an addition).

Do you think there is a way to leverage labels for the TenantID without requiring a TenantSpec? It seems like the only thing that would need to be added here is a way to affect storage through the feature table specification.

ApplyOnlineFeatures seems quite overlapping with our existing Apply functionality which is purely for schemas/specs. We could do Set/Write/Update/Push as the name.

Push/Write/etc. sounds good. I was using the existing convention, which as you pointed out, makes more sense for schemas/specs.

We are trying to be more specific about our data contracts, which should help make it easy to add new components.

Is that a hard data contract or more of a storage specification for the Redis implementation? Our storage implementation would take the API proto types and map them to SQL types (we use Apache Phoenix on top of HBase). This would allow us to view/audit data with standard SQL tools. We would also have one HBase table for each (project, table_name) tuple as opposed to one table with a (project, entity_name, table_name, feature_name) hierarchy.

Both a storage specification for the Redis implementation as well as a specification for general K/V storage. It would not cover RDBMS storage.

Long term we are planning to slowly move towards Go for Serving.

Interesting. Same API delivered as a Go service instead of Java? Our users are strictly against doing any sort of service migration for their apps, so it would have to be a simple drop-in replacement for us.

Correct, drop-in.

The pragmatic approach is to extend the Serving code base and reuse storage connectors there (or SPI implementations).

How would you reuse Java storage connectors (or SPI implementations) in Go? It seems we'd need to reimplement them.

I meant either or. Either take the SPI route or start with Go. If we start with the SPI route then moving to Go would require a reimplementation.

Another approach is to build the Delete/Push API as a Go service and phase functionality over to it over time.

Wouldn't that mean running two Serving services + routing calls they each support until the Go service has everything?

I actually think there is a strong case for deploying these services separately in any case (even if they share a code base). The life cycle is different for delete, write, and read, and I think in most cases the capacity required would be different as well. I also think it's easier to reason about a single writer to a store than multiple.

My intuition is to have one deployment for services that mutate state (update, insert, delete) and one for reading.

aaraujo commented 3 years ago

I think the system can still be synchronous. What I am trying to optimize for is purely maintainability. We have limited experience with SPI so we'd be relying on your experience in implementing it.

If we define a generic interface for this and the default implementation is a no-op, the maintenance overhead should be minimal to none. Probably good to carve this out into a separate issue where we can propose a design and continue there.
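Something along these lines (names invented; just to show the no-op default idea):

```java
// Sketch of the kind of extension point discussed: a listener that Feast Core
// invokes synchronously on spec changes. With no-op defaults, the built-in
// behaviour is unchanged unless an implementation is plugged in.
public interface FeatureTableChangeListener {

  /** Called after a feature table is created; does nothing by default. */
  default void onFeatureTableCreated(String project, String featureTableName) {}

  /** Called after a feature table spec is updated; does nothing by default. */
  default void onFeatureTableUpdated(String project, String featureTableName) {}
}
```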

Do you think there is a way to leverage labels for the TenantID without requiring a TenantSpec? It seems like the only thing that would need to be added here is a way to affect storage through the feature table specification.

We might be able to define the feature table specification using labels, but since a feature table would store data for all tenants the TenantId would also need to be set when reading or writing from/to the feature table. Is there a generic existing way to set a TenantId attribute via ingestion and serving APIs? We can carve out a new issue for this and propose a design if that helps.
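A very rough sketch of the labels-based variant (all names invented):

```java
import java.util.Map;

// Sketch only: the feature table opts into multi-tenancy via a label on its
// spec, and the tenant id travels with each serving/ingestion request (for
// example as a request field or gRPC metadata header) rather than living in
// the feature table spec itself.
public final class TenantLabelSketch {

  /** Label that would mark a feature table spec as tenant-scoped. */
  static final Map<String, String> FEATURE_TABLE_LABELS = Map.of("multi_tenant", "true");

  /** Attribute carried by every read/write request against such a table. */
  static final String TENANT_ID_ATTRIBUTE = "tenant_id";
}
```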

Both a storage specification for the Redis implementation as well as a specification for general K/V storage.

In that case I think it needs to be reworked a bit to separate the "applies to all K/V storage" portions from the "applies only to Redis" portions. For example, this is specific to Redis: When feature data is stored in Redis, we use it as a two-level map, by utilizing Redis Hashes. I don't think you can do that in most K/Vs.
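For example, the two-level layout relies on Redis Hashes specifically (shown here with Jedis; key and field names are made up and do not reflect Feast's actual encoding):

```java
import redis.clients.jedis.Jedis;

// Illustration of the Redis-specific "two-level map": the outer key identifies
// the entity, and each hash field holds one feature. A plain K/V store with no
// hash type cannot express this second level directly.
public final class RedisHashLayoutExample {
  public static void main(String[] args) {
    try (Jedis jedis = new Jedis("localhost", 6379)) {
      String entityKey = "myproject:driver_id=123";   // level 1: entity key
      jedis.hset(entityKey, "trips_today", "7");      // level 2: one field per feature
      jedis.hset(entityKey, "avg_rating", "4.8");
      System.out.println(jedis.hgetAll(entityKey));   // {trips_today=7, avg_rating=4.8}
    }
  }
}
```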

it would not cover RDBMS storage.

Our online provider implementation would be built using Apache Phoenix, which is closer to an RDBMS API than a K/V store API. So perhaps a higher level specification would be needed to cover both.

I meant either or. Either take the SPI route or start with Go. If we start with the SPI route then moving to Go would require a reimplementation.

Given our timeline + Feast's timeline for the Go API we'd need to take the SPI route to begin with.

The life cycle is different for delete, write, and read, and I think in most cases the capacity required would be different as well. I also think it's easier to reason about a single writer to a store than multiple.

My intuition is to have one deployment for services that mutate state (update, insert, delete) and one for reading.

Sounds like we have options for the new APIs:

  • Define them in the existing ServingService proto and implement them only in the existing Java service (Go would require new proto service defs)
  • Define them in a new gRPC service for both Java and Go. Implement them only in the existing Java service now
  • Define them in a new gRPC service for both Java and Go. Implement them only in a new Java service now

They all seem reasonable, but given that Java serving/ingestion will be deprecated at some point, we'd prefer an option that does not require an additional Java micro-service. Thoughts?

woop commented 3 years ago

I think the system can still be synchronous. What I am trying to optimize for is purely maintainability. We have limited experience with SPI so we'd be relying on your experience in implementing it.

If we define a generic interface for this and the default implementation is a no-op, the maintenance overhead should be minimal to none. Probably good to carve this out into a separate issue where we can propose a design and continue there.

Sounds good.

Do you think there is a way to leverage labels for the TenantID without requiring a TenantSpec? It seems like the only thing that would need to be added here is a way to affect storage through the feature table specification.

We might be able to define the feature table specification using labels, but since a feature table would store data for all tenants the TenantId would also need to be set when reading or writing from/to the feature table. Is there a generic existing way to set a TenantId attribute via ingestion and serving APIs? We can carve out a new issue for this and propose a design if that helps.

Yea, I am not 100% sure what that ingest/serving API would look like to be honest. If you could create a basic sketch of what that would look like it would be great.

Both a storage specification for the Redis implementation as well as a specification for general K/V storage.

In that case I think it needs to be reworked a bit to separate the "applies to all K/V storage" portions from the "applies only to Redis" portions. For example, this is specific to Redis: When feature data is stored in Redis, we use it as a two-level map, by utilizing Redis Hashes. I don't think you can do that in most K/Vs.

it would not cover RDBMS storage.

Yes, that's correct.

Our online provider implementation would be built using Apache Phoenix, which is closer to an RDBMS API than a K/V store API. So perhaps a higher level specification would be needed to cover both.

I meant either or. Either take the SPI route or start with Go. If we start with the SPI route then moving to Go would require a reimplementation.

Given our timeline + Feast's timeline for the Go API we'd need to take the SPI route to begin with.

Fair enough.

The life cycle is different for delete, write, and read, and I think in most cases the capacity required would be different as well. I also think it's easier to reason about a single writer to a store than multiple.

My intuition is to have one deployment for services that mutate state (update, insert, delete) and one for reading.

Sounds like we have options for the new APIs:

  • Define them in the existing ServingService proto and implement them only in the existing Java service (Go would require new proto service defs)
  • Define them in a new gRPC service for both Java and Go. Implement them only in the existing Java service now
  • Define them in a new gRPC service for both Java and Go. Implement them only in a new Java service now

They all seem reasonable, but given that Java serving/ingestion will be deprecated at some point, we'd prefer an option that does not require an additional Java micro-service. Thoughts?

It seems like the most pragmatic course is then the one you originally laid out, which is to evolve the Serving service to add this functionality. This is probably the least amount of work.

From a purely selfish perspective I'd want us to build out this functionality in a Go service, because I do see us needing that pretty soon (but probably not for 2-3 months). We'll need both the reading and writing functionality, but I am not sure about deletion yet. I understand the reasons not to take this course though.