feast-dev / feast

The Open Source Feature Store for Machine Learning
https://feast.dev
Apache License 2.0

Remote offline feature server deployment #4032

Closed tokoko closed 4 months ago

tokoko commented 7 months ago

Is your feature request related to a problem? Please describe.

Currently Feast has the capability to deploy both the online store (feature server) and the registry as standalone services that the feast library can access remotely. I'd like to see the same capability for the offline store. The primary goal of this feature is to enable fully remote deployment of all feast components and, in the future, a common enterprise-level security model for all three components.

Describe the solution you'd like

There are two possible general solutions I have in mind:

  1. Arrow Flight Server Deployment

We can deploy an offline feature server as an Apache Arrow Flight server that wraps calls to offline store implementations and exposes the interface methods as Arrow Flight endpoints. This is also something that Hopsworks seems to have implemented before; they are using a duckdb engine behind a Flight server.

  2. Metadata Sharing Server Deployment (similar to Delta Sharing)

The offline feature server can be deployed as a lightweight REST or gRPC server that only handles requests and metadata and uses cloud filesystems for data transfer to and from the user. For example, in the case of S3, the feature server would write out the results as a parquet dataset to a folder in an S3 bucket, generate presigned URLs for it, and return the list of URLs to the caller. The idea is to emulate the mechanics of Delta Sharing but avoid its limitations (in delta-sharing, the dataset needs to be in Delta format and metadata responses must be immediate, which is obviously not an option for feast).
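To make the second option concrete, here is a minimal sketch of the metadata response the server could return. The function name, response shape, and the `sign` callable are all illustrative assumptions, not Feast APIs; with boto3, `sign` could wrap `s3_client.generate_presigned_url("get_object", ...)`.

```python
# Hypothetical option-2 flow: the server has already written the result
# set as a parquet dataset to object storage; instead of streaming data,
# it returns presigned URLs for the written files.
from typing import Callable, Dict, List

def build_metadata_response(parquet_keys: List[str],
                            sign: Callable[[str], str],
                            expires_in: int = 3600) -> Dict:
    """Build the metadata payload the caller receives (names are illustrative)."""
    return {
        "format": "parquet",
        "expires_in": expires_in,
        "files": [{"key": key, "url": sign(key)} for key in parquet_keys],
    }
```

The client would then download each URL directly from the cloud filesystem, so the server never proxies the data itself.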

In both scenarios, client-side code should be virtually indistinguishable from other offline store implementations. The particulars of the implementation will be hidden behind a RemoteOfflineStore class.

dmartinol commented 6 months ago

Hi @tokoko, wrt the first option: 1- When you write client-side code should be virtually indistinguishable, does it mean that when the client invokes fs.get_historical_features(...) the call is transparently forwarded to the Arrow server? (*) I assume this can happen because the client feature store has an offline store configured in a similar manner:

project: mystore
provider: local
offline_store:
    type: arrow # or remote, or similar
    host: mystore.feast.svc.cluster.local
    port: 8815

Can you confirm it matches, at least in general terms, your initial idea? An alternative could be to define an ad-hoc RemoteFeatureStore that avoids the need for such configuration: fs = RemoteFeatureStore(host='mystore.feast.svc.cluster.local', port=8815)

2- If the Arrow server wraps calls to offline store implementations, I assume that Hopsworks' usage of the duckdb engine is not needed in our case: it was meant to replace and optimize the original offline store provider, but it would require a specific implementation for each of them. WDYT? Maybe we can have a separate ticket to optimize the queries instead?

3- Finally, why shouldn't we instead extend the feature server to add an endpoint for the offline features? Is it just a matter of the scale of the data being processed, or are there other factors at play? The reason for asking is that we're going to introduce another service into the architecture, possibly with multiple instances in the same cluster in the case of different repos, and this complexity could confuse users regarding the chosen direction.

(*) To generalize this concept even further, and open the door to applying a consistent security model to the SDK clients, we could also think about implementing RemoteOnlineStore and RemoteRegistryStore the same way (based on either the Arrow server or the existing feature server), to simplify the interaction with the remote server(s) and have a unified client API hiding the complexity of the REST/gRPC calls.

tokoko commented 6 months ago

Hey, thanks for getting involved here :)

  1. Yes, that's exactly what I mean. The user sets the store type to remote (for example) and the FeatureStore client forwards the requests to a remote server rather than executing them directly. Agreed on the "generalize" point as well; I think we should have remote options for all three components (online store, offline store, registry):

    • We obviously already have a feature server and I have a ticket #4121 for a remote client as well. (I may be wrong, but I think Lokesh is working on this.)
    • We actually already have a grpc registry server (#3924 it's still read-only at this point though) and remote client for it (#3941)
    • Once we close this ticket, we should have some sort of offline server with its accompanying remote client implementation.
  2. Yeah, duckdb is just a choice that they made. feast, unlike hopsworks, tries to cover as many different configurations as possible (for better or worse 😄) and lets the user choose whatever fits their needs. Any performance optimizations should happen separately at the engine level. In our case, I think the offline server administrator should have the option to set the offline store type to any of the supported engines (incl. duckdb), and that will be the one the flight server forwards the requests to.

  3. imho that's the result of the inherent duality of feature stores in general; the whole point is to be a bridge between the offline world and the online one, after all. Even if we ignore technical difficulties, I think the usage pattern is very different between these components: the online store is almost always queried by some software application and is usually very critical in terms of latency and availability, while the offline store is sometimes queried by periodic batch production jobs but is mainly in service of data scientists running their experiments on their own schedule. That's just the conceptual part... From a more technical standpoint, the scale is the first obvious difference. Another problem is that we actually already have two feature servers (http and grpc). Which one would we use as a base? We might have to abandon one of them if we take a "single server" route. Also, I've tried in the past to combine a flight server with normal protobuf rpc service definitions (unrelated to feast) and failed miserably. It's probably doable, but at the cost of going against the way 99% of people out there do these things.

redhatHameed commented 6 months ago

Just one thought: for ease of use, the SDK or remote setup could keep the same feature_store.yaml file and simply add a flag to indicate whether it's a remote server. For example:

offline_store:
  is_remote_server: true  # Flag to indicate whether it's a remote server
  type: postgres
  host: localhost
  port: 5432

The feature_store.yaml content can be encoded into base64 and stored as an environment variable for the offline store server, as is done for the online store.
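The encode/decode round trip suggested here is straightforward; a small sketch, with the environment variable name an assumption mirroring the online feature server's container convention:

```python
# Encode feature_store.yaml into an env var on the deploy side, decode it
# on the server side before bootstrapping the store.
import base64
import os

yaml_text = """\
offline_store:
  type: postgres
  host: localhost
  port: 5432
"""

# Deploy side: stash the config in the environment (variable name assumed).
os.environ["FEATURE_STORE_YAML_BASE64"] = base64.b64encode(
    yaml_text.encode()).decode()

# Server side: recover the original yaml.
decoded = base64.b64decode(os.environ["FEATURE_STORE_YAML_BASE64"]).decode()
```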

tokoko commented 6 months ago

@redhatHameed I'm not sure the flag is necessary. If we replicate the way the remote registry works, on the server side feature_store.yaml will look identical to how it looks now. The only difference is that someone will need to run a command like feast serve offline (just an example, something like this) to start the server. (For a k8s deployment, one would set a command field to ["feast", "serve", "offline"]):

offline_store:
  type: postgres
  host: localhost
  port: 5432

On the client side, the user will have to configure their client to query the server; it should have no knowledge whatsoever about what type of offline store the server is wrapping. So feature_store.yaml for the client will look something like this:

offline_store:
  type: remote-flight
  host: mystore.feast.svc.cluster.local
  port: 8815

redhatHameed commented 6 months ago

@tokoko Thanks for clarifying, that makes sense.

dmartinol commented 6 months ago

@tokoko just to validate our understanding, is the expectation to implement all the methods of the OfflineStore interface (currently 5) in both the server and the client?

tokoko commented 6 months ago

@dmartinol yes, but let me go into a bit more detail just to make sure we're on the same page. The first three of those five methods are lazy; in other words, they don't return datasets themselves, they return RetrievalJob objects that can be used by the user to get the dataset. So while yes, there should be a way to invoke GetFlightInfo on the server for all 3 of them (with dedicated proto message types, I guess), that invocation should only happen when the user does something like the following -> store.get_historical_features(...).to_arrow() // or .to_df()

If the user instead opts to call the to_remote_storage() method on the RetrievalJob object, then the client implementation will probably have to call the server with similar message types, but on the DoAction endpoint instead of GetFlightInfo, as there's no data to be returned directly.

dmartinol commented 6 months ago

Agreed, we must respect the lazy behavior. @redhatHameed we need to experiment on this one as a first step.

redhatHameed commented 6 months ago

@tokoko one more clarification: what do you think about the materialization functionality? Will this be part of the remote offline store deployment? The Feast documentation mentions it as primary functionality.

tokoko commented 6 months ago

@dmartinol I think starting with pull_latest_from_table_or_query might be the easiest way to get the basic functionality, as there's no need to upload anything to the server.

@redhatHameed that's a very good point. I haven't thought much about it, but I think it should depend on which materialization engine the user has configured.

redhatHameed commented 6 months ago

Thanks @tokoko in that case we can create a new issue to address the implementation of the RemoteMaterializationEngine, while keeping the focus of this ticket on implementing the RemoteOfflineStore. cc @dmartinol

dmartinol commented 6 months ago

@tokoko, I see that all the implementations of BatchMaterializationEngine already fetch the offline data using the OfflineStore interface, as in:

        offline_job = self.offline_store.pull_latest_from_table_or_query(...)

That said, if the client doing the materialization already has a remote offline_store, I assume that materialize would work through the flight server, correct? What else would a RemoteMaterializationEngine add in this case?

tokoko commented 6 months ago

Yup, it would work through the flight server, but only in the sense that the source data will be pulled from there. All the rest of the logic (conversion to arrow and online_write_batch calls) will happen locally (in the case of a local materialization engine, for example). With a RemoteMaterializationEngine, we could offer another implementation that pushes not only pull_latest_from_table_or_query but the whole materialization pipeline to the server.
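The split being described can be sketched like this. The signatures are simplified stand-ins for the real OfflineStore/OnlineStore interfaces; only step (1) goes over the wire when the offline store type is remote, while a hypothetical RemoteMaterializationEngine would push all three steps server-side.

```python
# Sketch of materialization with a *local* engine but a remote offline store.
def materialize_one_view(offline_store, online_store, view, start, end):
    # (1) pulled through the flight server if the store type is remote; still lazy
    job = offline_store.pull_latest_from_table_or_query(view, start, end)
    # (2) local conversion to Arrow, then to rows
    rows = job.to_arrow().to_pylist()
    # (3) local writes into the online store
    online_store.online_write_batch(view, rows)
    return len(rows)
```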

redhatHameed commented 6 months ago

@tokoko @dmartinol and I created a draft PR to get your initial input on whether we're heading in the right direction. Please take a look when you get time. Thanks. cc @jeremyary

tokoko commented 6 months ago

@redhatHameed thanks, great work. Looks good overall, I'll leave comments in the PR.

redhatHameed commented 6 months ago

@tokoko Thanks for the review and comments.

Before moving further with this approach, do you have an idea for implementing a security model that aligns with this (Arrow Flight server) approach and fits with the other components?

tokoko commented 6 months ago

I might be delving into "speculation" territory here, but I can try to describe a high-level overview of what I'm expecting from the security model.

  1. I'd avoid incorporating user management into feast as much as possible. We should probably have a pluggable authentication module (LDAP, OIDC, etc.) that takes a user/password (or token), validates it, and spits out the roles assigned to that particular user. Each server will have to integrate with this module separately: the http feature server will get user/pass from basic auth, while grpc and flight will get them according to their own standard conventions, and each will pass the credentials to the module to get the list of assigned roles.

  2. (Option 1) We enrich the Feature Registry to also contain information about the roles available in the system, and each feast object is annotated with permissions. In other words, the user would run feast apply with something like this:

    admin_role = FeastUserRole(name='admin')
    reader_role = FeastUserRole(name='reader')

    FeatureView(
        name=...,
        schema=...,
        ...,
        permissions={'read': [reader_role], 'write': [admin_role]}
    )

3. (Option 2) Another option is to try to mimic AWS IAM and brush up on our regexes. In this case, instead of annotating objects with permissions, you're annotating roles with policies.

    risk_role = FeastUserRole(
        name='team_risk_role',
        permissions=[
            FeastPermission(
                action='read',  # read_online, read_offline, write_online, write_offline
                conditions=[{'name': 'veryimportant*', 'team': 'risk'}]
            )
        ]
    )

    FeatureView(
        name='very_important_fv',
        schema=...,
        ...,
        tags={'team': 'risk'}
    )


The upside of the second approach is that it's a lot less invasive than the first. You could potentially end up with a setup where permissions and objects are managed with some level of separation between them. I think I'm more in favor of this one.

4. Once a server gets ahold of user roles and permission information from the registry, all components will apply the same "rules engine" to authorize the requested action.
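Tying points 1 and 4 together under the option-2 (IAM-style) model, the shared "rules engine" might look like the sketch below. All class and field names are hypothetical, not Feast APIs; pattern matching uses glob-style `fnmatch` as a stand-in for the regexes mentioned above.

```python
# Sketch: a pluggable auth backend maps credentials to roles, and a shared
# rules engine that every server (HTTP, gRPC, Flight) applies identically.
from fnmatch import fnmatch
from typing import Dict, List

class StaticAuthModule:
    """Toy stand-in for an LDAP/OIDC-backed authentication module."""
    def __init__(self, token_roles: Dict[str, List[dict]]):
        self._token_roles = token_roles

    def authenticate(self, token: str) -> List[dict]:
        if token not in self._token_roles:
            raise PermissionError("invalid credentials")
        return self._token_roles[token]  # roles assigned to this user

def is_authorized(roles: List[dict], action: str,
                  obj_name: str, obj_tags: Dict[str, str]) -> bool:
    """Shared rules engine: each role carries permissions with an action,
    a name pattern, and tag conditions that must all match the object."""
    for role in roles:
        for perm in role["permissions"]:
            if (perm["action"] == action
                    and fnmatch(obj_name, perm["name"])
                    and all(obj_tags.get(k) == v
                            for k, v in perm.get("tags", {}).items())):
                return True
    return False
```

Keeping the engine a pure function of (roles, action, object) is what lets all three servers enforce identical decisions from the same registry metadata.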

P.S. Mostly just thinking out loud here, I might be totally overengineering this 😄

dmartinol commented 6 months ago

I might be delving into "speculation" territory here, but I can try to describe a high-level overview of what I'm expecting from the security model. ... P.S Mostly just thinking out loud here, I might be totally overengineering this 😄

@tokoko, do we have a separate issue dedicated to discussing these details? It appears that the topic is more generic than just the remote offline feature server, and your hints have been really valuable. Having a dedicated space for discussion until we reach a conclusion might be beneficial.

tokoko commented 5 months ago

No, we don't. I'll go ahead and split the last comment off as a separate issue.

franciscojavierarceo commented 5 months ago

@tokoko what's the benefit of a remote server? Aren't we then adding additional network overhead?

The recommended model, in my opinion, should be:

  1. use Feast SDK server side
  2. create SDK to operate with this

(2) can be evaluated using the python server code.

This is how I did it previously and it worked well, but I'm curious to understand the benefits of the remote approach.

tokoko commented 5 months ago

Yes, for most actions it would add network overhead; that's why we're strictly talking about it as an optional add-on. The biggest upside, really the only one, is that a remote deployment allows you to put the offline store behind your own security layer (3A), which is impossible when a client has direct access to the data via the SDK.

2. create SDK to operate with this

I'm not sure what you mean here.

dmartinol commented 5 months ago

  2. create SDK to operate with this

@franciscojavierarceo isn't this the Feast client? I mean, when you configure the client-side store as:

offline_store:
  type: remote
  host: mystore.feast.svc.cluster.local
  port: 8815

and then run queries like fs.get_historical_features(...), the client transparently operates the server using the regular Feast APIs. There's no need to add another client SDK; the remote store type is actually acting as the client for the offline server. (Sorry if I missed the real point! 🤔)