tokoko closed this issue 4 months ago.
Hi @tokoko,
With regard to the first option:
1- When you write "client-side code should be virtually indistinguishable",
it means that when the client invokes fs.get_historical_features(...)
this would be transparently forwarded to the Arrow Server, right? (*) I assume that this may happen because the client feature store has an offline store configured in a similar manner:
project: mystore
provider: local
offline_store:
  type: arrow # or remote, or similar
  host: mystore.feast.svc.cluster.local
  port: 8815
Can you confirm it matches, at least in general terms, your initial idea?
An alternative could be to define an ad-hoc RemoteFeatureStore that avoids the need for such configuration: fs = RemoteFeatureStore(host='mystore.feast.svc.cluster.local', port=8815)
2- If the Arrow Server wraps calls to offline store implementations, I assume that Hopsworks' usage of the duckdb engine is not needed in our case: it was meant to replace and optimize the original offline store provider, but it would require a specific implementation for each of them. WDYT? Maybe we can have a separate ticket to optimize the queries instead?
3- Finally, why shouldn't we instead extend the feature server to add an endpoint for the offline features? Is it just a matter of the scale of the data being processed, or are there other factors at play? The reason for asking is that we're going to introduce another service into the architecture, possibly with multiple instances in the same cluster in the case of different repos, and this complexity could confuse users regarding the chosen direction.
(*) To generalize this concept even further, and to open the door to applying a consistent security model to the SDK clients, we could also think about implementing RemoteOnlineStore and RemoteRegistryStore the same way (based on either the Arrow Server or the existing Feature Server), to simplify the interaction with the remote server(s) and have a unified client API hiding the complexity of the REST/gRPC calls.
Hey, thanks for getting involved here :)
Yes, that's exactly what I mean. The user sets the store type to remote (for example) and the FeatureStore client forwards the requests to a remote server rather than executing directly. I also agree on the "generalize" point; I think we should have remote options for all three components (online store, offline store, registry):
- … remote client as well. (I may be wrong, but I think Lokesh is working on this.)
- … remote client for it (#3941)
- … remote client implementation.

Yeah, duckdb is just a choice that they made. feast, unlike hopsworks, tries to cover as many different configurations as possible (for better or worse 😄) and lets the user choose whatever fits their needs. Any performance optimizations should happen separately at the engine level. In our case I think the offline server administrator should have the option to set the offline store type to any of the supported engines (incl. duckdb), and that will be the one the flight server forwards requests to.
Imho that's the result of the inherent duality of feature stores in general; the whole point is to be a bridge between the offline world and the online one, after all. Even if we ignore technical difficulties, I think the usage pattern is very different between these components: the online store is almost always queried by some software application and is usually very critical in terms of latency and availability, while the offline store is sometimes queried by periodic batch production jobs but is mainly in service of data scientists running their experiments on their own schedule.

That's just the conceptual part. From a more technical standpoint, the scale is the first obvious difference. Another problem is that we already have two feature servers (http and grpc); which one would we use as a base? We might have to abandon one of them if we take a "single server" route. Also, I've tried in the past to combine a flight server with normal protobuf rpc service definitions (unrelated to feast) and failed miserably. It's probably doable, but at the cost of going against the way 99% of people out there do these things.
Just one thought: for ease of use, the SDK or remote client could keep the same feature_store.yaml file and simply add a flag to indicate whether it's a remote server. For example:
offline_store:
  is_remote_server: true # flag to indicate whether it's a remote server
  type: postgres
  host: localhost
  port: 5432
The feature_store.yaml content can be base64-encoded and stored as an environment variable for the offline store server, as is done for the online store.
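As a sketch of that trick (the FEATURE_STORE_YAML_BASE64 variable name mirrors the online feature server's deployment convention and is an assumption here; the file contents and paths are illustrative):

```shell
# Encode the server-side config and round-trip it through an env var.
cd "$(mktemp -d)"
printf 'offline_store:\n  type: postgres\n  host: localhost\n  port: 5432\n' > feature_store.yaml
# FEATURE_STORE_YAML_BASE64 is assumed to match the online store convention
export FEATURE_STORE_YAML_BASE64="$(base64 -w0 feature_store.yaml)"
# The offline store server would decode this back to a file before starting:
printf '%s' "$FEATURE_STORE_YAML_BASE64" | base64 -d > decoded_feature_store.yaml
```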
@redhatHameed I'm not sure a flag is necessary. If we replicate the way the remote registry works, on the server side feature_store.yaml will look identical to how it looks now. The only difference is that someone will need to run a feast serve offline command (just an example, something like this) to start the server. (For a k8s deployment, one would set the command field to ["feast", "serve", "offline"].):
offline_store:
  type: postgres
  host: localhost
  port: 5432
On the client side, the user will have to configure their client to query the server; it should have no knowledge whatsoever about which type of offline store the server is wrapping. So feature_store.yaml for the client will look something like this:
offline_store:
  type: remote-flight
  host: mystore.feast.svc.cluster.local
  port: 8815
@tokoko Thanks for clarifying, that makes sense.
@tokoko just to validate our understanding, is the expectation to implement all the methods of the OfflineStore interface (currently 5) in both the server and the client?
@dmartinol yes, but let me go a bit more into detail just to make sure we're on the same page. The first three of those five methods are lazy; in other words, they don't return datasets themselves, they return RetrievalJob objects that can be used by the user to get the dataset. So while yes, there should be a way to invoke GetFlightInfo on the server with all 3 of them (with dedicated proto message types, I guess), that invocation should only happen when the user does something like store.get_historical_features(...).to_arrow() (or .to_df()).
If the user instead opts to call the to_remote_storage() method on the RetrievalJob object, then the client implementation will probably have to call the server with similar message types, but on the DoAction endpoint instead of GetFlightInfo, as there's no data to be returned directly.
Agree, we must respect the lazy behavior. @redhatHameed we need to experiment on this one as a first step.
@tokoko one more clarification: what do you think about the materialization functionality? Will this be part of the remote offline store deployment? The Feast documentation mentions it as primary functionality.
@dmartinol I think starting with pull_latest_from_table_or_query might be easiest to get the basic functionality going, as there's no need to upload anything to the server.
@redhatHameed that's a very good point. I haven't thought much about it, but I think it should depend on which materialization engine the user has configured.
We could have a RemoteMaterializationEngine (similar to this RemoteOfflineStore) that would invoke the same flight server to push the entire materialization call to the server. So I guess what I'm saying is that the server we build now will have to (eventually) act as a backend for both the remote offline store and the remote materialization engine. Does that make sense?

Thanks @tokoko, in that case we can create a new issue to address the implementation of the RemoteMaterializationEngine, while keeping the focus of this ticket on implementing the RemoteOfflineStore. cc @dmartinol
@tokoko, I see that all the implementations of BatchMaterializationEngine
already fetch the offline data using the OfflineStore
interface, as in:
offline_job = self.offline_store.pull_latest_from_table_or_query(...)
That said, if the client doing the materialization already has a remote offline_store, I assume that materialization would work through the flight server, correct? What else could a RemoteMaterializationEngine add in this case?
Yup, it would work through the flight server, but only in the sense that the source data will be pulled from there. All the rest of the logic (conversion to arrow and online_write_batch calls) will happen locally (in the case of the local materialization engine, for example). With a RemoteMaterializationEngine, we could offer another implementation that pushes not only pull_latest_from_table_or_query, but the whole materialization pipeline to the server.
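The split described above can be made concrete with a toy sketch: with a local materialization engine, only the data pull would go over Flight, while conversion and online writes happen in the engine's own process. The classes below are simplified stand-ins, not feast's real interfaces.

```python
class RemoteOfflineStoreStub:
    """Stand-in for a remote offline store client."""
    def pull_latest_from_table_or_query(self):
        # in the real client this would be a GetFlightInfo/DoGet round-trip
        return [{"driver_id": 1001, "conv_rate": 0.5}]

class OnlineStoreStub:
    """Stand-in for an online store."""
    def __init__(self):
        self.rows = []

    def online_write_batch(self, rows):
        # executed locally by the materialization engine
        self.rows.extend(rows)

def materialize(offline_store, online_store):
    """What a local engine does; a RemoteMaterializationEngine would instead
    push this whole function's work to the flight server via DoAction."""
    rows = offline_store.pull_latest_from_table_or_query()
    online_store.online_write_batch(rows)
    return len(rows)
```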
@tokoko @dmartinol and I created a draft PR to get your initial input on whether we're heading in the right direction. Please take a look when you get time. Thanks. cc @jeremyary
@redhatHameed thanks, great work. looks good overall, I'll leave the comments in the PR.
@tokoko Thanks for the review and comments.
Before moving further with this approach, do you have an idea for implementing a security model that aligns with this (Arrow Flight server) approach and fits with the other components?
I might be delving into "speculation" territory here, but I can try to describe a high-level overview of what I'm expecting from the security model.
I'd avoid incorporating user management into feast as much as possible. We should probably have a pluggable authentication module (LDAP, OIDC, etc...) that takes user/password (or token), validates it and spits out the roles that have been assigned to this particular user. Each server will have to integrate with this module separately, http feature server will get user/pass from basic auth, grpc and flight will get them according to their own standard conventions and pass credentials to the module to get the list of assigned roles.
(Option 1) We enrich the Feature Registry to also contain information about the roles available in the system, and each feast object is annotated with permissions. In other words, the user would run feast apply with something like this:

admin_role = FeastUserRole(name='admin')
reader_role = FeastUserRole(name='reader')

FeatureView(
    name=...,
    schema=...,
    ...,
    permissions={'read': [reader_role], 'write': [admin_role]}
)
(Option 2) Another option is to try to mimic AWS IAM and brush up on our regexes. In this case, instead of annotating objects with permissions, you're annotating roles with policies.
risk_role = FeastUserRole(
    name='team_risk_role',
    permissions=[
        FeastPermission(
            action='read',  # read_online, read_offline, write_online, write_offline
            conditions=[{'name': 'veryimportant*', 'team': 'risk'}]
        )
    ]
)

FeatureView(
    name='very_important_fv',
    schema=...,
    ...,
    tags={'team': 'risk'}
)
The upside of the second approach is that it's a lot less invasive than the first one. You could potentially end up with a setup where permissions and objects are managed with some level of separation between them. I think I'm more in favor of this.
Once a server gets ahold of user roles and permission information from the registry, all components will apply the same "rules engine" to authorize the requested action.
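Such a rules engine could be as simple as matching the requested action and the target object's name and tags against each permission's conditions. A minimal sketch, assuming permissions arrive as plain dicts shaped like the FeastPermission example (that serialization is hypothetical): a condition's 'name' key is treated as a glob over object names, and every other key must match an object tag exactly.

```python
from fnmatch import fnmatch

def is_authorized(action, obj_name, obj_tags, permissions):
    """Return True if any permission grants `action` on the given object."""
    for perm in permissions:
        if perm["action"] != action:
            continue  # permission is for a different action
        for cond in perm["conditions"]:
            # 'name' is a glob pattern over object names
            name_ok = fnmatch(obj_name, cond.get("name", "*"))
            # all remaining keys must match the object's tags exactly
            tags_ok = all(
                obj_tags.get(key) == value
                for key, value in cond.items()
                if key != "name"
            )
            if name_ok and tags_ok:
                return True
    return False
```

Every server (http, grpc, flight) could then call the same function after resolving the user's roles, which keeps the authorization logic identical across components.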
P.S Mostly just thinking out loud here, I might be totally overengineering this 😄
I might be delving into "speculation" territory here, but I can try to describe a high-level overview of what I'm expecting from the security model. ... P.S Mostly just thinking out loud here, I might be totally overengineering this 😄
@tokoko, do we have a separate issue dedicated to discussing these details? It appears that the topic is more generic than just the remote offline feature server, and your hints have been really valuable. Having a dedicated space for discussion until we reach a conclusion might be beneficial.
No, we don't. I'll go ahead and split the last comment off as a separate issue.
@tokoko what's the benefit of a remote server? Aren't we then adding additional network overhead?
The recommended model, in my opinion, should be:
(2) can be evaluated using the python server code.
This is how I did it previously and it worked well but curious to understand benefits of the remote approach.
Yes, for most actions it would add additional network overhead; that's why we're strictly talking about it as an optional add-on. The biggest upside, the only one really, is that a remote deployment allows you to put it behind your own security layer (3A), which is impossible when a client has direct access to the data with the SDK.
2. create SDK to operate with this
I'm not sure what you mean here.
- create SDK to operate with this
@franciscojavierarceo isn't this the Feast client? I mean, when you configure the client-side store as:
offline_store:
  type: remote
  host: mystore.feast.svc.cluster.local
  port: 8815
and then run queries like fs.get_historical_features(...), the client transparently operates against the server using the regular Feast APIs. There's no need to add another client SDK; the remote store type is actually acting as the client for the offline server.
(sorry if I missed the real point! 🤔 )
Is your feature request related to a problem? Please describe.
Currently Feast has the capability to deploy both online store (feature server) and registry as standalone services that feast library can access remotely. I'd like to see the same capability with offline store deployment. The primary goal of the feature is to enable fully remote deployment of all feast components and in future enable a common enterprise-level security model for all three components.
Describe the solution you'd like
There are two possible general solutions I have in mind:
We can deploy an offline feature server as an Apache Arrow Flight Server that will wrap calls to offline store implementations and expose the interface methods as Arrow Flight endpoints. This is also something that Hopsworks seems to have implemented before; they are using the duckdb engine behind a flight server.
The offline feature server can be deployed as a lightweight rest or grpc server that only handles requests and metadata and utilizes cloud filesystems for data transfer to and from the user. For example, in the case of s3, the feature server would write out the results as a parquet dataset to a folder in an s3 bucket, generate presigned urls for it, and return a list of urls to the caller. The idea is to emulate the mechanics of delta sharing but avoid its limitations (in delta-sharing, the dataset needs to be in delta format and metadata responses must be immediate, which is obviously not an option for feast).

In both scenarios, client-side code should be virtually indistinguishable from other offline store implementations. The particulars of the implementations will be hidden behind the RemoteOfflineStore class.