feast-dev / feast

The Open Source Feature Store for Machine Learning
https://feast.dev
Apache License 2.0
5.62k stars 1k forks source link

perf: Add init and cleanup of long lived resources #4642

Closed robhowley closed 4 weeks ago

robhowley commented 1 month ago

What this PR does / why we need it:

update: added pytest-asyncio to ci which is +555 of the diff

Which issue(s) this PR fixes:

Addresses the performance penalty mentioned here by @tokoko

Misc

robhowley commented 1 month ago

not 100% what to make this error in the py39 unit tests

ERROR sdk/python/tests/unit/permissions/auth/server/test_auth_registry_server.py::test_registry_apis[\nauth:\n  type: no_auth\n-applied_permissions0] - RuntimeError: Failed to bind to address [::]:34361; set GRPC_VERBOSITY=debug environment variable to see detailed error message.
franciscojavierarceo commented 1 month ago

@dmartinol @redhatHameed could you take a look?

dmartinol commented 1 month ago

not 100% what to make this error in the py39 unit tests

ERROR sdk/python/tests/unit/permissions/auth/server/test_auth_registry_server.py::test_registry_apis[\nauth:\n  type: no_auth\n-applied_permissions0] - RuntimeError: Failed to bind to address [::]:34361; set GRPC_VERBOSITY=debug environment variable to see detailed error message.

I cloned your repo and it succeeded with no issues, maybe a temporary env issue?

dmartinol commented 1 month ago

Some questions about this change:

@franciscojavierarceo @tokoko

robhowley commented 1 month ago

Some questions about this change:

  • why adding the init/close methods to offline_store if there is no implementation yet?
  • if you really think that offline_store deserves them, you probably need to invoke them on the store instance in offline_store module as well. Probably the feature_server, which actually is more an online_server, will not have an offline_store at all, or may use a remote one.

@franciscojavierarceo @tokoko

yea that's a good point. offline store doesn't make a ton of sense in the server case. was just trying to maintain consistency across the two, but that need not be the case. i'll remove.

tokoko commented 1 month ago

@robhowley thanks, I'm not too sure about the changes though. can you elaborate on why we need this?

Is the client creation actually that expensive that it needs a separate init method? I think what I was pointing out in the comment you linked above was that create_client was being called on every invocation rather than only during the first one and cached afterwards. The first one being slightly slower than others doesn't seem that bad to me.

I get why a separate initializer can be beneficial, but I'm a little uneasy that it pushes us into the mixture of sync/async methods. What if I'm invoking the online read the old-fashioned way w/o asyncio, do I still need to call async initializer first or is that only relevant for async calls?

The same goes for the close method. This has actually come up before in #4401. As I pointed out there, I'm not sure what closing an online store should actually mean. Do calls on a closed online store error out or do they still succeed, but need to recreate the resources? I'd rather not have to answer those questions at all unless you think we absolutely need to.

robhowley commented 1 month ago

The async libs are often (eg aiobotocore, aioboto3, aioredis, aiopg) used in context managers bc they're meant to be closed when we're done w all the relevant operations. Figured this sort of thing comes up a lot and is handled in a few ways

  1. I initially tried to create the client and cache it in the dynamo file for instance as is done for the boto3 sync methods, but then there's no place for you to close it. When using the sdk you'd make an async call which would open the client on first call, have some performance gains on subsequent calls, but then leave the client and associated pool open bc feature store doesn't close and it doesn't expose the provider w the online datastore objects.
  2. There's also no place to create clients at the server level that are then retrieved from the app in the online stores. Adding this wouldn't make much sense bc the sdk is meant to also be usable outside the context of a web service, so it's kinda ruled out.
  3. Expose context manager like functionality so it can be handled both at server level and in the sdk as needed.

I figured it'd be a cleaner boundary adding open and close methods akin to an async context manager on FeatureStore than exposing ways to reach down into the OnlineStore and closing it there. Maybe we just go full async context manager? but that seemed a bit much at first pass. If online stores don't need that initialization and the user calls it anyway, then it's just a no-op.

Since the other aio libraries function similarly it would be a useful construct as async support gets added to other online stores.

robhowley commented 1 month ago

With a little time to mull this over, I'm curious what everyone's thoughts are on this. I'm not attached to the open/close idea, but it does seem like there should be a way to clean up resources. Most of async support will come from libraries expecting to be used in an aync context manager, so some kind of equivalent mechanism should be available.

tokoko commented 1 month ago

@robhowley how would the normal non-async workflow be affected with this? If we put connection creation in an async initializer that won't be called by a non-async user... would we implicitly run an initializer anyway during the first call?

robhowley commented 1 month ago

The sync usage shouldn't be impacted. The async code paths need not be touched by the sync code, so I don't think any async initialization would be needed/called.

# in sdk mode, this shouldn't touch async code
# and as a result shouldn't require any async connections to be created
feature_df = FeatureStore().get_online_features(entities).to_df()
robhowley commented 1 month ago

@franciscojavierarceo @tokoko im not exactly sure what to make of the failing integration tests. created an issues #4658 bc they don't seem to have passed in better part of two months. im assuming you're aware, just (1) thought it worth flagging and (2) curious if you had any opinions on how concerned i should be about it

robhowley commented 1 month ago

@tokoko @franciscojavierarceo btw, some added context on this mr ... i've been running this in prod for a few days and reusing the client reduces event loop lag in the server since creation of the client is a blocking operation. the scalability and speed improvements are obviously quite good as a result. reusing the client, however, should be paired w the ability to clean up the resources.

robhowley commented 1 month ago

would it perhaps be clearer to just go full async context manager? somewhat more explicit that this isnt going to impact the sync code paths?

robhowley commented 1 month ago

@franciscojavierarceo just a heads up, the remaining integration test failure comes from here. the push works, but the subsequent retrieval to confirm it's there, in the case of dynamo, is async. this is a sync function so there's no running event loop. working on a fix to get this green