coiled / dask-mongo

BSD 3-Clause "New" or "Revised" License

ARROW-105 Cache MongoClients created in read/write_mongo calls #26

Closed: juliusgeo closed this 1 year ago

juliusgeo commented 2 years ago

Hi! I am from the Python Driver team at MongoDB. This PR adds the ability to share MongoClients between calls that use the same set of connection keyword arguments.
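
For readers skimming the thread, the idea is roughly the following. This is a minimal illustrative sketch, not the PR's actual code; get_mongo_client and _client_cache are hypothetical names:

```python
# Reuse one MongoClient per unique set of connection kwargs instead of
# creating (and tearing down) a new client on every read/write call.
from pymongo import MongoClient

_client_cache = {}  # hypothetical module-level cache


def get_mongo_client(**connection_kwargs):
    # The kwargs must be hashable to serve as a cache key; unhashable values
    # (e.g. lists) would need extra handling in a real implementation.
    key = tuple(sorted(connection_kwargs.items()))
    if key not in _client_cache:
        _client_cache[key] = MongoClient(**connection_kwargs)
    return _client_cache[key]
```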

juliusgeo commented 2 years ago

@jrbourbeau We have an explanation of why you should share MongoClients here: https://pymongo.readthedocs.io/en/stable/faq.html#how-does-connection-pooling-work-in-pymongo
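
In miniature, the pattern that FAQ section recommends looks like this (placeholder host, database, and collection names):

```python
from pymongo import MongoClient

# One client, and therefore one connection pool, for the whole process.
client = MongoClient("mongodb://localhost:27017")


def count_docs():
    # Reuses connections from the shared pool; no new pool is created per call.
    return client.my_database.my_collection.count_documents({})
```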

juliusgeo commented 2 years ago

> Thanks, that doc section is useful. My takeaway is that since each client creates its own connection pool, it can be expensive to set up/tear down that pool for each query. Does that sound right?

That sounds right to me. Each connection pool carries a lot of overhead with it, due in part to the number of threads that are spun up in the background.

> Currently, we're using MongoClient(...) in a context manager to make sure all the resources the client is managing get closed properly. Given we're now holding onto/reusing clients, do we need to handle any client cleanup, or is that somehow done automatically?

Added an atexit handler!
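
The cleanup idea mentioned here, sketched against the hypothetical _client_cache from the earlier example (not the PR's exact code):

```python
import atexit


def _close_cached_clients():
    # Close every cached client when the interpreter shuts down.
    for client in _client_cache.values():
        client.close()


atexit.register(_close_cached_clients)
```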

ShaneHarvey commented 2 years ago

> Currently, we're using MongoClient(...) in a context manager to make sure all the resources the client is managing get closed properly. Given we're now holding onto/reusing clients, do we need to handle any client cleanup, or is that somehow done automatically?

Yes, it's best to close the clients explicitly. Ideally, we not only want to close the remaining clients at shutdown but also close clients before they are removed from the LRU cache (e.g. when _CACHE_SIZE + 1 clients are created). Unfortunately, that complicates the implementation, since lru_cache itself has no hook for running cleanup on eviction. Some options I see would be to:

  1. Go overboard: reimplement lru_cache so that we can close clients before discarding them.
  2. Ignore the problem: only close clients still in the cache at shutdown, and don't close clients discarded from the cache at runtime.

Since I believe the vast majority of users will only need a single MongoClient, we should just ignore this problem and go with option 2.
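
For reference, option 1 could look roughly like the sketch below. This is illustrative only; the class name and cache size are assumptions, and the PR went with option 2 instead:

```python
from collections import OrderedDict

_CACHE_SIZE = 8  # assumed limit


class _ClosingLRUCache:
    """LRU cache that closes a MongoClient before evicting it."""

    def __init__(self, maxsize=_CACHE_SIZE):
        self._maxsize = maxsize
        self._data = OrderedDict()

    def get_or_create(self, key, factory):
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]
        client = factory()
        self._data[key] = client
        if len(self._data) > self._maxsize:
            # Unlike functools.lru_cache, close the client before discarding it.
            _, evicted = self._data.popitem(last=False)
            evicted.close()
        return client
```

A lookup would then be something like cache.get_or_create(key, lambda: MongoClient(**connection_kwargs)), with the cache held at module level.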

juliusgeo commented 1 year ago

@jrbourbeau Hi, just wanted to check in and see if there are any outstanding comments you would like addressed before reviewing again.

ncclementi commented 1 year ago

@juliusgeo James is on PTO this week and will be back next week; I'm sure he'll take a look then.

ncclementi commented 1 year ago

@juliusgeo and @ShaneHarvey this might have gotten lost in James's email notifications. I'll bring it up in our next meeting. In the meantime: this PR introduces a lot of changes that make the code harder to maintain, so we want to make sure they are worth the added technical complexity. You mentioned that this will improve performance; can you back that up? Do you have any benchmarks we can look at that compare performance before and after this PR?

juliusgeo commented 1 year ago

@ncclementi

Here are benchmarks for the following operation:

```python
from dask_mongo import read_mongo

# host_uri points at the test MongoDB deployment
b = read_mongo(
    connection_kwargs={"host": host_uri},
    database="new_database",
    collection="new_collection",
    chunksize=1,
).take(1)
```

I tested it both cached (with the changes in this PR) and non-cached (without them).
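
The figures below read like timeit-style mean ± standard deviation. One hedged way to reproduce a comparable measurement (not necessarily how these numbers were produced) would be:

```python
import statistics
import time

import dask
from dask_mongo import read_mongo


def time_read(host_uri, scheduler="threads", repeats=7):
    # Time repeated runs of the read under a given scheduler, report mean/stdev.
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        with dask.config.set(scheduler=scheduler):
            read_mongo(
                connection_kwargs={"host": host_uri},
                database="new_database",
                collection="new_collection",
                chunksize=1,
            ).take(1)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)
```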

| Scheduler | Avg. Time (Cached) | Avg. Time (Not Cached) |
|-----------|--------------------|------------------------|
| processes | 353 ms ± 34 ms | 2.17 s ± 824 µs |
| thread | 348 ms ± 39.1 ms | 2.2 s ± 24.8 ms |
| sync | 777 ms ± 564 ms | 2.2 s ± 3.51 ms |

ncclementi commented 1 year ago

@jrbourbeau for visibility

juliusgeo commented 1 year ago

@jrbourbeau Happy New Year! Just wanted to ping you again for review.

jrbourbeau commented 1 year ago

Hi @juliusgeo, apologies for such a delayed response. I'll take a look at the new changes here today.