Closed juliusgeo closed 1 year ago
@jrbourbeau We have an explanation of why you should share MongoClients here: https://pymongo.readthedocs.io/en/stable/faq.html#how-does-connection-pooling-work-in-pymongo
Thanks, that doc section is useful. My takeaway is since each client is creating a connection pool, it can be expensive to setup/teardown that pool for each query. Does that sound right?
That sounds right to me. Each connection pool carries with it a lot of overhead due in part to the number of threads that are spun up in the background. Currently, we're using
MongoClient(...)
in a context manager to make sure all the resources the client is managing get closed properly. Given we're now holding onto/reusing clients, do we need to handle any client cleanup, or is that somehow done automatically?
Added an atexit
handler!
Currently, we're using MongoClient(...) in a context manager to make sure all the resources the client is managing get closed properly. Given we're now holding onto/reusing clients, do we need to handle any client cleanup, or is that somehow done automatically?
Yes it's best to close the client explicitly. Ideally, we not only want to close the remaining clients at shutdown but also before they are removed from the LRU cache (eg when _CACHE_SIZE+1 clients are created). Unfortunately, this will complicate the implementation since the lru cache itself has no mechanism for this. Some options I see would be to:
Since I believe the vast majority of users will only need a single MongoClient, we should just ignore this problem and go with option 2).
@jrbourbeau Hi, just wanted to check-in and see if there are any outstanding comments you would like addressed before reviewing again?
@juliusgeo James is on PTO for a week he'll be back next week, I'm sure he'll take a look then.
@juliusgeo and @ShaneHarvey this might have dropped in Jame's email notifications. I'll bring this up in our next meeting. In the meantime, this PR seems to introduce a lot of changes making the code harder to maintain, we want to make sure these changes are worth the technical complexity. You mentioned that this will improve performance, can you show us proof of that? Do you have any benchmarks that we can see to compare the before and after this PR?
@ncclementi
Here are benchmarks for the following operation to test with caching:
b = read_mongo(
connection_kwargs={"host": host_uri},
database="new_database",
collection="new_collection",
chunksize=1,
).take(1)
I tested it both cached (with the changes in this PR) and non-cached (without them).
Scheduler | Avg. Time Cached | Avg. Time Not Cached |
---|---|---|
processes | 353 ms ± 34 ms | 2.17 s ± 824 µs |
thread | 348 ms ± 39.1 ms | 2.2 s ± 24.8 ms |
sync | 777 ms ± 564 ms | 2.2 s ± 3.51 ms |
@jrbourbeau for visibility
@jrbourbeau Happy New Year! Just wanted to ping you again for review.
Hi @juliusgeo apologies for such a delayed response. I'll take a look at the new changes here today.
Hi! I am from the Python Driver team at MongoDB. This PR should add the ability to share clients between shared sets of keyword arguments.