Closed: jarq6c closed this issue 1 year ago
Hmmm, thanks for reporting this @jarq6c! This is a little surprising given that we are not using multiprocessing. This smells like an upstream bug, but I'll need to do a little digging to verify that.
For a little context: sqlite, the db used by nwis_client to cache http requests, uses a multiple-reader, single-writer model in which a writer locks the whole db for the duration of a write. Consequently, any readers or writers that try to access the db while the lock is held error out. Libraries that interact with sqlite often use a spin lock to get around this.
This error is surprising to me because we are using an async approach to retrieving data instead of threads or multiprocessing, and I would expect writing to the cache to be a blocking operation. I suspect what is happening is a lot of async writes to the cache (db), some of which may be large and take more time, with no logic to enforce sqlite's single-writer constraint. Meaning, you need some way of tracking that a write to the db is occurring. Example solutions are a read-write lock on the db connection object or a retry-with-backoff strategy.
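The retry-with-backoff idea could look something like the sketch below. This is illustrative only, not nwis_client's actual code; the function and parameter names are made up for the example:

```python
import asyncio
import random
import sqlite3

async def write_with_retry(write, retries=5, base_delay=0.05):
    """Call an async write, retrying with exponential backoff while the db is locked."""
    for attempt in range(retries):
        try:
            return await write()
        except sqlite3.OperationalError as exc:
            if "locked" not in str(exc) or attempt == retries - 1:
                raise
            # back off exponentially, with a little jitter to spread retries out
            await asyncio.sleep(base_delay * 2**attempt + random.uniform(0, 0.01))

# Demo: a write that fails twice with a lock error, then succeeds.
attempts = {"n": 0}

async def flaky_write():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise sqlite3.OperationalError("database is locked")
    return "ok"

result = asyncio.run(write_with_retry(flaky_write))
print(result)  # "ok" after two locked attempts
```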
I'll do some digging later this morning and try to find a way around this and report back. Thanks again, @jarq6c!
Starting to dig now. It looks like others have run into similar problems when using aiosqlite. See https://github.com/omnilib/aiosqlite/issues/234.
In that issue (linked above), the suggested solution was to limit the number of coroutines run concurrently by asyncio.gather.
> aiosqlite does not use a connection pool. This is likely a limit of the underlying sqlite3 implementation. I'd suggest using a concurrency-limited gather, like aioitertools.asyncio.gather(..., limit=X), especially with many concurrent writes, where only one connection would ever be able to write to a shared database anyways.
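For reference, a concurrency-limited gather can also be sketched with just the stdlib, using an asyncio.Semaphore. This is similar in spirit to aioitertools.asyncio.gather(..., limit=X), not a drop-in replacement for it:

```python
import asyncio

async def bounded_gather(*coros, limit=4):
    """Run coroutines concurrently, but at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:  # at most `limit` coroutines past this point at once
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))

# Results keep gather's input ordering even though execution is limited.
results = asyncio.run(
    bounded_gather(*(asyncio.sleep(0, result=i) for i in range(10)), limit=2)
)
print(results)  # [0, 1, 2, ..., 9]
```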
asyncio.gather is used in the _restclient's mget method to do concurrent request retrieval, which nwis_client uses. I am not keen on placing a concurrency limit at the library level as suggested above, because it could hinder performance for some users depending on their hardware/software. I could be talked into adding an optional keyword argument that exposes this low-level detail if we keep running into it. However, I think the most viable solution is to increase the number of sites returned in each request. You can pass the max_sites_per_request keyword argument (current default is 20 sites) to IVDataService.get to effectively change the request chunking size. I think increasing this to something like 50 or 75 will reduce the likelihood of running into this problem. @jarq6c, can you try this and see if you continue to run into the same issue?
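To illustrate why this helps: a larger chunk size means fewer total requests, and therefore fewer concurrent cache writes competing for sqlite's single write lock. The helper below is a hypothetical sketch of the chunking arithmetic, not nwis_client's actual implementation:

```python
def chunk_sites(sites, max_sites_per_request=20):
    """Split a site list into request-sized chunks (illustrative helper)."""
    return [
        sites[i : i + max_sites_per_request]
        for i in range(0, len(sites), max_sites_per_request)
    ]

# ~2500 sites, the scale at which the error was reported:
sites = [f"site-{i:08d}" for i in range(2500)]
print(len(chunk_sites(sites, 20)))  # 125 chunks -> 125 concurrent requests
print(len(chunk_sites(sites, 50)))  # 50 chunks -> 50 concurrent requests
```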
@aaraney Increasing max_sites_per_request to 50 fixed the issue.
Of note: I haven't been able to get this error to appear on all systems. I expect this is related to the performance of sqlite on different systems? Do you recall how we settled on the default of 20 sites per request?
Great!
> Of note: I haven't been able to get this error to appear on all systems. I expect this is related to the performance of sqlite on different systems?
I think it's less about sqlite's performance per se and more about I/O speed. So, for example, at the hardware level an SSD vs. an HDD, and at the software level the userspace/kernel-space switching the OS does during reads and writes.
> Do you recall how we settled on the default of 20 sites per request?
I will have to check if I still have the numbers somewhere, but it was a back-of-the-napkin calculation. I requested data at something like 10, 20, 40, 60, 80, 100, and 500 sites, looked at full request-response round-trip times, and 20 was on average the fastest.
Barring submission of another issue, it's probably safe to close this one. The default of 20 sites per request likely covers most users, so changing it might do more harm than good.
Note: the machine on which I got this error was a virtual machine running on an ancient OS, stored on an HDD, and frequently subject to scans that eat up a significant portion of system resources. So, quite an extraordinary case remedied by a simple solution.
Thanks for the help!
When running the following code, I get an error. I've noticed this error seems to occur around the ~2500-site mark; that is, I'm having trouble retrieving data from more than ~2500 sites at a time. The script raises the error and then freezes until I cancel it with Ctrl-C.
Advice?