NOAA-OWP / hydrotools

Suite of tools for retrieving USGS NWIS observations and evaluating National Water Model (NWM) data.

`nwis_client` "sqlite3.OperationalError: database is locked" #215

Closed jarq6c closed 1 year ago

jarq6c commented 1 year ago

When running the following code, I get an error. I've noticed this error seems to occur around the ~2500 site mark. That is, I'm having trouble retrieving data from more than ~2500 sites at a time. The script raises the error and then freezes until I cancel it with ctrl-c.

Advice?

from hydrotools.nwis_client.iv import IVDataService
from hydrotools.nwm_client.gcp import NWMDataService

# Retrieve the NWM crosswalk and keep only rows with all-digit USGS site codes
model_client = NWMDataService()
xwalk = model_client.crosswalk
xwalk = xwalk[xwalk["usgs_site_code"].str.isdigit()]

# Retrieve NWIS instantaneous-value observations for every crosswalked site
obs_client = IVDataService()
data = obs_client.get(sites=xwalk["usgs_site_code"].values)

Traceback (most recent call last):
  File "main.py", line 15, in <module>
    main()
  File "main.py", line 10, in main
    data = obs_client.get(sites=xwalk["usgs_site_code"].values)
  File "~/env/lib64/python3.8/site-packages/hydrotools/nwis_client/_utilities.py", line 93, in wrapper
    return fn(*args, **kwargs)
  File "~/env/lib64/python3.8/site-packages/hydrotools/nwis_client/iv.py", line 236, in get
    raw_data = self.get_raw(
  File "~/env/lib64/python3.8/site-packages/hydrotools/nwis_client/iv.py", line 446, in get_raw
    results = self._restclient.mget(parameters=query_params, headers=self._headers)
  File "~/env/lib64/python3.8/site-packages/forge/_revision.py", line 328, in inner
    return callable(*mapped.args, **mapped.kwargs)
  File "~/env/lib64/python3.8/site-packages/hydrotools/_restclient/_restclient.py", line 162, in mget
    return self._add_to_loop(
  File "~/env/lib64/python3.8/site-packages/hydrotools/_restclient/async_helpers.py", line 23, in _add_to_loop
    return self._loop.run_until_complete(coro)
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "~/env/lib64/python3.8/site-packages/forge/_revision.py", line 322, in inner
    return await callable(*mapped.args, **mapped.kwargs)
  File "~/env/lib64/python3.8/site-packages/hydrotools/_restclient/_restclient.py", line 231, in _mget
    return await asyncio.gather(
  File "~/env/lib64/python3.8/site-packages/forge/_revision.py", line 322, in inner
    return await callable(*mapped.args, **mapped.kwargs)
  File "~/env/lib64/python3.8/site-packages/hydrotools/_restclient/_restclient.py", line 184, in _get
    resp = await self._session.get(url, headers=_headers, **kwargs)
  File "~/env/lib64/python3.8/site-packages/forge/_revision.py", line 322, in inner
    return await callable(*mapped.args, **mapped.kwargs)
  File "~/env/lib64/python3.8/site-packages/hydrotools/_restclient/async_client.py", line 49, in _request
    resp = await bound_resp()
  File "~/env/lib64/python3.8/site-packages/forge/_revision.py", line 322, in inner
    return await callable(*mapped.args, **mapped.kwargs)
  File "~/env/lib64/python3.8/site-packages/aiohttp_client_cache/session.py", line 62, in _request
    await self.cache.save_response(new_response, actions.key, actions.expires)
  File "~/env/lib64/python3.8/site-packages/aiohttp_client_cache/backends/base.py", line 169, in save_response
    await self.responses.write(cache_key, cached_response)
  File "~/env/lib64/python3.8/site-packages/aiohttp_client_cache/backends/sqlite.py", line 189, in write
    await super().write(key, sqlite3.Binary(self.serialize(item)))
  File "~/env/lib64/python3.8/site-packages/aiohttp_client_cache/backends/sqlite.py", line 170, in write
    await db.execute(
  File "~/env/lib64/python3.8/site-packages/aiosqlite/core.py", line 190, in execute
    cursor = await self._execute(self._conn.execute, sql, parameters)
  File "~/env/lib64/python3.8/site-packages/aiosqlite/core.py", line 133, in _execute
    return await future
  File "~/env/lib64/python3.8/site-packages/aiosqlite/core.py", line 106, in run
    result = function()
sqlite3.OperationalError: database is locked
aaraney commented 1 year ago

Hmmm, thanks for reporting this @jarq6c! This is a little surprising given that we are not using multiprocessing. This smells like an upstream bug, but I'll need to do a little digging to verify that.

For a little context: SQLite, the database nwis_client uses to cache HTTP requests, follows a multiple-reader, single-writer model in which the whole database is locked while a write is in progress. Consequently, any reader or writer that tries to access the database while the lock is held errors out. Libraries that interact with SQLite often work around this with a spin lock, i.e. they wait and retry until the lock clears.
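As a rough illustration (unrelated to how _restclient is wired up internally), the standard sqlite3 module already exposes a simple version of that wait-and-retry behavior through the connection timeout; the file and table names below are made up for the example:

import sqlite3

# sqlite3's built-in wait-and-retry: `timeout` is how many seconds a
# connection keeps retrying while another writer holds the lock before
# it raises "database is locked".
conn = sqlite3.connect("cache.sqlite", timeout=30.0)

with conn:  # the context manager wraps the write in a transaction and commits
    conn.execute(
        "CREATE TABLE IF NOT EXISTS responses (key TEXT PRIMARY KEY, value BLOB)"
    )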

This error is surprising to me because we are using an async approach to retrieving data instead of threads or multiprocessing, so I would expect writing to the cache to be a blocking operation. I suspect what is happening is that many async writes to the cache (which might be large in size and take more time) are issued with no logic to enforce SQLite's single-writer constraint. Meaning, you need some way of knowing that a write to the database is already in progress. Example solutions are a read-write lock on the database connection object or a retry-with-backoff strategy.
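For illustration only (this is not hydrotools or aiohttp-client-cache code), a minimal sketch of both ideas, assuming `db` is an aiosqlite connection:

import asyncio
import random
import sqlite3

write_lock = asyncio.Lock()  # shared by every coroutine that writes to the cache

async def serialized_write(db, sql, parameters):
    # Option 1: a lock so only one coroutine at a time attempts a write.
    async with write_lock:
        await db.execute(sql, parameters)
        await db.commit()

async def write_with_backoff(db, sql, parameters, attempts=5):
    # Option 2: retry with exponential backoff (plus jitter) whenever the
    # database reports that it is locked.
    for attempt in range(attempts):
        try:
            await db.execute(sql, parameters)
            await db.commit()
            return
        except sqlite3.OperationalError as err:
            if "database is locked" not in str(err) or attempt == attempts - 1:
                raise
            await asyncio.sleep(0.05 * 2 ** attempt + random.uniform(0, 0.05))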

I'll do some digging later this morning, try to find a way around this, and report back. Thanks again, @jarq6c!

aaraney commented 1 year ago

Starting to dig now. It looks like others have run into similar problems when using aiosqlite. See https://github.com/omnilib/aiosqlite/issues/234.

aaraney commented 1 year ago

In https://github.com/omnilib/aiosqlite/issues/234 (the issue I linked above), one suggested solution was to limit how many coroutines asyncio.gather runs concurrently:

aiosqlite does not use a connection pool. This is likely a limit of the underlying sqlite3 implementation. I'd suggest using a concurrency-limited gather, like aioitertools.asyncio.gather(..., limit=X), especially with many concurrent writes, where only one connection would ever be able to write to a shared database anyways.
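In other words, cap how many request-plus-cache-write coroutines are in flight at once. The aioitertools form is the one quoted above; with only the standard library, the usual pattern is a semaphore (sketch below, limit chosen arbitrarily):

import asyncio

async def gather_limited(coros, limit=10):
    # Like asyncio.gather, but with at most `limit` coroutines in flight at
    # once, so only a handful of cache writes can contend for sqlite's single
    # write lock at any moment.
    semaphore = asyncio.Semaphore(limit)

    async def bounded(coro):
        async with semaphore:
            return await coro

    return await asyncio.gather(*(bounded(c) for c in coros))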

asyncio.gather is used in the _restclient's mget method to retrieve requests concurrently, which is what nwis_client relies on. I am not keen on placing a concurrency limit at the library level as suggested above, because it could hinder performance for some users depending on their hardware and software. I could be talked into adding an optional keyword argument that exposes this low-level detail if we keep running into it.

However, I think the most viable solution is to increase the number of sites returned by each request. You can pass the max_sites_per_request keyword argument (the current default is 20 sites) to IVDataService.get to change the request chunking size. I think increasing this to something like 50 or 75 will reduce the likelihood of running into this problem. @jarq6c, can you try this and see if you continue to run into the same issue?
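Concretely, picking up the xwalk from the snippet in the original report, that would look something like this (50 is just the suggested starting point):

from hydrotools.nwis_client.iv import IVDataService

obs_client = IVDataService()
# Larger chunks mean fewer simultaneous requests, and therefore fewer
# simultaneous cache writes contending for the sqlite write lock.
data = obs_client.get(
    sites=xwalk["usgs_site_code"].values,
    max_sites_per_request=50,
)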

jarq6c commented 1 year ago

@aaraney Increasing max_sites_per_request to 50 fixed the issue.

Of note: I haven't been able to get this error to appear on all systems. I expect this is related to the performance of sqlite on different systems? Do you recall how we settled on the default of 20 sites per request?

aaraney commented 1 year ago

Great!

Of note: I haven't been able to get this error to appear on all systems. I expect this is related to the performance of sqlite on different systems?

I think it's less about sqlite's performance per se and more about I/O speed. So, for example, at the hardware level an SSD vs. an HDD, and at the software level the user-space/kernel-space switching the OS does during reads and writes.

Do you recall how we settled on the default of 20 sites per request?

I will have to look and see if I still have the numbers somewhere, but it was a back-of-the-napkin calculation. I requested data at something like 10, 20, 40, 60, 80, 100, and 500 sites per request, looked at the full request-response round-trip times, and 20 was on average the fastest.
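Something along these lines, if anyone wants to repeat the experiment on their own hardware (the enable_cache keyword is an assumption here, used so repeat requests are not served from the local cache and skew the timings):

import time
from hydrotools.nwis_client.iv import IVDataService

sites = xwalk["usgs_site_code"].values  # a long list of USGS site codes, as in the original report

for chunk_size in (10, 20, 40, 60, 80, 100, 500):
    # enable_cache=False is assumed to disable the local sqlite response cache
    client = IVDataService(enable_cache=False)
    start = time.perf_counter()
    client.get(sites=sites, max_sites_per_request=chunk_size)
    print(f"{chunk_size:>3} sites/request: {time.perf_counter() - start:.1f} s")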

jarq6c commented 1 year ago

Barring submission of another issue, it's probably safe to close this one. The default of 20 sites per request likely covers most users, so changing it might do more harm than good.

Note the machine on which I got this error was a virtual machine, running on an ancient OS, stored on an HDD, and frequently subject to scans that eat up a significant portion of system resources. So, quite an extraordinary case remedied by a simple solution.

Thanks for the help!