fluves / pywaterinfo

Python package to download time series data from waterinfo.be
https://fluves.github.io/pywaterinfo/
MIT License
17 stars 9 forks source link

Long initialization and data loading time due to too much requests in cache #52

Closed frievb closed 1 year ago

frievb commented 2 years ago

Description

I was using pywaterinfo intensely to load a lot of data and was reloading this data multiple times a day to keep track of the last incoming measurements of some measurement stations. Suddenly the initialization time went up from less than a second to 30 - 60 minutes. Likewise, data loading times went up extremely .

Solution

The big amount of data requests probably caused such an increase of data stored in the pywaterinfo cache memory that it slowed down processing speeds. This was easily solved by clearing the cache:

from pywaterinfo import Waterinfo
vmm = Waterinfo("vmm")
vmm.clear_cache()
stijnvanhoey commented 2 years ago

@frievb thanks a lot for reporting the issue and documenting the solution. I added an additional warning to the caching section of the documentation to document this to other users as well, see https://github.com/fluves/pywaterinfo/pull/53

frievb commented 1 year ago

@stijnvanhoey approximately the last month me and my colleagues are having trouble with getting data from Waterinfo via pywaterinfo. Sometimes we succeed in getting data, often not. We cannot see any logic in when we succeed or when we do not. It seems to be a problem related to the cache kept. The problem occurs just after connection with the database is made. F.e. while running following code:

from pywaterinfo import Waterinfo
import logging

loglevel= "DEBUG"
numeric_level = getattr(logging, loglevel.upper(), None)
logging.basicConfig(level=numeric_level, format='%(asctime)s %(name)s %(filename)s %(levelname)s: %(message)s')
logger = logging.getLogger(__name__)

vmm = Waterinfo('vmm')
vmm.clear_cache()

The code keeps running on the line vmm = Waterinfo('vmm') The resulting log:

2023-03-31 09:42:41,605 __main__ from pywaterinfo import Waterinfo.py INFO: 2023-03-31 09:42:41.605363
2023-03-31 09:42:41,605 requests_cache.backends __init__.py DEBUG: Initializing backend: sqlite pywaterinfo_cache.sqlite
2023-03-31 09:42:41,605 requests_cache.backends.base base.py DEBUG: Initialized SQLiteDict with serializer: SerializerPipeline(name=pickle, n_stages=2)
2023-03-31 09:42:41,621 requests_cache.backends.sqlite sqlite.py DEBUG: Opening connection to C:\Users\_USER_\AppData\Local\Temp\pywaterinfo_cache.sqlite:responses
2023-03-31 09:42:41,629 requests_cache.backends.base base.py DEBUG: Initialized SQLiteDict with serializer: None
2023-03-31 09:42:41,630 requests_cache.backends.sqlite sqlite.py DEBUG: Opening connection to C:\Users\_USER_\AppData\Local\Temp\pywaterinfo_cache.sqlite:redirects
2023-03-31 09:42:41,637 urllib3.connectionpool connectionpool.py DEBUG: Starting new HTTP connection (1): download.waterinfo.be:80
2023-03-31 09:42:41,719 urllib3.connectionpool connectionpool.py DEBUG: [http://download.waterinfo.be:80](https://eur02.safelinks.protection.outlook.com/?url=http%3A%2F%2Fdownload.waterinfo.be%2F&data=05%7C01%7C%7Cd0042556002b4402d25108db31e59cd0%7C9e2777ed82374ab992782c144d6f6da3%7C0%7C0%7C638158634490507352%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=4uKWYCXQyyg2cYd4oBi6ixMIL9Tf9nyKoXHEmlz%2BDtU%3D&reserved=0) "POST /kiwis-auth/token HTTP/1.1" 201 549
2023-03-31 09:42:41,721 root waterinfo.py INFO: Current token expires on 2023-04-01 09:42:41.721972
2023-03-31 09:42:41,723 requests_cache.policy.actions actions.py DEBUG: Cache directives from request headers: CacheDirectives()
2023-03-31 09:42:41,724 requests_cache.policy.actions actions.py DEBUG: Pre-read cache checks: Passed
2023-03-31 09:42:41,731 requests_cache.policy.actions actions.py DEBUG: Post-read cache actions: CacheActions(expire_after=datetime.timedelta(days=7), send_request=True)
2023-03-31 09:42:41,736 urllib3.connectionpool connectionpool.py DEBUG: Starting new HTTPS connection (1): download.waterinfo.be:443
2023-03-31 09:42:41,809 urllib3.connectionpool connectionpool.py DEBUG: [https://download.waterinfo.be:443](https://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdownload.waterinfo.be%2F&data=05%7C01%7C%7Cd0042556002b4402d25108db31e59cd0%7C9e2777ed82374ab992782c144d6f6da3%7C0%7C0%7C638158634490507352%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=tG1wwAR0Fo5AgnXVQWbmRAYMln0tobcwcdD220pL4Zs%3D&reserved=0) "GET /tsmdownload/KiWIS/KiWIS?request=getRequestInfo&service=kisters&type=QueryServices&format=json&datasource=1&timezone=UTC HTTP/1.1" 304 0
2023-03-31 09:42:41,810 requests_cache.policy.actions actions.py DEBUG: Pre-write cache checks: disabled status
2023-03-31 09:42:41,810 requests_cache.policy.actions actions.py DEBUG: Response for URL [https://download.waterinfo.be/tsmdownload/KiWIS/KiWIS?request=getRequestInfo&service=kisters&type=QueryServices&format=json&datasource=1&timezone=UTC](https://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdownload.waterinfo.be%2Ftsmdownload%2FKiWIS%2FKiWIS%3Frequest%3DgetRequestInfo%26service%3Dkisters%26type%3DQueryServices%26format%3Djson%26datasource%3D1%26timezone%3DUTC&data=05%7C01%7C%7Cd0042556002b4402d25108db31e59cd0%7C9e2777ed82374ab992782c144d6f6da3%7C0%7C0%7C638158634490507352%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=%2BGUGxrtbJRIUQUUoHZqefnoRDaneWdRRg25Ep6yzQTA%3D&reserved=0) has not been modified
2023-03-31 09:42:41,871 root waterinfo.py INFO: Request [https://download.waterinfo.be/tsmdownload/KiWIS/KiWIS?datasource=1&format=json&request=getRequestInfo&service=kisters&timezone=UTC&type=QueryServices](https://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdownload.waterinfo.be%2Ftsmdownload%2FKiWIS%2FKiWIS%3Fdatasource%3D1%26format%3Djson%26request%3DgetRequestInfo%26service%3Dkisters%26timezone%3DUTC%26type%3DQueryServices&data=05%7C01%7C%7Cd0042556002b4402d25108db31e59cd0%7C9e2777ed82374ab992782c144d6f6da3%7C0%7C0%7C638158634490507352%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=jZ2iWeknu6OOWFEke40zbG8RxqCiXK6RDcA8wyE3RyQ%3D&reserved=0) reused from cache.
2023-03-31 09:42:41,873 requests_cache.backends.base base.py INFO: Resetting expiration with: 7 days, 0:00:00

Sometimes, this also occurred while running code just after clearing the cache (with _vmm.clearcache()). So maybe there is something else going on as well?

stijnvanhoey commented 1 year ago

Thanks for reporting. We can certainly improve the caching logic, so let's try to understand what is happening with the cache and provide a better implementation. A first option could be that with the current fixed-period the cache becomes too large to be manageable. Could you please check:

frievb commented 1 year ago
stijnvanhoey commented 1 year ago

Once that file is removed, the issue seems to be solved

As the size is not the issue but the file existence itself, maybe some interference with other processes?

Furthermore, could you provide the versions you use for the packages? When using pip/virtualenv you can use !pip list or with conda !conda list. Of main interest is pywaterinfo itself and requests, requests-cache.

With versions:

Package            Version
------------------ ---------
...
pywaterinfo        0.6.0
requests           2.28.2
requests-cache     1.0.1
...
urllib3            1.26.15

my output is a bit different using your example code:

➜ python
Python 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:26:04) [GCC 10.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pywaterinfo import Waterinfo
>>> loglevel= "DEBUG"
>>> numeric_level = getattr(logging, loglevel.upper(), None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'logging' is not defined
>>> import logging
>>> numeric_level = getattr(logging, loglevel.upper(), None)
>>> logging.basicConfig(level=numeric_level, format='%(asctime)s %(name)s %(filename)s %(levelname)s: %(message)s')
>>> vmm = Waterinfo('vmm')
2023-03-31 17:48:44,018 requests_cache.backends __init__.py DEBUG: Initializing backend: sqlite pywaterinfo_cache.sqlite
2023-03-31 17:48:44,018 requests_cache.backends.base base.py DEBUG: Initialized SQLiteDict with serializer: SerializerPipeline(name=pickle, n_stages=2)
2023-03-31 17:48:44,020 requests_cache.backends.sqlite sqlite.py DEBUG: Opening connection to /tmp/pywaterinfo_cache.sqlite:responses
2023-03-31 17:48:44,040 requests_cache.backends.base base.py DEBUG: Initialized SQLiteDict with serializer: None
2023-03-31 17:48:44,041 requests_cache.backends.sqlite sqlite.py DEBUG: Opening connection to /tmp/pywaterinfo_cache.sqlite:redirects
2023-03-31 17:48:44,059 requests_cache.policy.actions actions.py DEBUG: Cache directives from request headers: CacheDirectives()
2023-03-31 17:48:44,059 requests_cache.policy.actions actions.py DEBUG: Pre-read cache checks: Passed
2023-03-31 17:48:44,060 requests_cache.policy.actions actions.py DEBUG: Post-read cache actions: CacheActions(expire_after=datetime.timedelta(days=7), send_request=True)
2023-03-31 17:48:44,063 urllib3.connectionpool connectionpool.py DEBUG: Starting new HTTPS connection (1): download.waterinfo.be:443
2023-03-31 17:48:44,190 urllib3.connectionpool connectionpool.py DEBUG: https://download.waterinfo.be:443 "GET /tsmdownload/KiWIS/KiWIS?request=getRequestInfo&service=kisters&type=QueryServices&format=json&datasource=1&timezone=UTC HTTP/1.1" 200 12904
2023-03-31 17:48:44,194 requests_cache.policy.actions actions.py DEBUG: Pre-write cache checks: Passed
2023-03-31 17:48:44,208 root waterinfo.py INFO: Successful waterinfo API request with call https://download.waterinfo.be/tsmdownload/KiWIS/KiWIS?request=getRequestInfo&service=kisters&type=QueryServices&format=json&datasource=1&timezone=UTC
2023-03-31 17:48:44,214 requests_cache.backends.base base.py INFO: Resetting expiration with: 7 days, 0:00:00
2023-03-31 17:48:44,229 requests_cache.backends.base base.py DEBUG: Deleting up to 0 responses
>>> vmm.clear_cache()
2023-03-31 17:48:49,045 requests_cache.backends.base base.py INFO: Clearing all items from the cache
2023-03-31 17:48:49,059 requests_cache.backends.sqlite sqlite.py DEBUG: Opening connection to /tmp/pywaterinfo_cache.sqlite:responses
2023-03-31 17:48:49,073 requests_cache.backends.sqlite sqlite.py DEBUG: Opening connection to /tmp/pywaterinfo_cache.sqlite:redirects
frievb commented 1 year ago

The versions of the packages:

Name                    Version        
...       
pywaterinfo               0.6.0          
pyyaml                    6.0            
qt                        5.12.9         
rasterio                  1.2.10         
requests                  2.28.1         
requests-cache            1.0.0          
...

The output of a successful run looks slightly different than a failed run, but looks still slightly different than your output:

2023-04-03 08:50:04,268 requests_cache.backends __init__.py DEBUG: Initializing backend: sqlite pywaterinfo_cache.sqlite
2023-04-03 08:50:04,269 requests_cache.backends.base base.py DEBUG: Initialized SQLiteDict with serializer: SerializerPipeline(name=pickle, n_stages=2)
2023-04-03 08:50:04,273 requests_cache.backends.sqlite sqlite.py DEBUG: Opening connection to C:\Users\USER\AppData\Local\Temp\pywaterinfo_cache.sqlite:responses
2023-04-03 08:50:04,318 requests_cache.backends.base base.py DEBUG: Initialized SQLiteDict with serializer: None
2023-04-03 08:50:04,322 requests_cache.backends.sqlite sqlite.py DEBUG: Opening connection to C:\Users\USER\AppData\Local\Temp\pywaterinfo_cache.sqlite:redirects
2023-04-03 08:50:04,364 urllib3.connectionpool connectionpool.py DEBUG: Starting new HTTP connection (1): download.waterinfo.be:80
2023-04-03 08:50:04,439 urllib3.connectionpool connectionpool.py DEBUG: http://download.waterinfo.be:80 "POST /kiwis-auth/token HTTP/1.1" 201 549
2023-04-03 08:50:04,445 root waterinfo.py INFO: Current token expires on 2023-04-04 08:50:04.445426
2023-04-03 08:50:04,454 requests_cache.policy.actions actions.py DEBUG: Cache directives from request headers: CacheDirectives()
2023-04-03 08:50:04,454 requests_cache.policy.actions actions.py DEBUG: Pre-read cache checks: Passed
2023-04-03 08:50:04,456 requests_cache.policy.actions actions.py DEBUG: Post-read cache actions: CacheActions(expire_after=datetime.timedelta(days=7), send_request=True)
2023-04-03 08:50:04,463 urllib3.connectionpool connectionpool.py DEBUG: Starting new HTTPS connection (1): download.waterinfo.be:443
2023-04-03 08:50:04,644 urllib3.connectionpool connectionpool.py DEBUG: https://download.waterinfo.be:443 "GET /tsmdownload/KiWIS/KiWIS?request=getRequestInfo&service=kisters&type=QueryServices&format=json&datasource=1&timezone=UTC HTTP/1.1" 200 12904
2023-04-03 08:50:04,646 requests_cache.policy.actions actions.py DEBUG: Pre-write cache checks: Passed
2023-04-03 08:50:04,667 root waterinfo.py INFO: Successful waterinfo API request with call https://download.waterinfo.be/tsmdownload/KiWIS/KiWIS?request=getRequestInfo&service=kisters&type=QueryServices&format=json&datasource=1&timezone=UTC
2023-04-03 08:50:04,672 requests_cache.backends.base base.py INFO: Resetting expiration with: 7 days, 0:00:00
2023-04-03 08:50:04,702 requests_cache.backends.base base.py DEBUG: Deleting up to 0 responses
2023-04-03 08:50:04,704 requests_cache.backends.base base.py INFO: Clearing all items from the cache
2023-04-03 08:50:04,717 requests_cache.backends.sqlite sqlite.py DEBUG: Opening connection to C:\Users\USER\AppData\Local\Temp\pywaterinfo_cache.sqlite:responses 
2023-04-03 08:50:04,762 requests_cache.backends.sqlite sqlite.py DEBUG: Opening connection to C:\Users\USER\AppData\Local\Temp\pywaterinfo_cache.sqlite:redirects 
Beramos commented 1 year ago

I also ran into database locking issues when running simultaneous WaterInfo instances because SQLite does not support concurrent writing. So when two WaterInfo instances try to store a response in the cache, one of the instances will run into a filelock on the SQLite database, this locks seems to not be released properly. So the SQLite cache remains locked and the next time while trying to instantiate a new WaterInfo object, pywaterinfo is unable to connect to the cache in the constructor and keeps retrying but it is never able to obtain the lock. Acces to the SQLite cache is probably triggered at this line.

stijnvanhoey commented 1 year ago

(back from holiday period)

Although I'm not able to reproduce the specific error (edited but it seems the behavior is also now effecting the tox-based setup when running the unit tests) of @frievb in a controlled setup, the usage of concurrent writing to the sqlite dbase has some limitations. I propose to do the following adjustments:

@Beramos and @frievb I'll ping you in a PR I'll make asap.

stijnvanhoey commented 1 year ago

@frievb and @Beramos I checked this further and the stale behavior issue happens when I use requests-cache > 1.x.x, but not with older versions of requests-cache.

Could you have a check if this works also for you (as this would be a workaround in the meanwhile):

We"ll make sure it is compatible with the new version as well, but would be good to trace back the issue.

Beramos commented 1 year ago

@frievb and @Beramos I checked this further and the stale behavior issue happens when I use requests-cache > 1.x.x, but not with older versions of requests-cache.

Could you have a check if this works also for you (as this would be a workaround in the meanwhile):

  • install older version (<1.0.0) of requests-cache, e;g. pip install requests-cache=0.9.8 in the environment where pywaterinfo was installed
  • rerun your code and check it the stale behavior is still happening?

We"ll make sure it is compatible with the new version as well, but would be good to trace back the issue.

I can confirm that downgrading request-cache to 0.9.8 resolves my problem with consecutive unit tests (a Waterinfo object is instantiated in each tests) locking the sqlite database.

stijnvanhoey commented 1 year ago

@Beramos and @frievb see https://github.com/fluves/pywaterinfo/pull/61 for an updated version which should support the latest version of requests-cache as well. Would be good if you could test the version of that PR (clone the repo > switch to SVH-cache-refactor branch, pip install -e .[cache] > run code to check behavior) and provide feedback on the PR.

Beramos commented 1 year ago

@stijnvanhoey, I've tested this and non-caching works and when installing requests_cachen it seems that the locking issues have been resolved.

frievb commented 1 year ago

@stijnvanhoey, I also tested both the workaround by downgrading requests-cache and the updated version (when caching was not activated). No more problems occurred in both cases. Thanks for the solving the issue!

stijnvanhoey commented 1 year ago

tackled by #61, closing this issue