geopython / GeoHealthCheck

Service Status and QoS Checker for OGC Web Services
https://geohealthcheck.org
MIT License

Status is reversed in email notifications #395

Closed justb4 closed 3 years ago

justb4 commented 3 years ago

Describe the bug
A bit difficult to describe, but suddenly email notifications whose Subject reports one status, e.g. [GeoQoS GeoHealthCheck - Demo] Failing: IDE_Catastro, have an email body describing an OK status, and vice versa: Fixed: IDE_Catastro has a failed status in the email body. It looks like the Run information is not aligned. See the screenshot below.

To Reproduce
Steps to reproduce the behavior, e.g.:

  1. Add a Resource on a GHC instance
  2. Let it fail
  3. Let it be fixed
  4. Observe reversed status emails

This was observed in the demo.geohealthcheck.org instance.

Expected Behavior
The Subject status should match the email body status.

Screenshots or Logfiles
Best is to show a recent screenshot from the demo.geohealthcheck.org instance:

[Screenshot: email notifications from demo.geohealthcheck.org with Subject and body reporting opposite statuses]

Context (please complete the following information):

Additional context
Several PRs were merged on August 30, 2021; one of these could be the culprit. I suspect it has to do with "Last Run". PRs: #390 #393 #394 #385. @fsteggink @jochemthart1 any idea? Also related to #392?

justb4 commented 3 years ago

I mostly suspect PR #394 as it redefines a Resource's last_run....

justb4 commented 3 years ago

Dived in deeper: the status in the Subject line is OK, but the email body always contains information (date+time and more) from the last Run minus one, i.e. the Run before the last Run...

justb4 commented 3 years ago

In healthcheck.py a new Run object is created, added and committed. Probably the 'cache' using Resource.runs needs to be cleared with delattr()...
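For illustration, a minimal sketch (not the actual GHC code; the attribute and cache names are assumptions) of how an instance-level cache of last_run goes stale after a new Run is added, and how delattr() forces a recompute:

```python
from datetime import datetime


class Resource:
    """Simplified stand-in for the GHC Resource model; all names here are
    assumptions for illustration, not the actual GHC attributes."""

    def __init__(self):
        self._runs = []  # in GHC this would be a SQLAlchemy relationship

    def add_run(self, run):
        self._runs.append(run)

    @property
    def last_run(self):
        # Instance-level cache: computed once, reused on later access.
        if not hasattr(self, '_last_run_cache'):
            self._last_run_cache = self._runs[-1] if self._runs else None
        return self._last_run_cache


resource = Resource()
resource.add_run(('failing', datetime.utcnow()))
print(resource.last_run)   # caches the failing run

resource.add_run(('ok', datetime.utcnow()))
print(resource.last_run)   # still the failing run: the cache is now stale

# Clearing the cached attribute, as suggested above, forces a recompute:
if hasattr(resource, '_last_run_cache'):
    delattr(resource, '_last_run_cache')
print(resource.last_run)   # now the latest ('ok') run
```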

fsteggink commented 3 years ago

My prime suspect would also be the caching of run results in #394... I can't imagine this would have anything to do with restricting the version of the tzlocal package, DB indexes or anything in the scheduler.

fsteggink commented 3 years ago

By the way, I can imagine that it is attractive to revert #394, but when you have a lot of resources, this will make GHC less usable. Is it possible to deploy a branch with #394 reverted to the test server?

fsteggink commented 3 years ago

The way I've implemented #394, the cache is attached to the instance, so when the Resource instance is deleted and info about the resource (even the same one) is obtained again, the cached data from the previous instance should not be present. Is the same Resource instance reused over and over in healthcheck.py?

fsteggink commented 3 years ago

By the way, I see a small optimization which can be made to run_resource in healthcheck.py. A resource ID is passed, and then the resource is queried from the DB. This function is called in two places, but in both places the resource object itself is available. If the resource object is passed to run_resource instead of its ID, then this will save a DB query.
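A hedged sketch of that optimization; the function bodies, the db_session parameter and the execute_probes helper are placeholders standing in for the real model and probe logic, not GHC's actual code:

```python
def run_resource_by_id(db_session, resource_id):
    """Current pattern (simplified): only the id is passed, so the Resource
    has to be re-queried even though both callers already hold the object."""
    resource = db_session.query(Resource).get(resource_id)  # extra DB query
    return execute_probes(resource)


def run_resource(resource):
    """Proposed pattern: pass the Resource object itself and save that query."""
    return execute_probes(resource)
```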

fsteggink commented 3 years ago

The line run1 = Run(resource, result, datetime.utcnow()) adds a new Run record to the DB, so if resource.last_run is not reset, it will point to the run before that. Either delattr can be used, or the resource should be requeried (which would actually nullify the optimization I just proposed...).

However, neither option is ideal. Having to call delattr means you need knowledge of the inner workings of the Resource class. Also, "knowing" that you have to requery implies the same, and this can cause problems in the future. So the best solution would be for the cache attributes to be deleted at the point where we know they are outdated.
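One way to encapsulate that, sketched with assumed names (not the actual GHC implementation): let the model clear its own cached attributes, and call that from the code path that commits a new Run.

```python
class Resource:
    """Sketch only; the real GHC model has SQLAlchemy mappings and more fields."""

    # instance attributes used as caches; names assumed for illustration
    _CACHE_ATTRS = ('_last_run_cache', '_first_run_cache', '_runs_cache')

    def clear_run_cache(self):
        """Drop cached run-derived attributes so the next access re-queries."""
        for attr in self._CACHE_ATTRS:
            if hasattr(self, attr):
                delattr(self, attr)


# healthcheck.py (sketch): after committing the new Run
#   db.session.add(Run(resource, result, datetime.utcnow()))
#   db.session.commit()
#   resource.clear_run_cache()   # callers need no knowledge of cache internals
```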

fsteggink commented 3 years ago

The actual "problem" sits in the notification_email.txt template which is used by do_email in notifications.py Although both resource and run are passed to the template, resource.last_run is used to get the datetime and message, The web hook (do_webhook) doesn't use resource.last_run, but it only uses uncached resource properties.

justb4 commented 3 years ago

The actual "problem" sits in the notification_email.txt template which is used by do_email in notifications.py Although both resource and run are passed to the template, resource.last_run is used to get the datetime and message, The web hook (do_webhook) doesn't use resource.last_run, but it only uses uncached resource properties.

Aha, then this could be fixed easily. I'll first do that and get back on the other remarks.

From what I learned: the latest versions of SQLAlchemy (v2 and up, I think) have caching features built in. Our SQLAlchemy==1.3.8 and Flask-SQLAlchemy==2.4.0 are really old, but upgrades failed earlier and will at least take time. In general I would prefer a solution via SQLAlchemy.

justb4 commented 3 years ago

Closing. I can confirm that the fix in PR #398 works. I don't think webhooks need fixing, but let's still review wherever these cached attributes are used. It is hard to test all scenarios.

fsteggink commented 3 years ago

I've created a new issue (#399) so we don't forget to check the caching features in the new SQLAlchemy version.

Regarding checking the scenarios: it can be done relatively quickly by checking for usage of last_run, etc., but this does not prevent new mistakes from being made. However, the problem only matters when the Resource object is kept around while the runs are being updated, and when it's important to have current information.

justb4 commented 3 years ago

OK, good! Rethinking PR #398: that was probably the only place where the cached attributes did not align, as there are only two places where new Run objects are created. One was tackled; the other is in the WebUI, where a Resource's first Run is added when a new Resource is created.

Most of the use is within the WebUI, which also now feels much more responsive! I'm not sure how long the Resource objects live, but the "Runner" processes could in the meantime add new Runs, so the cache gets out of sync.

We do a similar thing for GetCapabilities in Probe editing, to prevent the entire Capabilities doc from being fetched on each Edit action, e.g. to show a WFS FeatureTypes list. We defined a timeout for that cache, which is even configurable: GHC_METADATA_CACHE_SECS, default 900 seconds, as the OWS endpoint may have changes in Layers etc.

Still, it all depends on the lifetime of a Resource object, e.g. whether SQLAlchemy does something smart when it is fetched by id... So even with #399 we may run into issues (WebUI and Runner state out of sync), but let's continue the discussion there.
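As a rough sketch of that time-bounded caching pattern (the config name comes from the comment above; everything else is illustrative, not GHC's actual code):

```python
import time

GHC_METADATA_CACHE_SECS = 900  # configurable timeout mentioned above, default 900s

_caps_cache = {}  # url -> (fetched_at, capabilities_doc)


def get_capabilities_cached(url, fetch_func):
    """Return a cached Capabilities document for url, re-fetching it when the
    cached copy is older than GHC_METADATA_CACHE_SECS. Illustrative only."""
    now = time.time()
    entry = _caps_cache.get(url)
    if entry is None or now - entry[0] > GHC_METADATA_CACHE_SECS:
        _caps_cache[url] = (now, fetch_func(url))
    return _caps_cache[url][1]
```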