NOAA-OWP / wres

Code and scripts for the Water Resources Evaluation Service

As a user, WRES should expose a health check service that tests connectivity to most or all components and dependencies #180

Open epag opened 2 months ago

epag commented 2 months ago

Author Name: Jesse (Jesse) Original Redmine Issue: 50810, https://vlab.noaa.gov/redmine/issues/50810 Original Date: 2018-05-23


Not sure whether this applies when running the WRES tool from the command line, when running on a platform, or both. Assuming the platform.

Given a call to WRES services (or a visit to a UI)
When the system cannot be run successfully for programmatically discernible reasons (internal, upstream, or otherwise)
Then services (and UIs, transitively) should respond with a 500 and a friendly message indicating the system is unavailable

Nice to have: a link for more specific information for developer debugging could be included in the response. This would not show up in the UI, but could show up in the raw responses.

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2018-06-20T18:09:39Z


Seems like something that would make the user experience less annoying until we decide upon a final approach to the UI.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2018-07-06T17:50:08Z


This is pretty opaquely written (I see many of my tickets are like that). I think it means this:

1) Any and all service calls should return 500 when there is an internal system error, such as database connectivity being down, broker connectivity being down, or a null pointer exception.
2) When we visit the web GUI, maybe there should be a health check service that can test all of this connectivity.

Without a web GUI I'm not sure how to solve this ticket. Right now the service endpoints themselves are exposed to the browser and aren't called asynchronously from a GUI. And I think the services are doing the 500 correctly.

I guess it could be partially resolved with a new health check service that tests connectivity between all components.
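A minimal sketch of what such an aggregate health check could look like, in plain Java. Everything named here (`HealthCheckSketch`, `register`, `check`, the component labels) is hypothetical and not taken from the WRES code; each probe stands in for a real connectivity test against a component such as the broker or the persister. The idea is to answer 200 only when every probe passes, and otherwise a 500 with a friendly message that names what could not be reached.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Hypothetical aggregate health check; illustrative only, not WRES code. */
final class HealthCheckSketch {

    /** A status code plus a plain-text body, standing in for an HTTP response. */
    static final class Result {
        final int status;
        final String body;

        Result(int status, String body) {
            this.status = status;
            this.body = body;
        }
    }

    // Probes are run in registration order.
    private final Map<String, BooleanSupplier> probes = new LinkedHashMap<>();

    /** Registers a named probe that returns true when the component is reachable. */
    HealthCheckSketch register(String component, BooleanSupplier probe) {
        probes.put(component, probe);
        return this;
    }

    /** Runs every probe and summarizes the outcome as 200 or 500. */
    Result check() {
        StringBuilder failed = new StringBuilder();
        for (Map.Entry<String, BooleanSupplier> entry : probes.entrySet()) {
            boolean up;
            try {
                up = entry.getValue().getAsBoolean();
            } catch (RuntimeException e) {
                // A probe that throws is treated the same as a failed probe.
                up = false;
            }
            if (!up) {
                if (failed.length() > 0) {
                    failed.append(", ");
                }
                failed.append(entry.getKey());
            }
        }
        if (failed.length() == 0) {
            return new Result(200, "OK");
        }
        return new Result(500,
                "The system is currently unavailable. Unreachable: " + failed);
    }
}
```

A caller would register one probe per dependency, for example `new HealthCheckSketch().register("broker", brokerProbe).register("persister", redisProbe).check()`, where `brokerProbe` and `redisProbe` are placeholders for real checks, and then map the `Result` onto the service response.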

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2019-07-25T17:33:27Z


Just flagging this as something that appears to impact the WRES GUI and for which there may be other tickets that could be related in the WRES GUI VLab project.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-07-02T15:07:36Z


The previous commit didn't work well: a raw @String@ can't be passed, so an actual annotated @REntity@ class is probably needed. Trying that in commit:5a6117f26 (see #93685#note-15).

epag commented 2 months ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-07-02T16:39:31Z


Didn't work, trying again in commit:3a36ebdb2

epag commented 2 months ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-07-02T16:59:37Z


Closer:

2021-07-02T16:59:21.754+0000 [main] ERROR wres.tasker.Tasker - Connectivity failure. Shutting down and exiting.
wres.tasker.WresJob$ConnectivityException: Failed to connect to redis at persister:6379
        at wres.tasker.WresJob.getWresJob(WresJob.java:226)
        at wres.tasker.Tasker.main(Tasker.java:92)
Caused by: java.lang.IllegalArgumentException: Cannot subclass primitive, array or final types: class wres.tasker.WresJob$DummyLiveObject
        at net.bytebuddy.ByteBuddy.subclass(ByteBuddy.java:406)
        at net.bytebuddy.ByteBuddy.subclass(ByteBuddy.java:379)
        at net.bytebuddy.ByteBuddy.subclass(ByteBuddy.java:276)
        at org.redisson.RedissonLiveObjectService.createProxy(RedissonLiveObjectService.java:774)
        at org.redisson.RedissonLiveObjectService.registerClass(RedissonLiveObjectService.java:659)
        at org.redisson.RedissonLiveObjectService.createLiveObject(RedissonLiveObjectService.java:108)
        at org.redisson.RedissonLiveObjectService.attach(RedissonLiveObjectService.java:153)
        at wres.tasker.WresJob.getWresJob(WresJob.java:217)
        ... 1 common frames omitted
epag commented 2 months ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-07-02T17:12:59Z


Closer:

Exception in thread "main" java.lang.IllegalAccessError: class wres.tasker.WresJob$DummyLiveObject$ByteBuddy$pGBXTfUK cannot access its superclass wres.tasker.WresJob$DummyLiveObject (wres.tasker.WresJob$DummyLiveObject$ByteBuddy$pGBXTfUK is in unnamed module of loader net.bytebuddy.dynamic.loading.ByteArrayClassLoader @107f4980; wres.tasker.WresJob$DummyLiveObject is in unnamed module of loader 'app')
        at java.base/java.lang.ClassLoader.defineClass1(Native Method)
        at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
        at net.bytebuddy.dynamic.loading.ByteArrayClassLoader.access$300(ByteArrayClassLoader.java:56)
        at net.bytebuddy.dynamic.loading.ByteArrayClassLoader$ClassDefinitionAction.run(ByteArrayClassLoader.java:655)
        at net.bytebuddy.dynamic.loading.ByteArrayClassLoader$ClassDefinitionAction.run(ByteArrayClassLoader.java:607)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at net.bytebuddy.dynamic.loading.ByteArrayClassLoader.findClass(ByteArrayClassLoader.java:376)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
        at java.base/java.lang.Class.forName0(Native Method)
        at java.base/java.lang.Class.forName(Class.java:398)
        at net.bytebuddy.dynamic.loading.ByteArrayClassLoader.load(ByteArrayClassLoader.java:326)
        at net.bytebuddy.dynamic.loading.ClassLoadingStrategy$Default$WrappingDispatcher.load(ClassLoadingStrategy.java:358)
        at net.bytebuddy.dynamic.loading.ClassLoadingStrategy$Default.load(ClassLoadingStrategy.java:144)
        at net.bytebuddy.dynamic.TypeResolutionStrategy$Passive.initialize(TypeResolutionStrategy.java:100)
        at net.bytebuddy.dynamic.DynamicType$Default$Unloaded.load(DynamicType.java:6292)
        at org.redisson.RedissonLiveObjectService.createProxy(RedissonLiveObjectService.java:830)
        at org.redisson.RedissonLiveObjectService.registerClass(RedissonLiveObjectService.java:659)
        at org.redisson.RedissonLiveObjectService.createLiveObject(RedissonLiveObjectService.java:108)
        at org.redisson.RedissonLiveObjectService.attach(RedissonLiveObjectService.java:153)
        at wres.tasker.WresJob.getWresJob(WresJob.java:217)
        at wres.tasker.Tasker.main(Tasker.java:92)
epag commented 2 months ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-07-02T17:29:02Z


Closer:

wres.tasker.WresJob$ConnectivityException: Failed to connect to redis at persister:6379
        at wres.tasker.WresJob.getWresJob(WresJob.java:200)
        at wres.tasker.Tasker.main(Tasker.java:92)
Caused by: java.lang.IllegalArgumentException: Can't find default constructor for class wres.tasker.DummyLiveObject$ByteBuddy$NJyHySyS
        at org.redisson.RedissonLiveObjectService.instantiate(RedissonLiveObjectService.java:718)
        at org.redisson.RedissonLiveObjectService.instantiateLiveObject(RedissonLiveObjectService.java:693)
        at org.redisson.RedissonLiveObjectService.createLiveObject(RedissonLiveObjectService.java:110)
        at org.redisson.RedissonLiveObjectService.attach(RedissonLiveObjectService.java:153)
        at wres.tasker.WresJob.getWresJob(WresJob.java:191)
        ... 1 common frames omitted
epag commented 2 months ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-07-02T17:42:00Z


Fifth time's a charm:

2021-07-02T17:40:31.547+0000 [main] INFO wres.tasker.WresJob - Successfully connected to broker at broker/172.19.254.195:5671
2021-07-02T17:40:31.655+0000 [main] INFO wres.tasker.WresJob - Successfully used live object service via persister:6379, got id dummyObjectId1625247631551
2021-07-02T17:40:31.655+0000 [main] INFO wres.tasker.Tasker - Up: I will take wres job requests and queue them.
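The three failed attempts above each point at a different property that the proxied class apparently needs: the first trace complains about a final type, the second about a superclass the generated proxy cannot access, and the third about a missing default constructor. The following is a sketch, using only reflection from the JDK, of checking those properties up front. `ProxyableCheck` and `problems` are hypothetical names, this is a reading of the stack traces rather than a statement of Redisson's documented contract, and it does not check the annotations that a live object class also needs.

```java
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-check for classes that will be subclassed by a proxy generator. */
final class ProxyableCheck {

    private ProxyableCheck() {
        // Static helper only.
    }

    /** Returns a list of problems; an empty list means none of the three were found. */
    static List<String> problems(Class<?> type) {
        List<String> problems = new ArrayList<>();
        int modifiers = type.getModifiers();

        // Final classes cannot be subclassed (first trace).
        if (Modifier.isFinal(modifiers)) {
            problems.add("class is final");
        }

        // A non-public class may be inaccessible to a proxy defined by
        // another class loader (second trace).
        if (!Modifier.isPublic(modifiers)) {
            problems.add("class is not public");
        }

        // No no-argument constructor declared on the class (third trace).
        try {
            type.getDeclaredConstructor();
        } catch (NoSuchMethodException e) {
            problems.add("no no-argument constructor");
        }
        return problems;
    }
}
```

For example, `ProxyableCheck.problems(String.class)` reports that the class is final, while a public, non-final class with a no-argument constructor such as `java.util.ArrayList` yields an empty list.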
epag commented 2 months ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-07-02T17:43:50Z


Health check is working when hitting the /job url:

2021-07-02T17:42:36.743+0000 [qtp1905280105-72] INFO wres.tasker.WresJob - Successfully connected to broker at broker/172.19.254.195:5671
2021-07-02T17:42:36.754+0000 [qtp1905280105-72] INFO wres.tasker.WresJob - Successfully used live object service via persister:6379, got id dummyObjectId1625247756749
2021-07-02T17:42:45.493+0000 [qtp1905280105-76] INFO wres.tasker.WresJob - Successfully connected to broker at broker/172.19.254.195:5671
2021-07-02T17:42:45.507+0000 [qtp1905280105-76] INFO wres.tasker.WresJob - Successfully used live object service via persister:6379, got id dummyObjectId1625247765498
2021-07-02T17:43:04.993+0000 [qtp1905280105-78] INFO wres.tasker.WresJob - Successfully connected to broker at broker/172.19.254.195:5671
2021-07-02T17:43:05.004+0000 [qtp1905280105-78] INFO wres.tasker.WresJob - Successfully used live object service via persister:6379, got id dummyObjectId1625247784996
2021-07-02T17:43:08.257+0000 [qtp1905280105-75] INFO wres.tasker.WresJob - Successfully connected to broker at broker/172.19.254.195:5671
2021-07-02T17:43:08.264+0000 [qtp1905280105-75] INFO wres.tasker.WresJob - Successfully used live object service via persister:6379, got id dummyObjectId1625247788261
2021-07-02T17:43:10.553+0000 [qtp1905280105-76] INFO wres.tasker.WresJob - Successfully connected to broker at broker/172.19.254.195:5671
2021-07-02T17:43:10.560+0000 [qtp1905280105-76] INFO wres.tasker.WresJob - Successfully used live object service via persister:6379, got id dummyObjectId1625247790556
2021-07-02T17:43:11.908+0000 [qtp1905280105-78] INFO wres.tasker.WresJob - Successfully connected to broker at broker/172.19.254.195:5671
2021-07-02T17:43:11.914+0000 [qtp1905280105-78] INFO wres.tasker.WresJob - Successfully used live object service via persister:6379, got id dummyObjectId1625247791911
2021-07-02T17:43:13.586+0000 [qtp1905280105-75] INFO wres.tasker.WresJob - Successfully connected to broker at broker/172.19.254.195:5671
2021-07-02T17:43:13.596+0000 [qtp1905280105-75] INFO wres.tasker.WresJob - Successfully used live object service via persister:6379, got id dummyObjectId1625247793589

Using @time@ and @curl@ on the same host that's serving -dev COWRES, it takes around 200ms for the health check overall when successful.

epag commented 2 months ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2022-05-09T19:53:49Z


A while ago I created a checkmk HTTP COWRES check that calls this service check and alerts us when it is down. It is pointed at production.

Recently, checkmk could not be reached due to LDAP and/or certificate issues.

I see checkmk is back up and running and can be visited via web, but the alerts are not coming in. The last alert I see is from Patriot's Day (April 19) 2022, a few weeks ago.

epag commented 2 months ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2022-05-09T19:58:36Z


So far it looks like this is done for these components:

But it is missing for these:

epag commented 2 months ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2022-05-09T20:05:03Z


Linked commit:d64a19d228bbaf0069f27026e85cd2aed43b3cf0 because it has the broker connectivity check.

epag commented 2 months ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2022-05-09T20:06:32Z


The tasker doesn't do anything with the database and therefore has no dependency on the database. It only communicates indirectly with the worker-shim, which launches WRES, which depends on the database. So I suppose the only way to check database connectivity, even indirectly, would be to run a smoke test job of some kind that runs @connecttodb@ or something like that.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-05-09T20:10:57Z


Kind of an aside, but the health check on the eventsbroker (for docker only, not check_mk, of course) is currently extremely rudimentary: it just curls the management console.

HEALTHCHECK CMD curl -f localhost:${BROKER_HTTP_PORT} || exit 1

I suppose this should be moved into a separate script that checks both ports: the one with the http(s) protocol and the one with the amqp(s) protocol. Perhaps there is also a more nuanced and/or more reliable check than the above, probably including retries.
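As a rough illustration of the "both ports, with retries" idea, here is a sketch in plain Java rather than a shell script. The names `PortProbe`, `isOpen` and `isOpenWithRetries` are hypothetical, and so are the timeout and retry values. It only verifies that something accepts a TCP connection on the port, so it is weaker than a real http(s) request or amqp(s) handshake and would be a starting point at most.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical TCP-level port probe with retries; not part of WRES. */
final class PortProbe {

    private PortProbe() {
        // Static helper only.
    }

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean isOpen(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolvable: treat all as "not open".
            return false;
        }
    }

    /** Tries up to 'attempts' times, pausing between failed attempts. */
    static boolean isOpenWithRetries(String host, int port, int attempts, long pauseMillis) {
        for (int attempt = 1; attempt <= attempts; attempt++) {
            if (isOpen(host, port, 1000)) {
                return true;
            }
            if (attempt < attempts) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    // Preserve the interrupt and give up early.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

A check of both broker ports would then be the conjunction of two calls, one per port, with the port numbers supplied from the environment in the same way the existing check reads `BROKER_HTTP_PORT`.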