Open epag opened 2 months ago
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2018-06-20T18:09:39Z
Seems like something that would make the user experience less annoying until we decide upon a final approach to the UI.
Hank
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2018-07-06T17:50:08Z
This is pretty opaquely written (I see many of my tickets are like that). I think it means this:
1) Any and all service calls should return 500 when there is an internal system error, such as database connectivity being down, broker connectivity being down, or a null pointer exception.
2) When we visit the web GUI, maybe there should be a health check service that can test all of this connectivity.
Without a web GUI I'm not sure how to solve this ticket. Right now the service endpoints themselves are exposed to the browser and aren't called asynchronously from a GUI. And I think the services already return the 500 correctly.
I guess it could be partially resolved with a new health check service that tests connectivity between all components.
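A minimal sketch of what such a health check service could look like, assuming each component (broker, persister, and so on) can be probed with a simple yes/no connectivity test. The class and method names here are hypothetical and are not taken from the WRES code; the probes in @main@ are stand-ins for real connection attempts.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Hypothetical aggregate health check; names are illustrative, not from WRES. */
final class HealthCheckSketch {

    /** Runs each named probe once; a probe that throws counts as "down". */
    static Map<String, Boolean> runProbes(Map<String, BooleanSupplier> probes) {
        Map<String, Boolean> results = new LinkedHashMap<>();
        for (Map.Entry<String, BooleanSupplier> probe : probes.entrySet()) {
            boolean up;
            try {
                up = probe.getValue().getAsBoolean();
            } catch (RuntimeException e) {
                up = false;
            }
            results.put(probe.getKey(), up);
        }
        return results;
    }

    /** 200 when every component is reachable, otherwise 500 as this ticket proposes. */
    static int httpStatus(Map<String, Boolean> results) {
        return results.containsValue(false) ? 500 : 200;
    }

    public static void main(String[] args) {
        Map<String, BooleanSupplier> probes = new LinkedHashMap<>();
        // Stand-ins for real connection attempts (assumptions for illustration).
        probes.put("broker", () -> true);
        probes.put("persister", () -> {
            throw new IllegalStateException("connection refused");
        });
        Map<String, Boolean> results = runProbes(probes);
        System.out.println(httpStatus(results) + " " + results);
    }
}
```

Keeping the probes behind a plain functional interface means a new component can be added to the check without changing how the status code is derived.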
Original Redmine Comment Author Name: Hank (Hank) Original Date: 2019-07-25T17:33:27Z
Just flagging this as something that appears to impact the WRES GUI and for which there may be other tickets that could be related in the WRES GUI VLab project.
Hank
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-07-02T15:07:36Z
The previous commit didn't work well: a raw @String@ can't be passed, and an actual annotated REntity class is probably needed. Trying that in commit:5a6117f26 (see #93685#note-15)
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-07-02T16:39:31Z
Didn't work, trying again in commit:3a36ebdb2
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-07-02T16:59:37Z
Closer:
2021-07-02T16:59:21.754+0000 [main] ERROR wres.tasker.Tasker - Connectivity failure. Shutting down and exiting.
wres.tasker.WresJob$ConnectivityException: Failed to connect to redis at persister:6379
at wres.tasker.WresJob.getWresJob(WresJob.java:226)
at wres.tasker.Tasker.main(Tasker.java:92)
Caused by: java.lang.IllegalArgumentException: Cannot subclass primitive, array or final types: class wres.tasker.WresJob$DummyLiveObject
at net.bytebuddy.ByteBuddy.subclass(ByteBuddy.java:406)
at net.bytebuddy.ByteBuddy.subclass(ByteBuddy.java:379)
at net.bytebuddy.ByteBuddy.subclass(ByteBuddy.java:276)
at org.redisson.RedissonLiveObjectService.createProxy(RedissonLiveObjectService.java:774)
at org.redisson.RedissonLiveObjectService.registerClass(RedissonLiveObjectService.java:659)
at org.redisson.RedissonLiveObjectService.createLiveObject(RedissonLiveObjectService.java:108)
at org.redisson.RedissonLiveObjectService.attach(RedissonLiveObjectService.java:153)
at wres.tasker.WresJob.getWresJob(WresJob.java:217)
... 1 common frames omitted
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-07-02T17:12:59Z
Closer:
Exception in thread "main" java.lang.IllegalAccessError: class wres.tasker.WresJob$DummyLiveObject$ByteBuddy$pGBXTfUK cannot access its superclass wres.tasker.WresJob$DummyLiveObject (wres.tasker.WresJob$DummyLiveObject$ByteBuddy$pGBXTfUK is in unnamed module of loader net.bytebuddy.dynamic.loading.ByteArrayClassLoader @107f4980; wres.tasker.WresJob$DummyLiveObject is in unnamed module of loader 'app')
at java.base/java.lang.ClassLoader.defineClass1(Native Method)
at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
at net.bytebuddy.dynamic.loading.ByteArrayClassLoader.access$300(ByteArrayClassLoader.java:56)
at net.bytebuddy.dynamic.loading.ByteArrayClassLoader$ClassDefinitionAction.run(ByteArrayClassLoader.java:655)
at net.bytebuddy.dynamic.loading.ByteArrayClassLoader$ClassDefinitionAction.run(ByteArrayClassLoader.java:607)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at net.bytebuddy.dynamic.loading.ByteArrayClassLoader.findClass(ByteArrayClassLoader.java:376)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Class.java:398)
at net.bytebuddy.dynamic.loading.ByteArrayClassLoader.load(ByteArrayClassLoader.java:326)
at net.bytebuddy.dynamic.loading.ClassLoadingStrategy$Default$WrappingDispatcher.load(ClassLoadingStrategy.java:358)
at net.bytebuddy.dynamic.loading.ClassLoadingStrategy$Default.load(ClassLoadingStrategy.java:144)
at net.bytebuddy.dynamic.TypeResolutionStrategy$Passive.initialize(TypeResolutionStrategy.java:100)
at net.bytebuddy.dynamic.DynamicType$Default$Unloaded.load(DynamicType.java:6292)
at org.redisson.RedissonLiveObjectService.createProxy(RedissonLiveObjectService.java:830)
at org.redisson.RedissonLiveObjectService.registerClass(RedissonLiveObjectService.java:659)
at org.redisson.RedissonLiveObjectService.createLiveObject(RedissonLiveObjectService.java:108)
at org.redisson.RedissonLiveObjectService.attach(RedissonLiveObjectService.java:153)
at wres.tasker.WresJob.getWresJob(WresJob.java:217)
at wres.tasker.Tasker.main(Tasker.java:92)
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-07-02T17:29:02Z
Closer:
wres.tasker.WresJob$ConnectivityException: Failed to connect to redis at persister:6379
at wres.tasker.WresJob.getWresJob(WresJob.java:200)
at wres.tasker.Tasker.main(Tasker.java:92)
Caused by: java.lang.IllegalArgumentException: Can't find default constructor for class wres.tasker.DummyLiveObject$ByteBuddy$NJyHySyS
at org.redisson.RedissonLiveObjectService.instantiate(RedissonLiveObjectService.java:718)
at org.redisson.RedissonLiveObjectService.instantiateLiveObject(RedissonLiveObjectService.java:693)
at org.redisson.RedissonLiveObjectService.createLiveObject(RedissonLiveObjectService.java:110)
at org.redisson.RedissonLiveObjectService.attach(RedissonLiveObjectService.java:153)
at wres.tasker.WresJob.getWresJob(WresJob.java:191)
... 1 common frames omitted
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-07-02T17:42:00Z
Fifth time's a charm:
2021-07-02T17:40:31.547+0000 [main] INFO wres.tasker.WresJob - Successfully connected to broker at broker/172.19.254.195:5671
2021-07-02T17:40:31.655+0000 [main] INFO wres.tasker.WresJob - Successfully used live object service via persister:6379, got id dummyObjectId1625247631551
2021-07-02T17:40:31.655+0000 [main] INFO wres.tasker.Tasker - Up: I will take wres job requests and queue them.
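Reading the three failed attempts above, the error messages suggest that the class handed to the live object service had to be non-final, accessible to the generated proxy subclass, and constructible without arguments. The sketch below checks those three properties with plain reflection. It is an inference from the stack traces, not a statement of Redisson's documented requirements, and @ProxyableCheck@ is a hypothetical helper rather than anything in the WRES code. Treating "accessible" as "public" is a simplifying assumption.

```java
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-flight check for classes that will be proxied by subclassing. */
final class ProxyableCheck {

    /** Returns a description of each property that would likely block proxying. */
    static List<String> problems(Class<?> type) {
        List<String> found = new ArrayList<>();
        int modifiers = type.getModifiers();
        if (Modifier.isFinal(modifiers)) {
            found.add("class is final, so it cannot be subclassed");
        }
        if (!Modifier.isPublic(modifiers)) {
            found.add("class is not public, so a proxy from another loader may not access it");
        }
        try {
            // Looks for a constructor that takes no arguments.
            type.getDeclaredConstructor();
        } catch (NoSuchMethodException e) {
            found.add("class has no no-argument constructor");
        }
        return found;
    }

    public static void main(String[] args) {
        // JDK classes used purely as examples of each case.
        System.out.println("ArrayList: " + problems(java.util.ArrayList.class));
        System.out.println("String:    " + problems(String.class));
        System.out.println("Integer:   " + problems(Integer.class));
    }
}
```

A check like this could run in a unit test so that a later refactoring (for example, marking the class final again) fails at build time instead of at tasker startup.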
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-07-02T17:43:50Z
Health check is working when hitting the /job url:
2021-07-02T17:42:36.743+0000 [qtp1905280105-72] INFO wres.tasker.WresJob - Successfully connected to broker at broker/172.19.254.195:5671
2021-07-02T17:42:36.754+0000 [qtp1905280105-72] INFO wres.tasker.WresJob - Successfully used live object service via persister:6379, got id dummyObjectId1625247756749
2021-07-02T17:42:45.493+0000 [qtp1905280105-76] INFO wres.tasker.WresJob - Successfully connected to broker at broker/172.19.254.195:5671
2021-07-02T17:42:45.507+0000 [qtp1905280105-76] INFO wres.tasker.WresJob - Successfully used live object service via persister:6379, got id dummyObjectId1625247765498
2021-07-02T17:43:04.993+0000 [qtp1905280105-78] INFO wres.tasker.WresJob - Successfully connected to broker at broker/172.19.254.195:5671
2021-07-02T17:43:05.004+0000 [qtp1905280105-78] INFO wres.tasker.WresJob - Successfully used live object service via persister:6379, got id dummyObjectId1625247784996
2021-07-02T17:43:08.257+0000 [qtp1905280105-75] INFO wres.tasker.WresJob - Successfully connected to broker at broker/172.19.254.195:5671
2021-07-02T17:43:08.264+0000 [qtp1905280105-75] INFO wres.tasker.WresJob - Successfully used live object service via persister:6379, got id dummyObjectId1625247788261
2021-07-02T17:43:10.553+0000 [qtp1905280105-76] INFO wres.tasker.WresJob - Successfully connected to broker at broker/172.19.254.195:5671
2021-07-02T17:43:10.560+0000 [qtp1905280105-76] INFO wres.tasker.WresJob - Successfully used live object service via persister:6379, got id dummyObjectId1625247790556
2021-07-02T17:43:11.908+0000 [qtp1905280105-78] INFO wres.tasker.WresJob - Successfully connected to broker at broker/172.19.254.195:5671
2021-07-02T17:43:11.914+0000 [qtp1905280105-78] INFO wres.tasker.WresJob - Successfully used live object service via persister:6379, got id dummyObjectId1625247791911
2021-07-02T17:43:13.586+0000 [qtp1905280105-75] INFO wres.tasker.WresJob - Successfully connected to broker at broker/172.19.254.195:5671
2021-07-02T17:43:13.596+0000 [qtp1905280105-75] INFO wres.tasker.WresJob - Successfully used live object service via persister:6379, got id dummyObjectId1625247793589
Using @time@ and @curl@ on the same host that's serving -dev COWRES, it takes around 200ms for the health check overall when successful.
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2022-05-09T19:53:49Z
A while ago I created a checkmk HTTP COWRES check that calls this service check and alerts us when it is down. It is pointed at production.
Recently, checkmk could not be reached due to LDAP and/or certificate issues.
I see checkmk is back up and running and can be visited via web, but the alerts are not coming in. The last alert I see is from Patriot's Day (April 19) 2022, a few weeks ago.
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2022-05-09T19:58:36Z
So far it looks like this is done for these components:
But it is missing for these:
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2022-05-09T20:05:03Z
Linked commit:d64a19d228bbaf0069f27026e85cd2aed43b3cf0 because it has the broker connectivity check.
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2022-05-09T20:06:32Z
The tasker doesn't do anything with the database and therefore has no dependency on the database. It only indirectly communicates with the worker-shim, which launches WRES, which depends on the database. So I suppose the only way to check database connectivity, even indirectly, would be to run a smoke test job of some kind that runs @connecttodb@ or something like that.
Original Redmine Comment Author Name: James (James) Original Date: 2022-05-09T20:10:57Z
Kind of an aside, but the health check on the eventsbroker (for docker only, not check_mk, of course) is currently extremely rudimentary. It just curls the management console:
HEALTHCHECK CMD curl -f localhost:${BROKER_HTTP_PORT} || exit 1
I suppose this should be placed into a separate script that checks both ports: the one with the http(s) protocol and the one with the amqp(s) protocol. Perhaps there is also a more nuanced and/or more reliable check than the above, probably including retries.
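As a rough illustration of the "both ports, with retries" idea, here is a sketch that only tests whether a TCP connection can be opened to each port. It does not speak HTTP or AMQP, so it is weaker than a protocol-level check. The class name, the retry count, and the timeouts are assumptions made for the example, and the ports come from the command line rather than from any real broker configuration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical TCP reachability check with retries; not a protocol-level check. */
final class PortCheck {

    /** Tries to open a TCP connection up to {@code attempts} times. */
    static boolean isReachable(String host, int port, int attempts,
                               int timeoutMillis, long pauseMillis) {
        for (int attempt = 1; attempt <= attempts; attempt++) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), timeoutMillis);
                return true;
            } catch (IOException e) {
                // Connection failed; pause before the next attempt, if any remain.
                if (attempt < attempts) {
                    try {
                        Thread.sleep(pauseMillis);
                    } catch (InterruptedException interrupted) {
                        Thread.currentThread().interrupt();
                        return false;
                    }
                }
            }
        }
        return false;
    }

    /** Usage: java PortCheck HOST PORT [PORT...]; exits 1 if any port is unreachable. */
    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: PortCheck HOST PORT [PORT...]");
            System.exit(2);
        }
        boolean allUp = true;
        for (int i = 1; i < args.length; i++) {
            int port = Integer.parseInt(args[i]);
            boolean up = isReachable(args[0], port, 3, 2000, 1000L);
            System.out.println(args[0] + ":" + port + (up ? " up" : " down"));
            allUp = allUp && up;
        }
        System.exit(allUp ? 0 : 1);
    }
}
```

The non-zero exit status on failure matches what a docker HEALTHCHECK command expects, so a check of this shape could in principle replace the single curl call.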
Author Name: Jesse (Jesse) Original Redmine Issue: 50810, https://vlab.noaa.gov/redmine/issues/50810 Original Date: 2018-05-23
Not sure if this applies to running the WRES tool from the command line, running on a platform, or both. Assuming the platform.
Given a call to WRES m services (or a visit to a UI)
When the system cannot be successfully run for programmatically discernible reasons (internal or upstream or otherwise)
Then services (and UIs, transitively) should respond with a 500 and a friendly message indicating the system is unavailable
Nice to have: a link for more specific information for developer debugging could be included in the response. This would not show up in the UI, but could show up in the raw responses.
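One way to picture the response described here, as a sketch only: a small value object that carries the friendly message and an optional debugging link, where the UI text omits the link and the raw body includes it. The class, the message wording, and the placeholder URL are all hypothetical.

```java
import java.util.Optional;

/** Hypothetical model of the "system unavailable" response; names are illustrative. */
final class UnavailableResponse {
    static final int STATUS = 500;

    private final String friendlyMessage;
    private final Optional<String> debugLink;

    UnavailableResponse(String friendlyMessage, String debugLinkOrNull) {
        this.friendlyMessage = friendlyMessage;
        this.debugLink = Optional.ofNullable(debugLinkOrNull);
    }

    /** Text a UI would show: the friendly message only, never the link. */
    String uiText() {
        return friendlyMessage;
    }

    /** Raw response body: the friendly message plus the debug link when present. */
    String rawBody() {
        return debugLink
                .map(link -> friendlyMessage + "\nMore information: " + link)
                .orElse(friendlyMessage);
    }

    public static void main(String[] args) {
        UnavailableResponse response = new UnavailableResponse(
                "The system is currently unavailable. Please try again later.",
                "https://example.invalid/debug/placeholder"); // placeholder link
        System.out.println(STATUS);
        System.out.println(response.rawBody());
    }
}
```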