NOAA-OWP / wres

Code and scripts for the Water Resources Evaluation Service

As a user, I want improved resilience when recoverable errors occur in messaging components during an evaluation #247

Open epag opened 2 months ago

epag commented 2 months ago

Author Name: James (James) Original Redmine Issue: 87105, https://vlab.noaa.gov/redmine/issues/87105 Original Date: 2021-01-20 Original Assignee: James


Given an evaluation that is underway
When one or more messaging components experience errors that are recoverable, in principle
Then the evaluation should recover and succeed in as many situations as practicable


Redmine related issue(s): 90087, 92536, 119833, 121414


epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T18:36:58Z


From what I can tell, @dsniff@ is obsolete.

Jesse, have you used @tcpkill@ recently and, if so, where did you get it? I can't see anything relevant in the RHEL 8 UBI AppStream repository:

https://cdn-ubi.redhat.com/content/public/ubi/dist/ubi8/8/x86_64/appstream/os/Packages/

epag commented 2 months ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-05-28T18:38:12Z


It's been a couple of years, so it might be obsolete, yes.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T18:44:19Z


That's a shame because it sounds like it would be useful in this instance.

I might take a punt on setting a heartbeat in the connection URL, mainly because it won't do any harm, but it's a bit disappointing not to be able to test it. I wonder how others have tested it (I can't find much). The straightforward sad path involves either the client or the broker closing an AMQP connection and generating an exception, which won't test a network failure that causes a TCP/IP socket close/reset.
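For illustration only, here is a minimal sketch of what "a heartbeat in the connection URL" could look like. It assumes the legacy Apache Qpid AMQP 0-x client's @heartbeat@ broker option (a value in seconds) and reuses the host, port, credentials and client ID that appear in the logs in this thread. The class and method names are hypothetical, and the thread does not show where WRES actually sets the option.

```java
// Hypothetical helper, not part of WRES: builds a Qpid-style connection URL
// whose broker address carries a heartbeat option.
final class HeartbeatUrl {

    private HeartbeatUrl() {
    }

    /**
     * Appends heartbeat='<seconds>' to a single broker address such as
     * tcp://host:port. Assumes broker options follow '?' and are joined by '&'.
     */
    static String withHeartbeat(String brokerAddress, int seconds) {
        if (seconds < 0) {
            throw new IllegalArgumentException("Heartbeat must not be negative: " + seconds);
        }
        String separator = brokerAddress.contains("?") ? "&" : "?";
        return brokerAddress + separator + "heartbeat='" + seconds + "'";
    }

    /** Wraps one broker address in a connection URL with guest credentials. */
    static String connectionUrl(String clientId, String brokerAddress) {
        return "amqp://guest:guest@" + clientId + "/?brokerlist='" + brokerAddress + "'";
    }

    public static void main(String[] args) {
        String broker = withHeartbeat("tcp://eventsbroker:5673", 5);
        System.out.println(connectionUrl("wres-graphics", broker));
    }
}
```

The heartbeat is attached to the broker address rather than to the connection as a whole, so each broker in a list could in principle carry its own interval.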

epag commented 2 months ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-05-28T18:56:41Z


Surely there is another way to achieve the same goal. I don't know how hard it is. Does it involve compiling or getting a different kernel? I see some other tools mentioned out there but I don't have experience with them, nor do I have the ability to find and try tools using this laptop or VMs at NWC.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T19:06:06Z


Jesse wrote:

Surely there is another way to achieve the same goal. I don't know how hard it is. Does it involve compiling or getting a different kernel? I see some other tools mentioned out there but I don't have experience with them, nor do I have the ability to find and try tools using this laptop or VMs at NWC.

There are a few, but browsing Stack Overflow and other forums, the results appear to be very mixed indeed. @ss@ looks like the best option overall, but the RHEL 8 UBI is built with @CONFIG_INET_DIAG_DESTROY@ disabled. So, yes, it would involve building the kernel with that option enabled.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T20:14:13Z


Ah, this may be simpler than I thought.

https://stackoverflow.com/questions/56211818/how-to-disable-network-for-a-running-docker-container

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T20:16:13Z


https://docs.docker.com/engine/reference/commandline/network_disconnect/

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T20:19:32Z


Baseline: no heartbeat on the graphics client, disconnect the @eventsbroker@ container (@255cb9698f53@) from the @wres_wres_net@ bridge network:

$ docker network disconnect wres_wres_net 255cb9698f53

Graphics client is oblivious.

2021-05-28T20:13:17.574+0000 INFO GraphicsClient Finished creating WRES Graphics Client with subscriber identifier 24JAcVOgg7bhUzB-e_GVhCQ6Lac.
2021-05-28T20:13:17.578+0000 INFO GraphicsClient WRES Graphics client 24JAcVOgg7bhUzB-e_GVhCQ6Lac is running.
2021-05-28T20:13:17.579+0000 INFO GraphicsClient Evaluation subscriber 24JAcVOgg7bhUzB-e_GVhCQ6Lac is waiting for work. Until now, received 0 packets of statistics across 0 evaluations.
2021-05-28T20:14:17.579+0000 INFO GraphicsClient Evaluation subscriber 24JAcVOgg7bhUzB-e_GVhCQ6Lac is waiting for work. Until now, received 0 packets of statistics across 0 evaluations.
2021-05-28T20:15:17.579+0000 INFO GraphicsClient Evaluation subscriber 24JAcVOgg7bhUzB-e_GVhCQ6Lac is waiting for work. Until now, received 0 packets of statistics across 0 evaluations.
2021-05-28T20:16:17.579+0000 INFO GraphicsClient Evaluation subscriber 24JAcVOgg7bhUzB-e_GVhCQ6Lac is waiting for work. Until now, received 0 packets of statistics across 0 evaluations.
2021-05-28T20:17:17.579+0000 INFO GraphicsClient Evaluation subscriber 24JAcVOgg7bhUzB-e_GVhCQ6Lac is waiting for work. Until now, received 0 packets of statistics across 0 evaluations.
2021-05-28T20:18:17.579+0000 INFO GraphicsClient Evaluation subscriber 24JAcVOgg7bhUzB-e_GVhCQ6Lac is waiting for work. Until now, received 0 packets of statistics across 0 evaluations.
2021-05-28T20:19:17.579+0000 INFO GraphicsClient Evaluation subscriber 24JAcVOgg7bhUzB-e_GVhCQ6Lac is waiting for work. Until now, received 0 packets of statistics across 0 evaluations.
epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T20:24:34Z


Core client is not oblivious when I push an evaluation through (expected).

2021-05-28T20:24:17.055+0000 INFO BrokerConnectionFactory Retrying connection to amqp://guest:guest@wres-core/?brokerlist='tcp://eventsbroker:5673'&rejectbehaviour='server'&retries='5'&connectdelay='5000'&failover='nofailover' following 1 failed connection attempts. This is retry 1 of 5.
epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T20:31:10Z


Now, repeating the experiment with a 5 second heartbeat on the graphics client I see this. Nice!

2021-05-28T20:30:34.967+0000 WARN AMQProtocolHandler Timed out while waiting for heartbeat from peer.
2021-05-28T20:30:34.976+0000 ERROR EvaluationSubscriber Message subscriber 5LpIJMo8SSL4eyT_ps0ccpUqdec has been flagged as failed without the possibility of recovery.
wres.events.subscribe.UnrecoverableSubscriberException: Encountered an error on connection DJBJAfdGnTMjNIhkXjiYKhFQER0 owned by subscriber 5LpIJMo8SSL4eyT_ps0ccpUqdec. If a failover policy was configured on the connection factory (e.g., connection retries), then that policy was exhausted before this error was thrown. As such, the error is not recoverable and the subscriber will now stop.
at wres.events.subscribe.EvaluationSubscriber$ConnectionExceptionListener.onException(EvaluationSubscriber.java:1724)
at org.apache.qpid.client.AMQConnection$2.run(AMQConnection.java:1686)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: javax.jms.JMSException: Exception thrown against AMQConnection:
Host: eventsbroker
Port: 5673
Virtual Host:
Client ID: wres-graphics
Active session count: 1: org.apache.qpid.AMQDisconnectedException: Server closed connection and reconnection not permitted.
at org.apache.qpid.client.AMQConnection.convertToJMSException(AMQConnection.java:1627)
at org.apache.qpid.client.AMQConnection.closed(AMQConnection.java:1639)
at org.apache.qpid.client.AMQProtocolHandler.closed(AMQProtocolHandler.java:235)
at org.apache.qpid.client.AMQConnectionDelegate_8_0$ReceiverClosedWaiter.closed(AMQConnectionDelegate_8_0.java:563)
at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:225)
... 1 common frames omitted
Caused by: org.apache.qpid.AMQDisconnectedException: Server closed connection and reconnection not permitted.
at org.apache.qpid.client.AMQProtocolHandler.closed(AMQProtocolHandler.java:236)
... 3 common frames omitted
2021-05-28T20:30:35.036+0000 WARN AMQProtocolHandler Timed out while waiting for heartbeat from peer.
2021-05-28T20:30:35.038+0000 WARN AMQProtocolHandler Timed out while waiting for heartbeat from peer.
2021-05-28T20:30:35.046+0000 ERROR GraphicsClient While checking the graphics client for the health of its subscribers, discovered a failed subscriber with identifier 5LpIJMo8SSL4eyT_ps0ccpUqdec. The graphics client will now close.
2021-05-28T20:30:35.048+0000 INFO GraphicsClient Closing WRES Graphics Client 5LpIJMo8SSL4eyT_ps0ccpUqdec...
2021-05-28T20:30:35.049+0000 INFO GraphicsClient Closing broker connections wres.eventsbroker.BrokerConnectionFactory@16610890.
2021-05-28T20:30:35.049+0000 INFO BrokerConnectionFactory Closing broker connection factory wres.eventsbroker.BrokerConnectionFactory@16610890 and all associated broker connections.
2021-05-28T20:30:35.050+0000 INFO GraphicsClient Closed WRES Graphics Client 5LpIJMo8SSL4eyT_ps0ccpUqdec, which ran for 'PT1M45.017646S' and processed 0 packets of statistics across 0 evaluations.
2021-05-28T20:30:36.017+0000 INFO GraphicsClient WRES Graphics Client version 20210527-6509b62-dev
.
.
epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T20:33:51Z


So this is a pretty simple addition that improves resilience. It doesn't address the underlying issues associated with making the wres messaging applications more resilient (end goal), but it does allow a graphics client to exit promptly on a network failure and for a new graphics client to spawn and continue to accept work, thereby mitigating #92536.
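The exit-promptly behaviour described above can be sketched roughly as follows. This is not the WRES implementation: the @Subscriber@ and @Client@ classes are hypothetical stand-ins, and the sketch only assumes that a connection error (for example, a missed heartbeat) flags the subscriber as failed and that a periodic health check then closes the client. It leaves open how a replacement client gets started.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: a failed subscriber causes the owning client to close
// on its next health check instead of waiting indefinitely for work.
final class HealthCheckSketch {

    /** Stand-in for a subscriber that owns a broker connection. */
    static final class Subscriber {
        private final AtomicBoolean failed = new AtomicBoolean(false);

        /** Would be called by the connection's exception listener. */
        void onConnectionError(Exception cause) {
            failed.set(true);
        }

        boolean isFailed() {
            return failed.get();
        }
    }

    /** Stand-in for a long-running client that polls its subscriber. */
    static final class Client {
        private final Subscriber subscriber;
        private final AtomicBoolean closed = new AtomicBoolean(false);

        Client(Subscriber subscriber) {
            this.subscriber = subscriber;
        }

        /** Runs one health check; returns true while the client stays open. */
        boolean checkHealth() {
            if (subscriber.isFailed()) {
                closed.set(true);
            }
            return !closed.get();
        }

        boolean isClosed() {
            return closed.get();
        }
    }

    public static void main(String[] args) {
        Subscriber subscriber = new Subscriber();
        Client client = new Client(subscriber);
        System.out.println("Open before error: " + client.checkHealth());

        // Simulate the heartbeat timeout reported by the connection.
        subscriber.onConnectionError(new IllegalStateException("Timed out while waiting for heartbeat from peer."));
        System.out.println("Open after error: " + client.checkHealth());
    }
}
```

Without a heartbeat, nothing ever calls the error handler when the network silently drops, which is why the baseline client kept logging that it was waiting for work.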

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T23:53:36Z


Some minor enhancements in commit:wres|0034373c3c746369503eaaf173b0a53c6b18ac87.