Open epag opened 2 months ago
Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T18:36:58Z
From what I can tell, @dsniff@ is obsolete.
Jesse, have you used @tcpkill@ recently and, if so, where did you get it? I can't see anything relevant in the RHEL 8 UBI AppStream repository:
https://cdn-ubi.redhat.com/content/public/ubi/dist/ubi8/8/x86_64/appstream/os/Packages/
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-05-28T18:38:12Z
It's been a couple of years, so it might be obsolete, yes.
Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T18:44:19Z
That's a shame because it sounds like it would be useful in this instance.
I might take a punt on setting a heartbeat in the connection URL, mainly because it won't do any harm, but it's a bit disappointing not to be able to test it. I wonder how others have tested it (I can't find much). The straightforward sad path involves either the client or the broker closing an AMQP connection and generating an exception, which doesn't exercise a network failure that causes a TCP/IP socket close/reset.
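For reference, setting a heartbeat should amount to a one-option change to the connection URL. The following is a minimal sketch, assuming the legacy Qpid JMS (AMQP 0-x) URL syntax in which @heartbeat@ is a per-broker option, given in seconds and appended to the broker address. The class name, method name, credentials, client identifier and broker address are all hypothetical placeholders, and where WRES actually assembles its URL is not shown in this thread.

```java
public class HeartbeatUrlSketch {

    /**
     * Builds a connection URL whose single broker carries a heartbeat option.
     * Assumption: broker options follow the broker address after '?', with
     * values wrapped in single quotes, as in the legacy Qpid JMS URL format.
     */
    static String withHeartbeat(String credentials,
                                String clientId,
                                String brokerAddress,
                                int heartbeatSeconds) {
        if (heartbeatSeconds < 0) {
            throw new IllegalArgumentException("Heartbeat must be zero or positive.");
        }
        // Hypothetical placement of the option; verify against the client docs.
        String broker = brokerAddress + "?heartbeat='" + heartbeatSeconds + "'";
        return "amqp://" + credentials + "@" + clientId + "/?brokerlist='" + broker + "'";
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only.
        System.out.println(withHeartbeat("guest:guest", "wres-graphics",
                                         "tcp://eventsbroker:5673", 5));
    }
}
```

A short interval, such as a few seconds, keeps detection prompt on a quiet connection, at the cost of a little extra traffic.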
Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-05-28T18:56:41Z
Surely there is another way to achieve the same goal. I don't know how hard it is. Does it involve compiling or getting a different kernel? I see some other tools mentioned out there but I don't have experience with them, nor do I have the ability to find and try tools using this laptop or VMs at NWC.
Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T19:06:06Z
Jesse wrote:
> Surely there is another way to achieve the same goal. I don't know how hard it is. Does it involve compiling or getting a different kernel? I see some other tools mentioned out there but I don't have experience with them, nor do I have the ability to find and try tools using this laptop or VMs at NWC.
There are a few but, browsing Stack Overflow and other forums, the results appear to be very mixed indeed. @ss@ (with its @--kill@ option) looks like the best option overall, but the RHEL 8 UBI is built with @CONFIG_INET_DIAG_DESTROY@ disabled. So, yes, it would involve building the kernel with that option enabled.
Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T20:14:13Z
Ah, this may be simpler than I thought.
https://stackoverflow.com/questions/56211818/how-to-disable-network-for-a-running-docker-container
Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T20:16:13Z
https://docs.docker.com/engine/reference/commandline/network_disconnect/
Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T20:19:32Z
Baseline: no heartbeat on the graphics client, disconnect the @eventsbroker@ container (@255cb9698f53@) from the @wres_wres_net@ bridge network:
$ docker network disconnect wres_wres_net 255cb9698f53
Graphics client is oblivious.
2021-05-28T20:13:17.574+0000 INFO GraphicsClient Finished creating WRES Graphics Client with subscriber identifier 24JAcVOgg7bhUzB-e_GVhCQ6Lac.
2021-05-28T20:13:17.578+0000 INFO GraphicsClient WRES Graphics client 24JAcVOgg7bhUzB-e_GVhCQ6Lac is running.
2021-05-28T20:13:17.579+0000 INFO GraphicsClient Evaluation subscriber 24JAcVOgg7bhUzB-e_GVhCQ6Lac is waiting for work. Until now, received 0 packets of statistics across 0 evaluations.
2021-05-28T20:14:17.579+0000 INFO GraphicsClient Evaluation subscriber 24JAcVOgg7bhUzB-e_GVhCQ6Lac is waiting for work. Until now, received 0 packets of statistics across 0 evaluations.
2021-05-28T20:15:17.579+0000 INFO GraphicsClient Evaluation subscriber 24JAcVOgg7bhUzB-e_GVhCQ6Lac is waiting for work. Until now, received 0 packets of statistics across 0 evaluations.
2021-05-28T20:16:17.579+0000 INFO GraphicsClient Evaluation subscriber 24JAcVOgg7bhUzB-e_GVhCQ6Lac is waiting for work. Until now, received 0 packets of statistics across 0 evaluations.
2021-05-28T20:17:17.579+0000 INFO GraphicsClient Evaluation subscriber 24JAcVOgg7bhUzB-e_GVhCQ6Lac is waiting for work. Until now, received 0 packets of statistics across 0 evaluations.
2021-05-28T20:18:17.579+0000 INFO GraphicsClient Evaluation subscriber 24JAcVOgg7bhUzB-e_GVhCQ6Lac is waiting for work. Until now, received 0 packets of statistics across 0 evaluations.
2021-05-28T20:19:17.579+0000 INFO GraphicsClient Evaluation subscriber 24JAcVOgg7bhUzB-e_GVhCQ6Lac is waiting for work. Until now, received 0 packets of statistics across 0 evaluations.
Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T20:24:34Z
Core client is not oblivious when I push an evaluation through (expected).
2021-05-28T20:24:17.055+0000 INFO BrokerConnectionFactory Retrying connection to amqp://guest:guest@wres-core/?brokerlist='tcp://eventsbroker:5673'&rejectbehaviour='server'&retries='5'&connectdelay='5000'&failover='nofailover' following 1 failed connection attempts. This is retry 1 of 5.
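The @retries='5'@ and @connectdelay='5000'@ options in the URL above suggest a bounded number of retries with a fixed delay between them. The following is a minimal sketch of that idea only, not the Qpid or WRES implementation; the class name, method name and the simulated connection attempt are hypothetical.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetrySketch {

    /**
     * Calls the task once, then retries up to 'retries' more times,
     * sleeping for a fixed delay before each retry.
     */
    static <T> T connectWithRetries(Callable<T> task, int retries, long delayMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt <= retries; attempt++) {
            if (attempt > 0) {
                System.out.println("This is retry " + attempt + " of " + retries + ".");
                Thread.sleep(delayMillis);
            }
            try {
                return task.call();
            } catch (Exception e) {
                last = e; // Remember the failure and try again, if retries remain.
            }
        }
        throw new IllegalStateException(
                "Retries exhausted after " + (retries + 1) + " attempts.", last);
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Hypothetical connection attempt that fails twice before succeeding.
        String result = connectWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("broker unreachable");
            }
            return "connected";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Once the retries are exhausted, the failure is no longer recoverable at this level and has to be handled by whatever owns the connection.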
Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T20:31:10Z
Now, repeating the experiment with a 5-second heartbeat on the graphics client, I see this. Nice!
2021-05-28T20:30:34.967+0000 WARN AMQProtocolHandler Timed out while waiting for heartbeat from peer.
2021-05-28T20:30:34.976+0000 ERROR EvaluationSubscriber Message subscriber 5LpIJMo8SSL4eyT_ps0ccpUqdec has been flagged as failed without the possibility of recovery.
wres.events.subscribe.UnrecoverableSubscriberException: Encountered an error on connection DJBJAfdGnTMjNIhkXjiYKhFQER0 owned by subscriber 5LpIJMo8SSL4eyT_ps0ccpUqdec. If a failover policy was configured on the connection factory (e.g., connection retries), then that policy was exhausted before this error was thrown. As such, the error is not recoverable and the subscriber will now stop.
at wres.events.subscribe.EvaluationSubscriber$ConnectionExceptionListener.onException(EvaluationSubscriber.java:1724)
at org.apache.qpid.client.AMQConnection$2.run(AMQConnection.java:1686)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: javax.jms.JMSException: Exception thrown against AMQConnection:
Host: eventsbroker
Port: 5673
Virtual Host:
Client ID: wres-graphics
Active session count: 1: org.apache.qpid.AMQDisconnectedException: Server closed connection and reconnection not permitted.
at org.apache.qpid.client.AMQConnection.convertToJMSException(AMQConnection.java:1627)
at org.apache.qpid.client.AMQConnection.closed(AMQConnection.java:1639)
at org.apache.qpid.client.AMQProtocolHandler.closed(AMQProtocolHandler.java:235)
at org.apache.qpid.client.AMQConnectionDelegate_8_0$ReceiverClosedWaiter.closed(AMQConnectionDelegate_8_0.java:563)
at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:225)
... 1 common frames omitted
Caused by: org.apache.qpid.AMQDisconnectedException: Server closed connection and reconnection not permitted.
at org.apache.qpid.client.AMQProtocolHandler.closed(AMQProtocolHandler.java:236)
... 3 common frames omitted
2021-05-28T20:30:35.036+0000 WARN AMQProtocolHandler Timed out while waiting for heartbeat from peer.
2021-05-28T20:30:35.038+0000 WARN AMQProtocolHandler Timed out while waiting for heartbeat from peer.
2021-05-28T20:30:35.046+0000 ERROR GraphicsClient While checking the graphics client for the health of its subscribers, discovered a failed subscriber with identifier 5LpIJMo8SSL4eyT_ps0ccpUqdec. The graphics client will now close.
2021-05-28T20:30:35.048+0000 INFO GraphicsClient Closing WRES Graphics Client 5LpIJMo8SSL4eyT_ps0ccpUqdec...
2021-05-28T20:30:35.049+0000 INFO GraphicsClient Closing broker connections wres.eventsbroker.BrokerConnectionFactory@16610890.
2021-05-28T20:30:35.049+0000 INFO BrokerConnectionFactory Closing broker connection factory wres.eventsbroker.BrokerConnectionFactory@16610890 and all associated broker connections.
2021-05-28T20:30:35.050+0000 INFO GraphicsClient Closed WRES Graphics Client 5LpIJMo8SSL4eyT_ps0ccpUqdec, which ran for 'PT1M45.017646S' and processed 0 packets of statistics across 0 evaluations.
2021-05-28T20:30:36.017+0000 INFO GraphicsClient WRES Graphics Client version 20210527-6509b62-dev
.
.
Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T20:33:51Z
So this is a pretty simple addition that improves resilience. It doesn't address the underlying issues associated with making the wres messaging applications more resilient (end goal), but it does allow a graphics client to exit promptly on a network failure and for a new graphics client to spawn and continue to accept work, thereby mitigating #92536.
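The fail-fast behaviour seen in the logs above (a connection error flags the subscriber as failed, a periodic health check notices and closes the client) can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the WRES code; the class and method names are hypothetical, and a plain method stands in for the JMS exception listener.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class FailFastClientSketch {

    private final AtomicBoolean subscriberFailed = new AtomicBoolean();
    private final CountDownLatch closed = new CountDownLatch(1);
    // Daemon thread, so an idle checker never keeps the JVM alive.
    private final ScheduledExecutorService checker =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "health-checker");
                t.setDaemon(true);
                return t;
            });

    /** Stand-in for a connection exception listener: flags an unrecoverable failure. */
    void onConnectionException(Exception e) {
        subscriberFailed.set(true);
    }

    /** Periodically checks subscriber health and closes the client once a failure is flagged. */
    void start(long periodMillis) {
        checker.scheduleAtFixedRate(() -> {
            if (subscriberFailed.get()) {
                closed.countDown();
                checker.shutdown(); // Also cancels this periodic task.
            }
        }, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** Waits for the client to close; returns true if it closed within the timeout. */
    boolean awaitClosed(long timeoutMillis) throws InterruptedException {
        return closed.await(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        FailFastClientSketch client = new FailFastClientSketch();
        client.start(50L);
        // Simulate a heartbeat timeout surfacing as a connection exception.
        client.onConnectionException(
                new RuntimeException("Timed out while waiting for heartbeat from peer."));
        System.out.println("Closed: " + client.awaitClosed(2000L));
    }
}
```

The design choice is to stop rather than attempt recovery in place: whatever supervises the process can then start a fresh client, which is simpler than repairing a half-dead connection.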
Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T23:53:36Z
Some minor enhancements in commit:wres|0034373c3c746369503eaaf173b0a53c6b18ac87.
Author Name: James (James) Original Redmine Issue: 87105, https://vlab.noaa.gov/redmine/issues/87105 Original Date: 2021-01-20 Original Assignee: James
Given an evaluation that is underway
When one or more messaging components experience errors that are recoverable, in principle
Then the evaluation should recover and succeed in as many situations as practicable
Redmine related issue(s): 90087, 92536, 119833, 121414