NOAA-OWP / wres

Code and scripts for the Water Resources Evaluation Service

As a user, I want improved resilience when recoverable errors occur in messaging components during an evaluation #247

Open epag opened 3 weeks ago

epag commented 3 weeks ago

Author Name: James (James) Original Redmine Issue: 87105, https://vlab.noaa.gov/redmine/issues/87105 Original Date: 2021-01-20 Original Assignee: James


Given an evaluation that is underway
When one or more messaging components experience errors that are recoverable, in principle
Then the evaluation should recover and succeed in as many situations as practicable


Redmine related issue(s): 90087, 92536, 119833, 121414


epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-20T20:39:39Z


This potentially captures a broad group of failure modes, some of which are much easier to address than others.

These failure modes may be categorized by the origin of the failure (e.g., a publisher client, a consumer client, an eventsbroker instance) and by the state of the evaluation in progress (not yet negotiated; negotiated but publication not yet started; publication started but consumption not yet started; publication and consumption both started; etc.).

Some of these failure modes already show resilience. For example, there are connection retries by the graphics clients and message delivery retries under many circumstances.

On the other hand, not all failure modes currently show resilience. For example, there are no connection retries on the core client instance.
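A connection retry amounts to a small loop around the connect attempt. A minimal sketch, independent of any particular JMS client (names and limits are illustrative, mirroring broker URL options such as retries='5' and connectdelay='5000'):

```java
import java.util.concurrent.Callable;

/** Minimal connection-retry helper: retries a connect attempt a fixed
 *  number of times with a fixed delay between attempts. Illustrative
 *  only; not the wres implementation. */
final class ConnectionRetry {
    static <T> T connectWithRetries(Callable<T> connect, int retries, long delayMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt <= retries; attempt++) {
            try {
                return connect.call(); // success: hand back the connection
            } catch (Exception e) {
                last = e;
                if (attempt < retries) {
                    Thread.sleep(delayMs); // wait before the next attempt
                }
            }
        }
        throw last; // all attempts exhausted
    }
}
```

The same loop applies wherever a client currently gives up on the first failed connection, such as the core client instance.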

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-20T20:43:52Z


In general, broker failures, without messaging client failures, should be easier to recover from, including when the messaging of the evaluation is underway. This is because resilience is inherently part of the implementation of a broker; more specifically, we use durable queues and persistent messages, so the state of an evaluation is preserved from the perspective of the broker across the failure. From the perspective of each client, the state of the evaluation is maintained in memory because a client has not failed.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-20T20:53:28Z


Conversely, it will be much harder to recover from a messaging client failure when the messaging of an evaluation is underway. This is because the messaging client preserves the state of an evaluation, such as the evaluation description, how many messages should be expected in each message group etc.

In short, it will be necessary to persist this client state across the failure (e.g., using redis/redisson or something else).

The client state may include statistics that have been ack-ed but have not been finally consumed. For example, consider a png graphic composed of several pools of which all but one pool has been ack-ed at the time of failure and the last pool is awaited prior to consumption. In that case, the ack-ed messages have already left the broker and are now stored in client data structures (the precise details depend on whether transactions are used, but they are not, currently). These structures must be recovered.
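As a sketch of what persisting that client state could look like, the following stand-in durably records the expected message count per message group so that a restarted client can recover it. A java.util.Properties file stands in for an external store such as redis/redisson, and all names are illustrative:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Sketch: durable store for per-group expected message counts, so a
 *  restarted client can resume a partially complete evaluation. A
 *  Properties file stands in for redis/redisson or similar. */
final class GroupStateStore {
    private final Path file;
    private final Properties state = new Properties();

    GroupStateStore(Path file) throws IOException {
        this.file = file;
        if (Files.exists(file)) {
            try (InputStream in = Files.newInputStream(file)) {
                state.load(in); // recover state written before the failure
            }
        }
    }

    void setExpectedCount(String groupId, int count) throws IOException {
        state.setProperty(groupId, Integer.toString(count));
        try (OutputStream out = Files.newOutputStream(file)) {
            state.store(out, null); // flush on every update
        }
    }

    Integer getExpectedCount(String groupId) {
        String v = state.getProperty(groupId);
        return v == null ? null : Integer.valueOf(v);
    }
}
```

A real store would also need to hold the evaluation description and the ack-ed-but-not-consumed statistics mentioned above, not just the counts.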

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-20T20:58:06Z


It is worth noting that we already have a degree of resilience at the level of a whole evaluation. If a worker fails, an evaluation in progress is restarted on a new or recovered worker (as I understand it). But restarting a whole evaluation is much easier than allowing a partially complete evaluation to recover, depending on the state of the evaluation at the time of failure. Again, it is much easier to handle the situation where the broker has failed because broker implementations support several recovery mechanisms out-of-the-box.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-20T21:01:23Z


Anyway, the goal of this ticket is to identify and fill gaps where recovery does not currently happen but is practicable to achieve (i.e., where the benefit is proportional to the effort).

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-25T10:48:40Z


Interim goal is to improve retry behavior in order to allow for an evaluation-in-progress (including the messaging portion, in particular) to survive broker restarts. Improvements include retries on publishing and exponential back-off on both publishing and consuming, as well as a reduced list of exceptions that trigger retries on consumption (not all @RuntimeException@).
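The retry behavior described above can be sketched as a small policy object: exponential back-off between attempts, and only a reduced set of exception types triggering a retry. Everything here is illustrative, not the actual wres implementation:

```java
import java.util.Set;
import java.util.concurrent.Callable;

/** Sketch of the interim retry policy: exponential back-off, and only a
 *  restricted set of exception types triggers a retry (rather than all
 *  RuntimeExceptions). Names and limits are illustrative. */
final class RetryPolicy {
    private final Set<Class<? extends Exception>> retryable;
    private final int maxRetries;
    private final long baseDelayMs;

    RetryPolicy(Set<Class<? extends Exception>> retryable, int maxRetries, long baseDelayMs) {
        this.retryable = retryable;
        this.maxRetries = maxRetries;
        this.baseDelayMs = baseDelayMs;
    }

    <T> T call(Callable<T> op) throws Exception {
        long delay = baseDelayMs;
        for (int attempt = 0;; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                boolean shouldRetry = attempt < maxRetries
                        && retryable.stream().anyMatch(c -> c.isInstance(e));
                if (!shouldRetry) {
                    throw e; // not recoverable: propagate immediately
                }
                Thread.sleep(delay);
                delay *= 2; // exponential back-off between attempts
            }
        }
    }
}
```

The same policy can wrap both publication and consumption, with different retryable sets for each.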

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-25T13:02:25Z


Have retries on publication and a unit test passing for the low-level @MessagePublisher@.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-25T14:21:08Z


Much harder than it looks, even for the relatively "simple" case of a broker restart.

First issue is this:

java.lang.IllegalStateException: Cannot reset the expected message count for message group 07142019-21187401 after the group has been consumed.
    at wres.events.subscribe.OneGroupConsumer.setExpectedMessageCount(OneGroupConsumer.java:187)
    at wres.events.subscribe.EvaluationConsumer.setExpectedMessageCountForGroups(EvaluationConsumer.java:804)
    at wres.events.subscribe.EvaluationConsumer.acceptStatusMessage(EvaluationConsumer.java:411)
    at wres.events.subscribe.EvaluationSubscriber.processStatusEvent(EvaluationSubscriber.java:745)
    at wres.events.subscribe.EvaluationSubscriber.lambda$2(EvaluationSubscriber.java:599)
    at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:758)
    at org.apache.qpid.client.BasicMessageConsumer_0_10.notifyMessage(BasicMessageConsumer_0_10.java:131)
    at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:732)
    at org.apache.qpid.client.BasicMessageConsumer_0_10.notifyMessage(BasicMessageConsumer_0_10.java:177)
    at org.apache.qpid.client.BasicMessageConsumer_0_10.notifyMessage(BasicMessageConsumer_0_10.java:55)
    at org.apache.qpid.client.AMQSession$Dispatcher.notifyConsumer(AMQSession.java:3705)
    at org.apache.qpid.client.AMQSession$Dispatcher.dispatchMessage(AMQSession.java:3637)
    at org.apache.qpid.client.AMQSession$Dispatcher.access$1300(AMQSession.java:3406)
    at org.apache.qpid.client.AMQSession.dispatch(AMQSession.java:3399)
    at org.apache.qpid.client.message.UnprocessedMessage.dispatch(UnprocessedMessage.java:54)
    at org.apache.qpid.client.AMQSession$Dispatcher.run(AMQSession.java:3547)
    at java.base/java.lang.Thread.run(Thread.java:834)

Needs some further debugging, but is it like this?

  1. Message with expected count is sent;
  2. Message with expected count is received and sent to the consumer and consumed;
  3. Broker goes down;
  4. ( Message with expected count would have been ack-ed here if the broker had not died. )
  5. Broker comes back up and redelivers all unack-ed messages that are marked persistent and live on durable queues.
  6. Oh, exception, the consumer has already received this message.

In other words, the consumption of the message and ack-ing of the message are not atomic?

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-25T15:06:52Z


James wrote:

Needs some further debugging, but is it like this?

Message with expected count is sent;

Message with expected count is received and sent to the consumer and consumed;

Broker goes down;

( Message with expected count would have been ack-ed here if the broker had not died. )

Broker comes back up and redelivers all unack-ed messages that are marked persistent and live on durable queues.

Oh, exception, the consumer has already received this message.

Hmm, doesn't appear to be that sequence. Even if the expected-count-message is ack-ed in advance of consumption, the same thing happens. Something about the choreography of the group consumption instead? Or something about the way the broker, upon restart, redelivers (un)ack-ed messages on a durable queue that I am not following...

To be clear, my expectation on the latter is that all unack-ed messages are redelivered and no ack-ed ones are redelivered. But perhaps there is a potential for message duplication that I have not fully understood. Need to look at the JMS API docs more closely too.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-25T15:29:01Z


James wrote:

Or something about the way the broker, upon restart, redelivers (un)ack-ed messages on a durable queue that I am not following...

To be clear, my expectation on the latter is that all unack-ed messages are redelivered and no ack-ed ones are redelivered. But perhaps there is a potential for message duplication that I have not fully understood. Need to look at the JMS API docs more closely too.

Seems that my understanding is not correct somehow. For example, after a broker restart, I see this evaluation status message appearing twice on a status message listener (once before restart, once after) even with ack-ing in advance of consumption.

clientId: "HOWvYK39RfKS_m4TDcNsofKKprw"
completion_status: GROUP_PUBLICATION_COMPLETE
groupId: "07103797-1529459"
message_count: 36

So it appears that delivery duplication is possible when spanning a broker restart, which makes things much harder. Looking at the docs for more insight into this...

( edit: aside: looks like those two id field names do not follow the protobuf convention - need to fix that. )

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-25T15:58:50Z


James wrote:

which makes things much harder.

To be clear about this, I mean that it is harder to design idempotent message consumers (consumers that display the same behavior if they are called multiple times with the same messages) because it means that a unique message identifier needs to be added and tracked with each message and client (and the JMS identifiers are not preserved across clients).
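A minimal sketch of such an idempotent consumer, using an application-level message identifier and an in-memory seen-set. Illustrative only; a real implementation would also need to persist the seen-set across client failures:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

/** Sketch of an idempotent consumer: each message carries an
 *  application-level unique identifier (since JMS identifiers are not
 *  preserved across clients), and a seen-set suppresses duplicates
 *  redelivered after a broker restart. */
final class IdempotentConsumer<T> {
    private final Set<String> seen = new HashSet<>();
    private final Consumer<T> delegate;

    IdempotentConsumer(Consumer<T> delegate) {
        this.delegate = delegate;
    }

    /** Consumes the message only if its id has not been seen before.
     *  Returns true if consumed, false if suppressed as a duplicate. */
    boolean accept(String messageId, T message) {
        if (!seen.add(messageId)) {
            return false; // duplicate redelivery: ignore
        }
        delegate.accept(message);
        return true;
    }
}
```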

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-25T17:56:58Z


Struggling to find relevant documentation, but created a test harness:

  1. Iterate 100 integers, publishing each one, with a 100ms delay between
  2. Durable subscriber subscribes and a consumer prints received messages to standard out, acknowledging each one on receipt
  3. Broker restart happens somewhere during the sequence
  4. Connection retries happen and sequence restarts

I see this (duplication of ack-ed messages):

2021-01-25T17:50:29.716+0000 [main] WARN wres.eventsbroker.BrokerConnectionFactory - Probing for an active AMQP broker at the binding URL amqp://guest:guest@test/?brokerlist='tcp://localhost:5673'&retries='5'&connectdelay='5000'. This may take some time if no active broker exists and retries are configured. If no active broker is discovered, an embedded broker will be started.
2021-01-25T17:50:30.245+0000 [main] INFO wres.eventsbroker.BrokerConnectionFactory - Created a connection factory with name connectionfactory.statisticsFactory and binding URL amqp://guest:guest@test/?brokerlist='tcp://localhost:5673'&retries='5'&connectdelay='5000'.
0
1
2
3
4
5
6
7
8
9
10
2021-01-25T17:50:31.728+0000 [IoRcvr-/127.0.0.1:55006-localhost/127.0.0.1:5673] ERROR org.apache.qpid.client.AMQConnectionDelegate_0_10 - previous exception
org.apache.qpid.transport.ConnectionException: Connection reset
    at org.apache.qpid.transport.Connection.exception(Connection.java:554)
    at org.apache.qpid.transport.network.Assembler.exception(Assembler.java:108)
    at org.apache.qpid.transport.network.InputHandler.exception(InputHandler.java:214)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:220)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.net.SocketException: Connection reset
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:159)
    ... 1 common frames omitted
2021-01-25T17:50:31.728+0000 [IoRcvr-/127.0.0.1:55007-localhost/127.0.0.1:5673] ERROR org.apache.qpid.client.AMQConnectionDelegate_0_10 - previous exception
org.apache.qpid.transport.ConnectionException: Connection reset
    at org.apache.qpid.transport.Connection.exception(Connection.java:554)
    at org.apache.qpid.transport.network.Assembler.exception(Assembler.java:108)
    at org.apache.qpid.transport.network.InputHandler.exception(InputHandler.java:214)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:220)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.net.SocketException: Connection reset
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:159)
    ... 1 common frames omitted
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99

Would expect the sequence to restart at 11, not 0.
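The observed behavior can be reduced to an in-memory simulation: a durable queue that persists messages but does not persist acknowledgements will redeliver every message after a restart, including ack-ed ones. This sketch is entirely illustrative and independent of JMS:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** In-memory simulation of the harness: a durable queue redelivers
 *  un-acked messages after a "broker restart". If the acks themselves
 *  are not durably recorded, every persisted message is redelivered and
 *  the consumer sees the sequence restart at 0. Illustrative only. */
final class SimulatedDurableQueue {
    private final List<Integer> persisted = new ArrayList<>();
    private final Queue<Integer> pending = new ArrayDeque<>();
    private final List<Integer> acked = new ArrayList<>();
    private final boolean acksPersisted;

    SimulatedDurableQueue(boolean acksPersisted) {
        this.acksPersisted = acksPersisted;
    }

    void publish(int m) { persisted.add(m); pending.add(m); }
    Integer receive() { return pending.poll(); }
    void ack(int m) { acked.add(m); }

    /** Broker restart: redeliver persisted messages not known to be acked. */
    void restart() {
        pending.clear();
        for (int m : persisted) {
            if (!acksPersisted || !acked.contains(m)) {
                pending.add(m); // redelivered to the consumer
            }
        }
        if (!acksPersisted) acked.clear(); // ack state lost with the broker
    }
}
```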

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-25T19:41:43Z


Just weird behavior. If I insert multiple broker restarts into the message sequence, it seems to pick up at the correct position for the one-prior restart. Wtf?

( Also added the @JMSRedelivered@ boolean flag from the amqp metadata to the count to see whether this reveals anything useful and it doesn't. )

2021-01-25T19:34:58.135+0000 [main] WARN wres.eventsbroker.BrokerConnectionFactory - Probing for an active AMQP broker at the binding URL amqp://guest:guest@test/?brokerlist='tcp://localhost:5673'&rejectbehaviour='server'&retries='5'&connectdelay='5000'. This may take some time if no active broker exists and retries are configured. If no active broker is discovered, an embedded broker will be started.
2021-01-25T19:34:58.713+0000 [main] INFO wres.eventsbroker.BrokerConnectionFactory - Created a connection factory with name connectionfactory.statisticsFactory and binding URL amqp://guest:guest@test/?brokerlist='tcp://localhost:5673'&rejectbehaviour='server'&retries='5'&connectdelay='5000'.
0: false
1: false
2: false
3: false
4: false
5: false
6: false
7: false
2021-01-25T19:35:06.776+0000 [IoRcvr-/127.0.0.1:58159-localhost/127.0.0.1:5673] ERROR org.apache.qpid.client.AMQConnectionDelegate_0_10 - previous exception
org.apache.qpid.transport.ConnectionException: Connection reset
    at org.apache.qpid.transport.Connection.exception(Connection.java:554)
    at org.apache.qpid.transport.network.Assembler.exception(Assembler.java:108)
    at org.apache.qpid.transport.network.InputHandler.exception(InputHandler.java:214)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:220)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.net.SocketException: Connection reset
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:159)
    ... 1 common frames omitted
2021-01-25T19:35:06.777+0000 [IoRcvr-/127.0.0.1:58157-localhost/127.0.0.1:5673] ERROR org.apache.qpid.client.AMQConnectionDelegate_0_10 - previous exception
org.apache.qpid.transport.ConnectionException: Connection reset
    at org.apache.qpid.transport.Connection.exception(Connection.java:554)
    at org.apache.qpid.transport.network.Assembler.exception(Assembler.java:108)
    at org.apache.qpid.transport.network.InputHandler.exception(InputHandler.java:214)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:220)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.net.SocketException: Connection reset
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:159)
    ... 1 common frames omitted
0: false
1: false
2: false
3: false
4: false
5: false
6: false
7: false
8: false
9: false
10: false
11: false
12: false
13: false
14: false
15: false
16: false
17: false
18: false
2021-01-25T19:35:22.590+0000 [IoRcvr-/127.0.0.1:58186-localhost/127.0.0.1:5673] ERROR org.apache.qpid.client.AMQConnectionDelegate_0_10 - previous exception
org.apache.qpid.transport.ConnectionException: Connection reset
    at org.apache.qpid.transport.Connection.exception(Connection.java:554)
    at org.apache.qpid.transport.network.Assembler.exception(Assembler.java:108)
    at org.apache.qpid.transport.network.InputHandler.exception(InputHandler.java:214)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:220)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.net.SocketException: Connection reset
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:159)
    ... 1 common frames omitted
2021-01-25T19:35:22.590+0000 [IoRcvr-/127.0.0.1:58187-localhost/127.0.0.1:5673] ERROR org.apache.qpid.client.AMQConnectionDelegate_0_10 - previous exception
org.apache.qpid.transport.ConnectionException: Connection reset
    at org.apache.qpid.transport.Connection.exception(Connection.java:554)
    at org.apache.qpid.transport.network.Assembler.exception(Assembler.java:108)
    at org.apache.qpid.transport.network.InputHandler.exception(InputHandler.java:214)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:220)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.net.SocketException: Connection reset
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:159)
    ... 1 common frames omitted
8: false
9: false
10: false
11: false
12: false
13: false
14: false
15: false
16: false
17: false
18: false
19: false
20: false
21: false
22: false
23: false
24: false
25: false
26: false
27: false
28: false
29: false
30: false
31: false
32: false
2021-01-25T19:35:43.074+0000 [IoRcvr-/127.0.0.1:58193-localhost/127.0.0.1:5673] ERROR org.apache.qpid.client.AMQConnectionDelegate_0_10 - previous exception
org.apache.qpid.transport.ConnectionException: Connection reset
    at org.apache.qpid.transport.Connection.exception(Connection.java:554)
    at org.apache.qpid.transport.network.Assembler.exception(Assembler.java:108)
    at org.apache.qpid.transport.network.InputHandler.exception(InputHandler.java:214)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:220)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.net.SocketException: Connection reset
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:159)
    ... 1 common frames omitted
2021-01-25T19:35:43.074+0000 [IoRcvr-/127.0.0.1:58194-localhost/127.0.0.1:5673] ERROR org.apache.qpid.client.AMQConnectionDelegate_0_10 - previous exception
org.apache.qpid.transport.ConnectionException: Connection reset
    at org.apache.qpid.transport.Connection.exception(Connection.java:554)
    at org.apache.qpid.transport.network.Assembler.exception(Assembler.java:108)
    at org.apache.qpid.transport.network.InputHandler.exception(InputHandler.java:214)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:220)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.net.SocketException: Connection reset
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:159)
    ... 1 common frames omitted
19: false
20: false
21: false
22: false
23: false
24: false
25: false
26: false
27: false
28: false
29: false
30: false
31: false
32: false
33: false
34: false
35: false
36: false
37: false
38: false
39: false
40: false
41: false
42: false
43: false
44: false
45: false
46: false
47: false
48: false
49: false
50: false
51: false
52: false
53: false
54: false
55: false
56: false
2021-01-25T19:36:17.355+0000 [IoRcvr-/127.0.0.1:58253-localhost/127.0.0.1:5673] ERROR org.apache.qpid.client.AMQConnectionDelegate_0_10 - previous exception
org.apache.qpid.transport.ConnectionException: Connection reset
    at org.apache.qpid.transport.Connection.exception(Connection.java:554)
    at org.apache.qpid.transport.network.Assembler.exception(Assembler.java:108)
    at org.apache.qpid.transport.network.InputHandler.exception(InputHandler.java:214)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:220)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.net.SocketException: Connection reset
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:159)
    ... 1 common frames omitted
2021-01-25T19:36:17.356+0000 [IoRcvr-/127.0.0.1:58252-localhost/127.0.0.1:5673] ERROR org.apache.qpid.client.AMQConnectionDelegate_0_10 - previous exception
org.apache.qpid.transport.ConnectionException: Connection reset
    at org.apache.qpid.transport.Connection.exception(Connection.java:554)
    at org.apache.qpid.transport.network.Assembler.exception(Assembler.java:108)
    at org.apache.qpid.transport.network.InputHandler.exception(InputHandler.java:214)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:220)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.net.SocketException: Connection reset
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:159)
    ... 1 common frames omitted
33: false
34: false
35: false
36: false
37: false
38: false
39: false
40: false
41: false
42: false
43: false
44: false
45: false
46: false
47: false
48: false
49: false
50: false
51: false
52: false
53: false
54: false
55: false
56: false
57: false
58: false
59: false
60: false
61: false
62: false
63: false
64: false
65: false
66: false
67: false
68: false
69: false
2021-01-25T19:36:40.913+0000 [IoRcvr-/127.0.0.1:58290-localhost/127.0.0.1:5673] ERROR org.apache.qpid.client.AMQConnectionDelegate_0_10 - previous exception
org.apache.qpid.transport.ConnectionException: Connection reset
    at org.apache.qpid.transport.Connection.exception(Connection.java:554)
    at org.apache.qpid.transport.network.Assembler.exception(Assembler.java:108)
    at org.apache.qpid.transport.network.InputHandler.exception(InputHandler.java:214)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:220)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.net.SocketException: Connection reset
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:159)
    ... 1 common frames omitted
2021-01-25T19:36:40.914+0000 [IoRcvr-/127.0.0.1:58289-localhost/127.0.0.1:5673] ERROR org.apache.qpid.client.AMQConnectionDelegate_0_10 - previous exception
org.apache.qpid.transport.ConnectionException: Connection reset
    at org.apache.qpid.transport.Connection.exception(Connection.java:554)
    at org.apache.qpid.transport.network.Assembler.exception(Assembler.java:108)
    at org.apache.qpid.transport.network.InputHandler.exception(InputHandler.java:214)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:220)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.net.SocketException: Connection reset
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:159)
    ... 1 common frames omitted
57: false
58: false
59: false
60: false
61: false
62: false
63: false
64: false
65: false
66: false
67: false
68: false
69: false
70: false
71: false
72: false
73: false
74: false
75: false
76: false
77: false
78: false
79: false
80: false
81: false
82: false
83: false
84: false
85: false
86: false
87: false
2021-01-25T19:37:04.483+0000 [IoRcvr-/127.0.0.1:58299-localhost/127.0.0.1:5673] ERROR org.apache.qpid.client.AMQConnectionDelegate_0_10 - previous exception
org.apache.qpid.transport.ConnectionException: Connection reset
    at org.apache.qpid.transport.Connection.exception(Connection.java:554)
    at org.apache.qpid.transport.network.Assembler.exception(Assembler.java:108)
    at org.apache.qpid.transport.network.InputHandler.exception(InputHandler.java:214)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:220)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.net.SocketException: Connection reset
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:159)
    ... 1 common frames omitted
2021-01-25T19:37:04.484+0000 [IoRcvr-/127.0.0.1:58298-localhost/127.0.0.1:5673] ERROR org.apache.qpid.client.AMQConnectionDelegate_0_10 - previous exception
org.apache.qpid.transport.ConnectionException: Connection reset
    at org.apache.qpid.transport.Connection.exception(Connection.java:554)
    at org.apache.qpid.transport.network.Assembler.exception(Assembler.java:108)
    at org.apache.qpid.transport.network.InputHandler.exception(InputHandler.java:214)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:220)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.net.SocketException: Connection reset
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
    at org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:159)
    ... 1 common frames omitted
70: false
71: false
72: false
73: false
74: false
75: false
76: false
77: false
78: false
79: false
80: false
81: false
82: false
83: false
84: false
85: false
86: false
87: false
88: false
89: false
90: false
91: false
92: false
93: false
94: false
95: false
96: false
97: false
98: false
99: false
epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-25T19:58:14Z


Perhaps I need to use the wres broker initial configuration for this test harness, which is backed by the Derby db; the default broker configuration I used may not be adequate. I wanted to keep things simple, but perhaps that was not the best approach. Trying the wres broker configuration instead, which is significantly more complex.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-25T20:10:19Z


Ah. Behavior makes sense with the actual wres configuration (which is significantly more complex, but also represents the culmination of a lot of trial and error), rather than the default broker configuration.

Sequence now progresses seamlessly upon several broker restarts, no exceptions, no evidence of any broker failures other than from the timing (which I have printed and highlighted for broker restarts - bold is where the broker restarted and picked up where it left off, which you can also see from the time delay).


2021-01-25T20:00:41.584+0000 [main] WARN wres.eventsbroker.BrokerConnectionFactory - Probing for an active AMQP broker at the binding URL amqp://guest:guest@test/?brokerlist='tcp://localhost:5673'&rejectbehaviour='server'&retries='5'&connectdelay='5000'. This may take some time if no active broker exists and retries are configured. If no active broker is discovered, an embedded broker will be started.
2021-01-25T20:00:41.951+0000 [main] INFO wres.eventsbroker.BrokerConnectionFactory - Created a connection factory with name connectionfactory.statisticsFactory and binding URL amqp://guest:guest@test/?brokerlist='tcp://localhost:5673'&rejectbehaviour='server'&retries='5'&connectdelay='5000'.
2021-01-25T20:00:42.563362900Z: 0: false
2021-01-25T20:00:43.563133800Z: 1: false
2021-01-25T20:00:44.577989800Z: 2: false
2021-01-25T20:00:45.580553500Z: 3: false
2021-01-25T20:00:46.592424800Z: 4: false
2021-01-25T20:00:47.599210500Z: 5: false
2021-01-25T20:00:48.609226200Z: 6: false
2021-01-25T20:00:49.614244400Z: 7: false
2021-01-25T20:00:50.619754Z: 8: false
**2021-01-25T20:01:02.472872500Z: 9: false**
2021-01-25T20:01:03.412222400Z: 10: false
2021-01-25T20:01:04.416782900Z: 11: false
2021-01-25T20:01:05.422130600Z: 12: false
2021-01-25T20:01:06.426806400Z: 13: false
2021-01-25T20:01:07.443942100Z: 14: false
2021-01-25T20:01:08.444777700Z: 15: false
2021-01-25T20:01:09.453240800Z: 16: false
2021-01-25T20:01:10.449868300Z: 17: false
2021-01-25T20:01:11.460609300Z: 18: false
2021-01-25T20:01:12.471169900Z: 19: false
2021-01-25T20:01:13.480943300Z: 20: false
2021-01-25T20:01:14.481228200Z: 21: false
2021-01-25T20:01:15.485846300Z: 22: false
**2021-01-25T20:01:23.348617900Z: 23: false**
2021-01-25T20:01:24.301572800Z: 24: false
2021-01-25T20:01:25.309853400Z: 25: false
2021-01-25T20:01:26.321177400Z: 26: false
2021-01-25T20:01:27.329353400Z: 27: false
2021-01-25T20:01:28.332472600Z: 28: false
2021-01-25T20:01:29.334694400Z: 29: false
2021-01-25T20:01:30.346280600Z: 30: false
2021-01-25T20:01:31.350961700Z: 31: false
2021-01-25T20:01:32.358166200Z: 32: false
2021-01-25T20:01:33.369650400Z: 33: false
2021-01-25T20:01:34.370847700Z: 34: false
2021-01-25T20:01:35.373837600Z: 35: false
2021-01-25T20:01:36.385801500Z: 36: false
2021-01-25T20:01:37.389217400Z: 37: false
2021-01-25T20:01:38.401129300Z: 38: false
2021-01-25T20:01:39.414984600Z: 39: false
2021-01-25T20:01:40.416614400Z: 40: false
2021-01-25T20:01:41.424261700Z: 41: false
**2021-01-25T20:02:01.575579600Z: 42: false**
2021-01-25T20:02:02.533803100Z: 43: false
2021-01-25T20:02:03.534749100Z: 44: false
2021-01-25T20:02:04.542878500Z: 45: false
2021-01-25T20:02:05.555247200Z: 46: false
2021-01-25T20:02:06.556244500Z: 47: false
2021-01-25T20:02:07.562573Z: 48: false
2021-01-25T20:02:08.572104400Z: 49: false
2021-01-25T20:02:09.577905700Z: 50: false
2021-01-25T20:02:10.583641300Z: 51: false
2021-01-25T20:02:11.589776500Z: 52: false
2021-01-25T20:02:12.592471800Z: 53: false
2021-01-25T20:02:13.605229800Z: 54: false
2021-01-25T20:02:14.609621100Z: 55: false
2021-01-25T20:02:15.617449100Z: 56: false
2021-01-25T20:02:16.622777500Z: 57: false
2021-01-25T20:02:17.629138Z: 58: false
2021-01-25T20:02:18.648028400Z: 59: false
2021-01-25T20:02:19.647790500Z: 60: false
2021-01-25T20:02:20.649003Z: 61: false
2021-01-25T20:02:21.653872500Z: 62: false
2021-01-25T20:02:22.661447700Z: 63: false
2021-01-25T20:02:23.659369200Z: 64: false
2021-01-25T20:02:24.671121Z: 65: false
2021-01-25T20:02:25.669708300Z: 66: false
2021-01-25T20:02:26.677286600Z: 67: false
2021-01-25T20:02:27.675744300Z: 68: false
2021-01-25T20:02:28.694578100Z: 69: false
2021-01-25T20:02:29.698843900Z: 70: false
**2021-01-25T20:02:48.284580200Z: 71: false**
2021-01-25T20:02:49.209067400Z: 72: false
2021-01-25T20:02:50.217329900Z: 73: false
2021-01-25T20:02:51.218758800Z: 74: false
2021-01-25T20:02:52.232734500Z: 75: false
2021-01-25T20:02:53.239737900Z: 76: false
2021-01-25T20:02:54.247809300Z: 77: false
2021-01-25T20:02:55.251804800Z: 78: false
2021-01-25T20:02:56.259247600Z: 79: false
2021-01-25T20:02:57.268324800Z: 80: false
2021-01-25T20:02:58.275629Z: 81: false
2021-01-25T20:02:59.293204400Z: 82: false
2021-01-25T20:03:00.294282200Z: 83: false
2021-01-25T20:03:01.309288300Z: 84: false
2021-01-25T20:03:02.309098100Z: 85: false
2021-01-25T20:03:03.322914800Z: 86: false
2021-01-25T20:03:04.324678100Z: 87: false
2021-01-25T20:03:05.331784100Z: 88: false
2021-01-25T20:03:06.339617700Z: 89: false
2021-01-25T20:03:07.342325500Z: 90: false
2021-01-25T20:03:08.346403500Z: 91: false
2021-01-25T20:03:09.350544800Z: 92: false
2021-01-25T20:03:10.353055600Z: 93: false
2021-01-25T20:03:11.366145200Z: 94: false
2021-01-25T20:03:12.375590500Z: 95: false
2021-01-25T20:03:13.379022400Z: 96: false
2021-01-25T20:03:14.389088800Z: 97: false
2021-01-25T20:03:15.389318600Z: 98: false
2021-01-25T20:03:16.404845Z: 99: false


epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-25T20:44:14Z


So does it work with the real thing? Running a long evaluation, restarting the broker, seeing what happens...

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-25T20:50:33Z


( This sounds nice. I would really like to make the switch, but it won't be a small effort because it is coupled with a switch to AMQP 1.0, so it involves much more than simply translating broker configurations.

https://activemq.apache.org/components/artemis/documentation/1.1.0/duplicate-detection.html )
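For orientation, the linked Artemis feature works by caching the value of a well-known message property and silently dropping any message whose identifier the broker has already seen. The following is a minimal, in-memory sketch of that idea only; the class and method names are hypothetical and are not part of any broker API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Illustration only: a bounded cache of duplicate identifiers, sketching the
 * broker-side duplicate detection described in the Artemis documentation.
 */
public class DuplicateIdCache
{
    private final Map<String, Boolean> seen;

    public DuplicateIdCache( int maxSize )
    {
        // Evict the oldest identifier once the cache is full, mirroring the
        // bounded duplicate-id cache a real broker maintains
        this.seen = new LinkedHashMap<>()
        {
            @Override
            protected boolean removeEldestEntry( Map.Entry<String, Boolean> eldest )
            {
                return size() > maxSize;
            }
        };
    }

    /** Returns true if the identifier was seen before, i.e., the message is a duplicate. */
    public synchronized boolean isDuplicate( String duplicateId )
    {
        return this.seen.put( duplicateId, Boolean.TRUE ) != null;
    }

    public static void main( String[] args )
    {
        DuplicateIdCache cache = new DuplicateIdCache( 100 );
        System.out.println( cache.isDuplicate( "message-1" ) ); // false: first delivery
        System.out.println( cache.isDuplicate( "message-1" ) ); // true: redelivery, e.g. after failover
    }
}
```

In the real feature, a publisher sets the duplicate identifier as a message property and the broker performs this check, which is what would make redelivery after a broker restart idempotent.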

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-25T21:34:31Z


James wrote:

So does it work with the real thing? Running a long evaluation, restarting the broker, seeing what happens...

Sort of.

The evaluation survived the broker restart and is ongoing. Edit: that is to say, both the core client and an external graphics client survived. First, the core client...

2021-01-25T20:16:57.567+0000 INFO Main WRES version is unknown, probably developer version.
.
.
.
2021-01-25T20:28:42.245+0000 INFO Operations Preliminary metadata loading is complete.
2021-01-25T20:28:45.368+0000 INFO FeatureReporter [1/367] Completed feature tuple '07138063-941030370'
2021-01-25T20:28:48.150+0000 INFO FeatureReporter [2/367] Completed feature tuple '07314500-13660637'
2021-01-25T20:28:49.629+0000 INFO FeatureReporter [3/367] Completed feature tuple '07261090-7817578'
2021-01-25T20:28:49.645+0000 INFO FeatureReporter [4/367] Completed feature tuple '07126415-935058'
2021-01-25T20:28:49.660+0000 INFO FeatureReporter [5/367] Completed feature tuple '07242380-532681'
2021-01-25T20:28:53.158+0000 INFO FeatureReporter [6/367] Completed feature tuple '07141000-21215765'
2021-01-25T20:28:54.707+0000 INFO FeatureReporter [7/367] Completed feature tuple '07250974-7753312'
2021-01-25T20:28:56.091+0000 INFO FeatureReporter [8/367] Completed feature tuple '07231000-484578'
2021-01-25T20:28:56.102+0000 INFO FeatureReporter [9/367] Completed feature tuple '07307028-561081'
2021-01-25T20:28:58.467+0000 INFO FeatureReporter [10/367] Completed feature tuple '07105900-1530241'
2021-01-25T20:28:59.600+0000 INFO FeatureReporter [11/367] Completed feature tuple '07332620-8348635'
2021-01-25T20:29:01.210+0000 INFO FeatureReporter [12/367] Completed feature tuple '07176000-833464'
2021-01-25T20:29:01.213+0000 INFO FeatureReporter [13/367] Completed feature tuple '07340300-3746094'
2021-01-25T20:29:01.219+0000 INFO FeatureReporter [14/367] Completed feature tuple '072632982-22845923'
2021-01-25T20:29:10.062+0000 ERROR EvaluationSubscriber While attempting to consume a message with identifier ID:82b6b31b-7e2b-3e50-9ef5-f4d42c53274e and correlation identifier gK27pTbY6nLvNPAgh618qvykjYU in subscriber rTOItL4_n5XDSoHZoAXygr9f4js, encountered an error. The session will now attempt to recover. This is 1 of 4 allowed consumption failures before the subscriber will notify an unrecoverable failure for evaluation gK27pTbY6nLvNPAgh618qvykjYU, unless the subscriber is otherwise marked failed (in which case further retries may not occur). The error is: javax.jms.JMSException: Exception while acknowledging message(s):sender for socket /127.0.0.1:59547-localhost/127.0.0.1:5673 is closed
    at org.apache.qpid.client.AMQSession.toJMSException(AMQSession.java:3787)
    at org.apache.qpid.client.AMQSession.acknowledge(AMQSession.java:544)
    at org.apache.qpid.client.message.AbstractAMQMessageDelegate.acknowledge(AbstractAMQMessageDelegate.java:219)
    at org.apache.qpid.client.message.AbstractJMSMessage.acknowledge(AbstractJMSMessage.java:311)
    at wres.events.subscribe.EvaluationSubscriber.lambda$3(EvaluationSubscriber.java:684)
    at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:758)
    at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:732)
    at org.apache.qpid.client.AMQSession$Dispatcher.notifyConsumer(AMQSession.java:3705)
    at org.apache.qpid.client.AMQSession$Dispatcher.dispatchMessage(AMQSession.java:3637)
    at org.apache.qpid.client.AMQSession$Dispatcher.access$1300(AMQSession.java:3406)
    at org.apache.qpid.client.AMQSession.dispatch(AMQSession.java:3399)
    at org.apache.qpid.client.message.UnprocessedMessage.dispatch(UnprocessedMessage.java:54)
    at org.apache.qpid.client.AMQSession$Dispatcher.run(AMQSession.java:3547)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.qpid.transport.SenderClosedException: sender for socket /127.0.0.1:59547-localhost/127.0.0.1:5673 is closed
    at org.apache.qpid.transport.network.io.IoSender.checkNotAlreadyClosed(IoSender.java:377)
    at org.apache.qpid.transport.network.io.IoSender.send(IoSender.java:122)
    at org.apache.qpid.framing.AMQFrame.writePayload(AMQFrame.java:69)
    at org.apache.qpid.client.AMQProtocolHandler.writeFrame(AMQProtocolHandler.java:575)
    at org.apache.qpid.client.AMQSession_0_8.acknowledgeMessage(AMQSession_0_8.java:172)
    at org.apache.qpid.client.AMQSession_0_8.acknowledgeImpl(AMQSession_0_8.java:144)
    at org.apache.qpid.client.AMQSession.acknowledge(AMQSession.java:539)
    ... 12 more

2021-01-25T20:29:10.062+0000 ERROR EvaluationSubscriber While attempting to recover a session for evaluation gK27pTbY6nLvNPAgh618qvykjYU in subscriber rTOItL4_n5XDSoHZoAXygr9f4js, encountered an error that prevented recovery.
javax.jms.JMSException: Recover failed: Fail-over interrupted suspend/unsuspend channel.
    at org.apache.qpid.client.AMQSession.toJMSException(AMQSession.java:3821)
    at org.apache.qpid.client.AMQSession.recover(AMQSession.java:1914)
    at wres.events.subscribe.EvaluationSubscriber.attemptRetry(EvaluationSubscriber.java:923)
    at wres.events.subscribe.EvaluationSubscriber.recover(EvaluationSubscriber.java:862)
    at wres.events.subscribe.EvaluationSubscriber.lambda$3(EvaluationSubscriber.java:698)
    at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:758)
    at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:732)
    at org.apache.qpid.client.AMQSession$Dispatcher.notifyConsumer(AMQSession.java:3705)
    at org.apache.qpid.client.AMQSession$Dispatcher.dispatchMessage(AMQSession.java:3637)
    at org.apache.qpid.client.AMQSession$Dispatcher.access$1300(AMQSession.java:3406)
    at org.apache.qpid.client.AMQSession.dispatch(AMQSession.java:3399)
    at org.apache.qpid.client.message.UnprocessedMessage.dispatch(UnprocessedMessage.java:54)
    at org.apache.qpid.client.AMQSession$Dispatcher.run(AMQSession.java:3547)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.qpid.QpidException: Fail-over interrupted suspend/unsuspend channel.
    at org.apache.qpid.client.AMQSession.suspendChannel(AMQSession.java:3288)
    at org.apache.qpid.client.AMQSession.recover(AMQSession.java:1884)
    ... 12 common frames omitted
Caused by: org.apache.qpid.client.failover.FailoverException: Failing over about to start
    at org.apache.qpid.client.AMQProtocolHandler.notifyFailoverStarting(AMQProtocolHandler.java:437)
    at org.apache.qpid.client.AMQProtocolHandler$1.run(AMQProtocolHandler.java:286)
    ... 1 common frames omitted
2021-01-25T20:29:14.075+0000 ERROR EvaluationSubscriber While attempting to consume a message with identifier ID:d3ca8edf-8308-38d8-981c-ba08e08a9041 and correlation identifier gK27pTbY6nLvNPAgh618qvykjYU in subscriber rTOItL4_n5XDSoHZoAXygr9f4js, encountered an error. The session will now attempt to recover. This is 2 of 4 allowed consumption failures before the subscriber will notify an unrecoverable failure for evaluation gK27pTbY6nLvNPAgh618qvykjYU, unless the subscriber is otherwise marked failed (in which case further retries may not occur). The error is: javax.jms.JMSException: Exception while acknowledging message(s):sender for socket /127.0.0.1:59547-localhost/127.0.0.1:5673 is closed
    at org.apache.qpid.client.AMQSession.toJMSException(AMQSession.java:3787)
    at org.apache.qpid.client.AMQSession.acknowledge(AMQSession.java:544)
    at org.apache.qpid.client.message.AbstractAMQMessageDelegate.acknowledge(AbstractAMQMessageDelegate.java:219)
    at org.apache.qpid.client.message.AbstractJMSMessage.acknowledge(AbstractJMSMessage.java:311)
    at wres.events.subscribe.EvaluationSubscriber.lambda$3(EvaluationSubscriber.java:684)
    at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:758)
    at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:732)
    at org.apache.qpid.client.AMQSession$Dispatcher.notifyConsumer(AMQSession.java:3705)
    at org.apache.qpid.client.AMQSession$Dispatcher.dispatchMessage(AMQSession.java:3637)
    at org.apache.qpid.client.AMQSession$Dispatcher.access$1300(AMQSession.java:3406)
    at org.apache.qpid.client.AMQSession.dispatch(AMQSession.java:3399)
    at org.apache.qpid.client.message.UnprocessedMessage.dispatch(UnprocessedMessage.java:54)
    at org.apache.qpid.client.AMQSession$Dispatcher.run(AMQSession.java:3547)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.qpid.transport.SenderClosedException: sender for socket /127.0.0.1:59547-localhost/127.0.0.1:5673 is closed
    at org.apache.qpid.transport.network.io.IoSender.checkNotAlreadyClosed(IoSender.java:377)
    at org.apache.qpid.transport.network.io.IoSender.send(IoSender.java:122)
    at org.apache.qpid.framing.AMQFrame.writePayload(AMQFrame.java:69)
    at org.apache.qpid.client.AMQProtocolHandler.writeFrame(AMQProtocolHandler.java:575)
    at org.apache.qpid.client.AMQSession_0_8.acknowledgeMessage(AMQSession_0_8.java:172)
    at org.apache.qpid.client.AMQSession_0_8.acknowledgeImpl(AMQSession_0_8.java:144)
    at org.apache.qpid.client.AMQSession.acknowledge(AMQSession.java:539)
    ... 12 more

2021-01-25T20:29:14.081+0000 ERROR EvaluationSubscriber While attempting to recover a session for evaluation gK27pTbY6nLvNPAgh618qvykjYU in subscriber rTOItL4_n5XDSoHZoAXygr9f4js, encountered an error that prevented recovery.
javax.jms.JMSException: Recovery was interrupted by fail-over. Recovery status is not known.
    at org.apache.qpid.client.AMQSession.recover(AMQSession.java:1918)
    at wres.events.subscribe.EvaluationSubscriber.attemptRetry(EvaluationSubscriber.java:923)
    at wres.events.subscribe.EvaluationSubscriber.recover(EvaluationSubscriber.java:862)
    at wres.events.subscribe.EvaluationSubscriber.lambda$3(EvaluationSubscriber.java:698)
    at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:758)
    at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:732)
    at org.apache.qpid.client.AMQSession$Dispatcher.notifyConsumer(AMQSession.java:3705)
    at org.apache.qpid.client.AMQSession$Dispatcher.dispatchMessage(AMQSession.java:3637)
    at org.apache.qpid.client.AMQSession$Dispatcher.access$1300(AMQSession.java:3406)
    at org.apache.qpid.client.AMQSession.dispatch(AMQSession.java:3399)
    at org.apache.qpid.client.message.UnprocessedMessage.dispatch(UnprocessedMessage.java:54)
    at org.apache.qpid.client.AMQSession$Dispatcher.run(AMQSession.java:3547)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.qpid.client.failover.FailoverException: Failing over about to start
    at org.apache.qpid.client.AMQProtocolHandler.notifyFailoverStarting(AMQProtocolHandler.java:437)
    at org.apache.qpid.client.AMQProtocolHandler$1.run(AMQProtocolHandler.java:286)
    ... 1 common frames omitted
2021-01-25T20:29:27.217+0000 ERROR EvaluationSubscriber While attempting to consume a message with identifier ID:69578fec-c95b-3220-9930-c88bbabe494c and correlation identifier gK27pTbY6nLvNPAgh618qvykjYU in subscriber rTOItL4_n5XDSoHZoAXygr9f4js, encountered an error. The session will now attempt to recover. This is 3 of 4 allowed consumption failures before the subscriber will notify an unrecoverable failure for evaluation gK27pTbY6nLvNPAgh618qvykjYU, unless the subscriber is otherwise marked failed (in which case further retries may not occur). The error is: javax.jms.IllegalStateException: has failed over
    at org.apache.qpid.client.AMQSession.acknowledge(AMQSession.java:534)
    at org.apache.qpid.client.message.AbstractAMQMessageDelegate.acknowledge(AbstractAMQMessageDelegate.java:219)
    at org.apache.qpid.client.message.AbstractJMSMessage.acknowledge(AbstractJMSMessage.java:311)
    at wres.events.subscribe.EvaluationSubscriber.lambda$3(EvaluationSubscriber.java:684)
    at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:758)
    at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:732)
    at org.apache.qpid.client.AMQSession$Dispatcher.notifyConsumer(AMQSession.java:3705)
    at org.apache.qpid.client.AMQSession$Dispatcher.dispatchMessage(AMQSession.java:3637)
    at org.apache.qpid.client.AMQSession$Dispatcher.access$1300(AMQSession.java:3406)
    at org.apache.qpid.client.AMQSession.dispatch(AMQSession.java:3399)
    at org.apache.qpid.client.message.UnprocessedMessage.dispatch(UnprocessedMessage.java:54)
    at org.apache.qpid.client.AMQSession$Dispatcher.run(AMQSession.java:3547)
    at java.base/java.lang.Thread.run(Thread.java:834)

2021-01-25T20:32:41.768+0000 INFO FeatureReporter [15/367] Completed feature tuple '07301500-13756121'
2021-01-25T20:33:11.760+0000 INFO FeatureReporter [16/367] Completed feature tuple '07105940-1531241'
2021-01-25T20:33:21.818+0000 INFO FeatureReporter [17/367] Completed feature tuple '07142575-21194132'
2021-01-25T20:36:12.230+0000 INFO FeatureReporter [18/367] Completed feature tuple '07227100-20017399'
2021-01-25T20:36:42.408+0000 INFO FeatureReporter [19/367] Completed feature tuple '07146800-20081438'
2021-01-25T20:37:12.442+0000 INFO FeatureReporter [20/367] Completed feature tuple '07338500-618952'
2021-01-25T20:39:42.676+0000 INFO FeatureReporter [21/367] Completed feature tuple '07183500-20876216'
2021-01-25T20:40:12.898+0000 INFO FeatureReporter [22/367] Completed feature tuple '07241000-393414'
2021-01-25T20:40:43.092+0000 INFO FeatureReporter [23/367] Completed feature tuple '07185095-7583113'
2021-01-25T20:43:13.215+0000 INFO FeatureReporter [24/367] Completed feature tuple '07335300-707434'
2021-01-25T20:43:43.460+0000 INFO FeatureReporter [25/367] Completed feature tuple '07227420-20001250'
2021-01-25T20:44:13.749+0000 INFO FeatureReporter [26/367] Completed feature tuple '07194906-397814'
2021-01-25T20:46:43.742+0000 INFO FeatureReporter [27/367] Completed feature tuple '07257693-7769385'
2021-01-25T20:46:44.238+0000 WARN FeatureReporter [28/367] Completed feature tuple '07312330-13733850', but no statistics were produced. This probably occurred because no pools contained valid pairs.
2021-01-25T20:47:14.043+0000 INFO FeatureReporter [29/367] Completed feature tuple '07263295-22846153'
2021-01-25T20:47:44.267+0000 INFO FeatureReporter [30/367] Completed feature tuple '07331383-430546'
2021-01-25T20:50:14.811+0000 INFO FeatureReporter [31/367] Completed feature tuple '07195400-398128'
2021-01-25T20:50:44.847+0000 INFO FeatureReporter [32/367] Completed feature tuple '07247500-6049306'
2021-01-25T20:51:14.925+0000 INFO FeatureReporter [33/367] Completed feature tuple '07126485-934978'
2021-01-25T20:53:45.371+0000 INFO FeatureReporter [34/367] Completed feature tuple '07184000-20874950'
2021-01-25T20:54:15.471+0000 INFO FeatureReporter [35/367] Completed feature tuple '07231500-6030344'
.
.
.
2021-01-25T21:33:22.036+0000 INFO FeatureReporter [70/367] Completed feature tuple '07247015-6047872'
.
.
.

Exceptions were encountered during the failover (I will need to look at those), but it eventually fails over and resumes processing.

However, the features went from several completed per second before the failover to roughly one every PT30S afterwards. Latterly, it is almost precisely one feature every PT30S, which seems suspicious.

Anyway, I will let it run to completion, but the slowdown needs to be explained. It isn't related to reading messages from disk; it's something else. At this rate, it will take another ~PT150M to complete, so there's some stuff to investigate.
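For context on the "N of 4 allowed consumption failures" messages in the log above, the subscriber tolerates a fixed number of consumption failures per evaluation, attempting session recovery each time, before declaring the evaluation unrecoverable. A minimal sketch of that bounded-retry behavior follows; the names here are hypothetical and do not match the wres.events.subscribe API.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Illustration only: a per-evaluation retry budget for consumption failures,
 * sketching the bounded retry behavior reported in the subscriber logs.
 */
public class ConsumptionRetryPolicy
{
    private final int maximumRetries;
    private final AtomicInteger failureCount = new AtomicInteger();

    public ConsumptionRetryPolicy( int maximumRetries )
    {
        this.maximumRetries = maximumRetries;
    }

    /**
     * Registers a consumption failure. Returns true if session recovery should
     * be attempted, false if the evaluation should be marked unrecoverable.
     */
    public boolean onConsumptionFailure( Exception error )
    {
        int count = this.failureCount.incrementAndGet();
        if ( count <= this.maximumRetries )
        {
            // Corresponds to: "This is N of 4 allowed consumption failures..."
            System.out.println( "Consumption failure " + count + " of " + this.maximumRetries
                                + " allowed; attempting session recovery: " + error.getMessage() );
            return true;
        }
        System.out.println( "Exceeded " + this.maximumRetries
                            + " allowed failures; marking the evaluation unrecoverable." );
        return false;
    }

    public static void main( String[] args )
    {
        ConsumptionRetryPolicy policy = new ConsumptionRetryPolicy( 4 );
        for ( int i = 0; i < 5; i++ )
        {
            policy.onConsumptionFailure( new IllegalStateException( "sender for socket is closed" ) );
        }
    }
}
```

One open question this sketch makes visible: recovery attempts that themselves fail (as in the "Fail-over interrupted suspend/unsuspend channel" errors above) still consume the retry budget, so a slow broker restart could exhaust it even though the underlying condition is transient.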

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-25T21:36:52Z


Logs of the failover from the graphics client's point of view:

2021-01-25T20:16:35.973+0000 INFO GraphicsClient WRES Graphics Client version is unknown, probably developer version.
.
.
.
2021-01-25T20:16:36.022+0000 WARN BrokerConnectionFactory Probing for an active AMQP broker at the binding URL amqp://guest:guest@wres-graphics/?brokerlist='tcp://localhost:5673'&rejectbehaviour='server'&retries='5'&connectdelay='5000'. This may take some time if no active broker exists and retries are configured. If no active broker is discovered, an embedded broker will be started.
2021-01-25T20:16:36.435+0000 INFO BrokerConnectionFactory Created a connection factory with name connectionfactory.statisticsFactory and binding URL amqp://guest:guest@wres-graphics/?brokerlist='tcp://localhost:5673'&rejectbehaviour='server'&retries='5'&connectdelay='5000'.
2021-01-25T20:16:36.436+0000 INFO GraphicsClient Creating WRES Graphics Client...
2021-01-25T20:16:36.712+0000 INFO EvaluationSubscriber Building a subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo to listen for evaluation messages...
2021-01-25T20:16:37.300+0000 INFO EvaluationSubscriber Started the connections for subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo...
2021-01-25T20:16:37.303+0000 INFO EvaluationSubscriber The subscriber will write outputs to the directory specified by the java.io.tmpdir system property, which is <SNIP>/test
2021-01-25T20:16:37.303+0000 INFO EvaluationSubscriber Created subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo
2021-01-25T20:16:37.304+0000 INFO GraphicsClient Finished creating WRES Graphics Client with subscriber identifier bD9H1hyP6rK4C35JRfDV4rwsPuo.
2021-01-25T20:16:37.306+0000 INFO GraphicsClient WRES Graphics client bD9H1hyP6rK4C35JRfDV4rwsPuo is running.
2021-01-25T20:16:37.307+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 0 packets of statistics across 0 evaluations.
2021-01-25T20:17:01.741+0000 INFO EvaluationConsumer Consumer bD9H1hyP6rK4C35JRfDV4rwsPuo opened evaluation gK27pTbY6nLvNPAgh618qvykjYU, which is ready to consume messages.
2021-01-25T20:17:37.311+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 0 packets of statistics across 1 evaluations.
2021-01-25T20:18:37.314+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 0 packets of statistics across 1 evaluations.
2021-01-25T20:19:37.320+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 0 packets of statistics across 1 evaluations.
2021-01-25T20:20:37.334+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 0 packets of statistics across 1 evaluations.
2021-01-25T20:21:37.338+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 0 packets of statistics across 1 evaluations.
2021-01-25T20:22:37.347+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 0 packets of statistics across 1 evaluations.
2021-01-25T20:23:37.359+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 0 packets of statistics across 1 evaluations.
2021-01-25T20:24:37.367+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 0 packets of statistics across 1 evaluations.
2021-01-25T20:25:37.372+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 0 packets of statistics across 1 evaluations.
2021-01-25T20:26:37.384+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 0 packets of statistics across 1 evaluations.
2021-01-25T20:27:37.394+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 0 packets of statistics across 1 evaluations.
2021-01-25T20:28:37.403+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 0 packets of statistics across 1 evaluations.
2021-01-25T20:29:09.393+0000 ERROR EvaluationSubscriber While attempting to consume a message with identifier ID:82b6b31b-7e2b-3e50-9ef5-f4d42c53274e and correlation identifier gK27pTbY6nLvNPAgh618qvykjYU in subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo, encountered an error. The session will now attempt to recover. This is 1 of 4 allowed consumption failures before the subscriber will notify an unrecoverable failure for evaluation gK27pTbY6nLvNPAgh618qvykjYU, unless the subscriber is otherwise marked failed (in which case further retries may not occur). The error is: javax.jms.JMSException: Exception while acknowledging message(s):sender for socket /127.0.0.1:59511-localhost/127.0.0.1:5673 is closed
    at org.apache.qpid.client.AMQSession.toJMSException(AMQSession.java:3787)
    at org.apache.qpid.client.AMQSession.acknowledge(AMQSession.java:544)
    at org.apache.qpid.client.message.AbstractAMQMessageDelegate.acknowledge(AbstractAMQMessageDelegate.java:219)
    at org.apache.qpid.client.message.AbstractJMSMessage.acknowledge(AbstractJMSMessage.java:311)
    at wres.events.subscribe.EvaluationSubscriber.lambda$3(EvaluationSubscriber.java:684)
    at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:758)
    at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:732)
    at org.apache.qpid.client.AMQSession$Dispatcher.notifyConsumer(AMQSession.java:3705)
    at org.apache.qpid.client.AMQSession$Dispatcher.dispatchMessage(AMQSession.java:3637)
    at org.apache.qpid.client.AMQSession$Dispatcher.access$1300(AMQSession.java:3406)
    at org.apache.qpid.client.AMQSession.dispatch(AMQSession.java:3399)
    at org.apache.qpid.client.message.UnprocessedMessage.dispatch(UnprocessedMessage.java:54)
    at org.apache.qpid.client.AMQSession$Dispatcher.run(AMQSession.java:3547)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.qpid.transport.SenderClosedException: sender for socket /127.0.0.1:59511-localhost/127.0.0.1:5673 is closed
    at org.apache.qpid.transport.network.io.IoSender.checkNotAlreadyClosed(IoSender.java:377)
    at org.apache.qpid.transport.network.io.IoSender.send(IoSender.java:122)
    at org.apache.qpid.framing.AMQFrame.writePayload(AMQFrame.java:69)
    at org.apache.qpid.client.AMQProtocolHandler.writeFrame(AMQProtocolHandler.java:575)
    at org.apache.qpid.client.AMQSession_0_8.acknowledgeMessage(AMQSession_0_8.java:172)
    at org.apache.qpid.client.AMQSession_0_8.acknowledgeImpl(AMQSession_0_8.java:144)
    at org.apache.qpid.client.AMQSession.acknowledge(AMQSession.java:539)
    ... 12 more

2021-01-25T20:29:09.393+0000 ERROR EvaluationSubscriber While attempting to recover a session for evaluation gK27pTbY6nLvNPAgh618qvykjYU in subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo, encountered an error that prevented recovery.
javax.jms.JMSException: Recover failed: Fail-over interrupted suspend/unsuspend channel.
    at org.apache.qpid.client.AMQSession.toJMSException(AMQSession.java:3821)
    at org.apache.qpid.client.AMQSession.recover(AMQSession.java:1914)
    at wres.events.subscribe.EvaluationSubscriber.attemptRetry(EvaluationSubscriber.java:923)
    at wres.events.subscribe.EvaluationSubscriber.recover(EvaluationSubscriber.java:862)
    at wres.events.subscribe.EvaluationSubscriber.lambda$3(EvaluationSubscriber.java:698)
    at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:758)
    at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:732)
    at org.apache.qpid.client.AMQSession$Dispatcher.notifyConsumer(AMQSession.java:3705)
    at org.apache.qpid.client.AMQSession$Dispatcher.dispatchMessage(AMQSession.java:3637)
    at org.apache.qpid.client.AMQSession$Dispatcher.access$1300(AMQSession.java:3406)
    at org.apache.qpid.client.AMQSession.dispatch(AMQSession.java:3399)
    at org.apache.qpid.client.message.UnprocessedMessage.dispatch(UnprocessedMessage.java:54)
    at org.apache.qpid.client.AMQSession$Dispatcher.run(AMQSession.java:3547)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.qpid.QpidException: Fail-over interrupted suspend/unsuspend channel.
    at org.apache.qpid.client.AMQSession.suspendChannel(AMQSession.java:3288)
    at org.apache.qpid.client.AMQSession.recover(AMQSession.java:1884)
    ... 12 common frames omitted
Caused by: org.apache.qpid.client.failover.FailoverException: Failing over about to start
    at org.apache.qpid.client.AMQProtocolHandler.notifyFailoverStarting(AMQProtocolHandler.java:437)
    at org.apache.qpid.client.AMQProtocolHandler$1.run(AMQProtocolHandler.java:286)
    ... 1 common frames omitted
2021-01-25T20:29:13.396+0000 ERROR EvaluationSubscriber While attempting to consume a message with identifier ID:d3ca8edf-8308-38d8-981c-ba08e08a9041 and correlation identifier gK27pTbY6nLvNPAgh618qvykjYU in subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo, encountered an error. The session will now attempt to recover. This is 2 of 4 allowed consumption failures before the subscriber will notify an unrecoverable failure for evaluation gK27pTbY6nLvNPAgh618qvykjYU, unless the subscriber is otherwise marked failed (in which case further retries may not occur). The error is: javax.jms.JMSException: Exception while acknowledging message(s):sender for socket /127.0.0.1:59511-localhost/127.0.0.1:5673 is closed
    at org.apache.qpid.client.AMQSession.toJMSException(AMQSession.java:3787)
    at org.apache.qpid.client.AMQSession.acknowledge(AMQSession.java:544)
    at org.apache.qpid.client.message.AbstractAMQMessageDelegate.acknowledge(AbstractAMQMessageDelegate.java:219)
    at org.apache.qpid.client.message.AbstractJMSMessage.acknowledge(AbstractJMSMessage.java:311)
    at wres.events.subscribe.EvaluationSubscriber.lambda$3(EvaluationSubscriber.java:684)
    at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:758)
    at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:732)
    at org.apache.qpid.client.AMQSession$Dispatcher.notifyConsumer(AMQSession.java:3705)
    at org.apache.qpid.client.AMQSession$Dispatcher.dispatchMessage(AMQSession.java:3637)
    at org.apache.qpid.client.AMQSession$Dispatcher.access$1300(AMQSession.java:3406)
    at org.apache.qpid.client.AMQSession.dispatch(AMQSession.java:3399)
    at org.apache.qpid.client.message.UnprocessedMessage.dispatch(UnprocessedMessage.java:54)
    at org.apache.qpid.client.AMQSession$Dispatcher.run(AMQSession.java:3547)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.qpid.transport.SenderClosedException: sender for socket /127.0.0.1:59511-localhost/127.0.0.1:5673 is closed
    at org.apache.qpid.transport.network.io.IoSender.checkNotAlreadyClosed(IoSender.java:377)
    at org.apache.qpid.transport.network.io.IoSender.send(IoSender.java:122)
    at org.apache.qpid.framing.AMQFrame.writePayload(AMQFrame.java:69)
    at org.apache.qpid.client.AMQProtocolHandler.writeFrame(AMQProtocolHandler.java:575)
    at org.apache.qpid.client.AMQSession_0_8.acknowledgeMessage(AMQSession_0_8.java:172)
    at org.apache.qpid.client.AMQSession_0_8.acknowledgeImpl(AMQSession_0_8.java:144)
    at org.apache.qpid.client.AMQSession.acknowledge(AMQSession.java:539)
    ... 12 more

2021-01-25T20:29:13.397+0000 ERROR EvaluationSubscriber While attempting to recover a session for evaluation gK27pTbY6nLvNPAgh618qvykjYU in subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo, encountered an error that prevented recovery.
javax.jms.JMSException: Recovery was interrupted by fail-over. Recovery status is not known.
    at org.apache.qpid.client.AMQSession.recover(AMQSession.java:1918)
    at wres.events.subscribe.EvaluationSubscriber.attemptRetry(EvaluationSubscriber.java:923)
    at wres.events.subscribe.EvaluationSubscriber.recover(EvaluationSubscriber.java:862)
    at wres.events.subscribe.EvaluationSubscriber.lambda$3(EvaluationSubscriber.java:698)
    at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:758)
    at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:732)
    at org.apache.qpid.client.AMQSession$Dispatcher.notifyConsumer(AMQSession.java:3705)
    at org.apache.qpid.client.AMQSession$Dispatcher.dispatchMessage(AMQSession.java:3637)
    at org.apache.qpid.client.AMQSession$Dispatcher.access$1300(AMQSession.java:3406)
    at org.apache.qpid.client.AMQSession.dispatch(AMQSession.java:3399)
    at org.apache.qpid.client.message.UnprocessedMessage.dispatch(UnprocessedMessage.java:54)
    at org.apache.qpid.client.AMQSession$Dispatcher.run(AMQSession.java:3547)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.qpid.client.failover.FailoverException: Failing over about to start
    at org.apache.qpid.client.AMQProtocolHandler.notifyFailoverStarting(AMQProtocolHandler.java:437)
    at org.apache.qpid.client.AMQProtocolHandler$1.run(AMQProtocolHandler.java:286)
    ... 1 common frames omitted
2021-01-25T20:29:26.481+0000 ERROR EvaluationSubscriber While attempting to consume a message with identifier ID:69578fec-c95b-3220-9930-c88bbabe494c and correlation identifier gK27pTbY6nLvNPAgh618qvykjYU in subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo, encountered an error. The session will now attempt to recover. This is 3 of 4 allowed consumption failures before the subscriber will notify an unrecoverable failure for evaluation gK27pTbY6nLvNPAgh618qvykjYU, unless the subscriber is otherwise marked failed (in which case further retries may not occur). The error is: javax.jms.IllegalStateException: has failed over
    at org.apache.qpid.client.AMQSession.acknowledge(AMQSession.java:534)
    at org.apache.qpid.client.message.AbstractAMQMessageDelegate.acknowledge(AbstractAMQMessageDelegate.java:219)
    at org.apache.qpid.client.message.AbstractJMSMessage.acknowledge(AbstractJMSMessage.java:311)
    at wres.events.subscribe.EvaluationSubscriber.lambda$3(EvaluationSubscriber.java:684)
    at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:758)
    at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:732)
    at org.apache.qpid.client.AMQSession$Dispatcher.notifyConsumer(AMQSession.java:3705)
    at org.apache.qpid.client.AMQSession$Dispatcher.dispatchMessage(AMQSession.java:3637)
    at org.apache.qpid.client.AMQSession$Dispatcher.access$1300(AMQSession.java:3406)
    at org.apache.qpid.client.AMQSession.dispatch(AMQSession.java:3399)
    at org.apache.qpid.client.message.UnprocessedMessage.dispatch(UnprocessedMessage.java:54)
    at org.apache.qpid.client.AMQSession$Dispatcher.run(AMQSession.java:3547)
    at java.base/java.lang.Thread.run(Thread.java:834)

2021-01-25T20:29:37.411+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 493 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:afb01a22-369f-3f48-96be-c1c876314f2a.
2021-01-25T20:30:37.426+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 531 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:b812c8b0-e1a9-36d3-bdb0-89aa857fa501.
2021-01-25T20:31:37.435+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 549 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:c087de6b-358e-3e9e-b933-aaae7321dfae.
2021-01-25T20:32:37.450+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 567 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:748bb82d-504d-3b69-8478-058e2b6973ca.
2021-01-25T20:33:37.461+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 610 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:7893e1ca-bbed-39f2-8b3c-b1cfd64588da.
2021-01-25T20:34:37.471+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 646 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:436e7b58-2aec-386b-ba3e-903a5adde275.
2021-01-25T20:35:37.477+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 664 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:945cdca4-ae78-3a39-a7a4-7ab984c29250.
2021-01-25T20:36:37.477+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 692 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:6ba093dd-a738-3ebd-bcc6-1503396e486b.
2021-01-25T20:37:37.489+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 740 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:08e293fe-2138-3091-86cb-39a7f22d0cba.
2021-01-25T20:38:37.497+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 763 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:e3ab6a58-78ce-3557-a3ef-105735dc44e7.
2021-01-25T20:39:37.499+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 781 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:5b4ff1ee-1f03-34f9-bf79-82efdf0f27b8.
2021-01-25T20:40:37.502+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 824 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:d75b131b-7067-33ec-a7ff-2e6d9112ad44.
2021-01-25T20:41:37.515+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 862 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:c27f032b-4888-3206-9a8d-3bdfab593718.
2021-01-25T20:42:37.528+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 880 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:cc48a43d-cc0f-399b-9ec4-e41ee80895d9.
2021-01-25T20:43:37.536+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 908 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:6b56d5e5-351a-3245-8265-8b1778e598a2.
2021-01-25T20:44:37.541+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 956 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:2ceb5224-d126-35fa-a90b-8e6a1edb74f8.
2021-01-25T20:45:37.545+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 979 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:4327b8a9-25ae-3252-8e27-92cea552ec89.
2021-01-25T20:46:37.548+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 997 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:e339b89f-764a-3bd6-b079-ed8397988f50.
2021-01-25T20:47:37.550+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1040 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:6b208ec7-3195-3801-a614-9fddc74c01ac.
2021-01-25T20:48:37.555+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1078 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:3d2ac94a-9254-3c6f-9eea-0f3d4a3d4b41.
2021-01-25T20:49:37.570+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1096 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:9eca53ba-94e2-3875-a405-7df69f084b42.
2021-01-25T20:50:37.571+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1124 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:f004ead7-2016-3d5d-b200-07a2726606c0.
2021-01-25T20:51:37.576+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1172 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:533fef04-96ba-3f5f-8aa5-0f2d37c5acde.
2021-01-25T20:52:37.583+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1195 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:4d11121c-0343-3356-b6cf-b12c6cf1c392.
2021-01-25T20:53:37.587+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1213 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:884878a2-7ea1-35e6-acaa-96b468c86712.
2021-01-25T20:54:37.595+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1256 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:06f222e3-74da-3da7-8017-ceef03c429b7.
2021-01-25T20:55:37.600+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1294 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:a1fe11d4-4618-3151-af1e-408d3e11d2c6.
2021-01-25T20:56:37.607+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1312 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:49ed1d05-037e-3a01-81f9-632a482f9179.
2021-01-25T20:57:37.618+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1340 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:8b181b45-7da5-360c-8c2e-1877c275129c.
2021-01-25T20:58:37.622+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1388 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:353b7550-3d5f-3f93-981e-2f2599c804bc.
2021-01-25T20:59:37.630+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1411 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:35db8db4-12f3-3e74-9c19-16d6237cab12.
2021-01-25T21:00:37.633+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1429 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:02938871-9d44-38e8-9404-08297d9e5f39.
2021-01-25T21:01:37.641+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1472 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:3d98dd24-98e1-3d12-98a6-d781bd7d5d3c.
2021-01-25T21:02:37.643+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1510 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:a3b4e680-5c38-37f8-83e7-11261bed5427.
2021-01-25T21:03:37.650+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1528 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:c419fecf-bca3-3dde-bb2f-59be0b8d7a4a.
2021-01-25T21:04:37.655+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1556 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:895fac82-3843-3b0f-8d68-b10cd1769738.
2021-01-25T21:05:37.668+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1597 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:e8098b14-58b7-3019-95ec-096f449c074b.
2021-01-25T21:06:37.680+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1625 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:9fa1359d-8183-3f86-970f-5b3689b03a32.
2021-01-25T21:07:37.684+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1643 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:fd426093-09e8-32e0-9250-d602c093c87b.
2021-01-25T21:08:37.693+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1680 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:9c159753-d464-3deb-ac97-2e5e129e7257.
2021-01-25T21:09:37.696+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1723 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:e6120edc-679a-3397-9288-95d7dea18c91.
2021-01-25T21:10:37.700+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1741 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:a784e527-2a85-36b9-87c9-6a420f38555f.
2021-01-25T21:11:37.709+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1764 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:23a28b9c-1f58-35d5-a5d8-9fe72d807899.
2021-01-25T21:12:37.714+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1812 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:94c1314f-85de-3c8e-a740-f5f0cd4e5b41.
2021-01-25T21:13:37.729+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1840 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:e01d63a9-21a6-37ef-9389-4319cd677d1d.
2021-01-25T21:14:37.736+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1858 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:5dcc798a-b3b0-3fd8-b2ad-9323556a9a7c.
2021-01-25T21:15:37.748+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1896 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:0a67a74c-955b-38ce-b19e-d70099433b63.
2021-01-25T21:16:37.752+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1939 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:2a492a3e-9d88-3a4a-8a01-e0a19555a64b.
2021-01-25T21:17:37.767+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1957 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:5818b09f-d09f-3501-8bce-f438c5c37274.
2021-01-25T21:18:37.781+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 1980 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:75ad8079-f22f-3aa6-987e-ca60e9b724ea.
2021-01-25T21:19:37.781+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 2028 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:69d2ab1e-f517-3950-a4f0-6a5ea7907e82.
2021-01-25T21:20:37.788+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 2056 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:692e3eb9-c819-38f8-ae80-94a7f6069d06.
2021-01-25T21:21:37.793+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 2074 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:16693bd9-f092-3428-b0e2-b58908637131.
2021-01-25T21:22:37.806+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 2112 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:b8701a0b-131c-346f-a4f4-8dbc199ec243.
2021-01-25T21:23:37.806+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 2155 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:cc50e3f5-9e1e-315a-b073-6716dae20bcf.
2021-01-25T21:24:37.807+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 2173 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:50492f4c-142b-3b12-827f-2534af8ef555.
2021-01-25T21:25:37.812+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 2196 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:438307b5-9a30-3799-a05d-968ca6949819.
2021-01-25T21:26:37.818+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 2244 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:1b326996-9d42-36cd-8144-a37c9a4aeec4.
2021-01-25T21:27:37.829+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 2272 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:3153862e-07a2-3866-89fc-cf7359e9a118.
2021-01-25T21:28:37.830+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 2290 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:d9609105-5eaa-3eed-8e74-29bd1f1335ad.
2021-01-25T21:29:37.843+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 2328 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:e3bc4829-f939-32fc-bafa-4e12210edade.
2021-01-25T21:30:37.857+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 2371 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:4eec1236-ee64-3e0e-9abc-2f9528a91626.
2021-01-25T21:31:37.865+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 2389 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:d30c8d9b-b3e3-3816-9410-5d1f0026e3fb.
2021-01-25T21:32:37.871+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 2412 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:bb45b3f8-5c64-35ef-bcd4-3c7a1507842c.
2021-01-25T21:33:37.880+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 2460 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:170f4423-5daa-336d-b301-6fce8d85000e.
2021-01-25T21:34:37.881+0000 INFO GraphicsClient Evaluation subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo is waiting for work. Until now, received 2488 packets of statistics across 1 evaluations. The most recent evaluation was gK27pTbY6nLvNPAgh618qvykjYU and the most recent statistics were attached to message ID:11b090ec-558d-3142-8cd6-014f74cc686c.
.
.
.
epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-25T22:01:54Z


On closer inspection, the delay between features is on a cycle of (PT30S,PT30S,PT150S), so there is definitely something artificial happening there. As of now, it looks more like PT320M to completion. Will still wait for completion before debugging, though; I want evidence of a successful failover.
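As a quick sanity check on the cycle above, the three inter-feature delays are ISO-8601 durations and can be summed with `java.time.Duration`. This is a hypothetical helper, not part of the WRES codebase; the duration strings are taken from the observed cycle.

```java
import java.time.Duration;

// Hypothetical helper: sum a sequence of ISO-8601 duration strings, e.g. the
// observed inter-feature delay cycle of (PT30S, PT30S, PT150S).
public class CycleLength {
    public static Duration cycle(String... isoDurations) {
        Duration total = Duration.ZERO;
        for (String d : isoDurations) {
            total = total.plus(Duration.parse(d));
        }
        return total;
    }
}
```

One cycle of (PT30S, PT30S, PT150S) sums to PT3M30S, so the artificial pacing adds three and a half minutes per cycle of features.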

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-25T23:56:56Z


Still chugging away...

2021-01-25T23:55:34.404+0000 INFO FeatureReporter [192/367] Completed feature tuple '07332390-698694'

Glacially slow, but will debug that if/when the evaluation finally succeeds.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-26T12:17:48Z


.
.
.
2021-01-26T03:26:39.734+0000 INFO FeatureReporter [367/367] Completed feature tuple '07316500-3141592'
2021-01-26T03:26:39.735+0000 INFO Evaluation Publication of messages to evaluation gK27pTbY6nLvNPAgh618qvykjYU has been marked complete. No further messages may be published to this evaluation. Upon completion, 1 evaluation description message, 12916 statistics messages, 0 pairs messages and 13279 evaluation status messages were published to this evaluation.
2021-01-26T03:26:39.735+0000 INFO EvaluationStatusTracker Messaging client IDoqFP9c2nRsQnqMUb10_C3hhcM notified PUBLICATION_COMPLETE_REPORTED_SUCCESS for evaluation gK27pTbY6nLvNPAgh618qvykjYU.
2021-01-26T03:26:39.757+0000 INFO FeatureReporter The following feature tuples succeeded: ( 07138063-941030370, 07314500-13660637, 07261090-7817578, 07126415-935058, 07242380-532681, 07141000-21215765, 07250974-7753312, 07231000-484578, 07307028-561081, 07105900-1530241, 07332620-8348635, 07176000-833464, 07340300-3746094, 072632982-22845923, 07301500-13756121, 07105940-1531241, 07142575-21194132, 07227100-20017399, 07146800-20081438, 07338500-618952, 07183500-20876216, 07241000-393414, 07185095-7583113, 07335300-707434, 07227420-20001250, 07194906-397814, 07257693-7769385, 07263295-22846153, 07331383-430546, 07195400-398128, 07247500-6049306, 07126485-934978, 07184000-20874950, 07231500-6030344, 07094500-920981, 07144550-21165603, 07189100-7600551, 07227000-20031587, 07142019-21187401, 07099060-950343, 07188000-7592691, 07260000-7805591, 07182200-20907947, 07105600-1529805, 07328180-19957106, 07315700-3132401, 07105000-1529743, 07240000-254998, 07141220-22079907, 07144100-21177172, 07119500-981799, 07339000-632727, 07187600-7591093, 07143375-22077857, 07148111-21037175, 07229200-6026378, 07189000-7601075, 07191000-21770039, 07191222-21770993, 07182390-20931078, 07103990-1529539, 07106500-1532253, 07188653-7600617, 07301420-13754931, 07186055-7590877, 07239500-391658, 07311700-13526822, 07247015-6047872, 07160500-251482, 07121500-975692, 07126200-938902, 07337000-9544237, 07096250-947431, 07103797-1529459, 07179750-20930726, 07139000-21251514, 07211500-20050371, 07144200-21177332, 07325840-682279, 07324200-3142490, 07316000-425086, 07185765-7590273, 07189540-7583211, 07216500-20044782, 07140850-21226763, 07145500-21145240, 07252000-7752938, 07148400-21028210, 07221500-20052023, 07144490-21165001, 07165565-367999, 07167500-21517052, 07179500-20929412, 07308200-14003839, 07229050-6026308, 07239450-391180, 07305000-560595, 07191160-21770953, 07239300-390946, 07239700-391680, 07303000-559997, 07324400-678895, 07165570-371305, 07299670-13741403, 07152500-20969524, 
07134100-992915, 07227500-19984864, 07325800-683399, 07100750-1529633, 07314900-13660341, 07312100-13732942, 07106000-1530529, 07144780-21160115, 07247000-6048196, 07174400-21784444, 07258500-7803109, 07138020-21250278, 07104000-1529569, 07169500-21514004, 07133000-1596083, 07215500-20044778, 07156900-941040164, 07228500-6024782, 07159750-254068, 07105500-1529755, 07260500-7803469, 07143672-21176300, 072632962-22846151, 07141200-21226759, 07194800-398478, 07260990-7816620, 07309435-673192, 07308500-13971012, 07091200-918597, 07161450-496568, 07143330-22077735, 07208500-20060156, 07238000-390116, 07178000-847304, 07299540-13956737, 07138000-21251316, 07338750-630159, 07176950-847956, 07140900-21215289, 07241520-414275, 07257006-7766271, 07187000-7590701, 07157500-21086452, 07197000-400496, 07312200-13730379, 07334000-707416, 07195855-397590, 07164600-367965, 07261000-7817394, 07183000-20897425, 07226500-20026105, 07103980-1531159, 07108900-952325, 07325000-679877, 07311900-13733388, 07138070-21251336, 07197360-401366, 07177500-847292, 07142680-22076543, 07188838-7600673, 07191179-21771041, 07152000-20988263, 07160000-255270, 072632966-22848005, 07195430-398430, 07081200-916811, 07229300-6027600, 07185090-7583093, 07332500-700694, 07241550-414203, 07099050-947577, 07124410-937330, 07329852-19960370, 07153000-20972352, 07228000-19972126, 07104905-1529719, 07128500-933650, 07227890-19980288, 07145200-21152052, 07195800-399452, 07087050-917441, 07297910-13925051, 07158000-21047070, 07332390-698694, 07249985-1539671, 07141780-21199362, 07105530-1529807, 07116500-1423102, 07147800-21135947, 07170500-21796329, 07250965-7753306, 07299890-13546725, 07165562-368019, 07182510-20890375, 07141300-22080869, 07185910-7590133, 07134180-1601209, 07182250-20906985, 07124000-1594283, 07331300-430036, 07207500-20058852, 07250085-1543899, 07257500-7770981, 07191288-21770855, 07159100-255558, 07261200-7818014, 07126300-938894, 07301410-13754363, 07144300-21165561, 07195865-397706, 
07243500-536835, 07151000-21002333, 07099238-950441, 072632971-22846055, 07099970-951131, 07083000-916821, 384037104472001-1531239, 07235000-13812194, 07198000-402478, 07195000-399500, 07105490-1529787, 07263300-22848029, 07312700-13732288, 07180500-20920709, 07234000-13853161, 07237500-388800, 07261250-7818688, 07137010-21248258, 07165750-20957306, 07311000-674726, 07331600-8350531, 07099230-950371, 07194933-397506, 07103970-1529541, 07196000-399514, 07196090-398442, 07311230-575712, 071912213-21771121, 07175500-21784710, 07230000-481940, 07157940-21065946, 07249413-6041518, 07315500-3134443, 07241800-417955, 07263296-22846155, 07260673-7787118, 07083710-916565, 07325860-683417, 07179730-20931060, 07143665-21176042, 07195500-398302, 07079300-916493, 07179700-20929432, 07137500-21247638, 07249400-6041516, 07196500-401664, 07249920-1539341, 07335700-588170, 07300000-13531193, 07141900-941030274, 07180400-20920589, 07177800-847340, 07109500-952589, 07311500-576014, 07206000-20058478, 07164500-369211, 07145700-21166753, 07261500-7830338, 07191220-21771103, 07242000-418429, 07327447-686037, 07228940-6026262, 07170990-21797449, 07184500-20875144, 07332622-8348819, 07176500-847886, 07177650-847326, 07126325-938722, 07263580-22846695, 07311800-13520741, 07207000-20059116, 07106300-1531851, 07250935-7752590, 07126480-935000, 07312500-13734150, 07147070-20078342, 07144480-21165441, 07230500-482410, 07171000-21798265, 07179795-20919029, 07124200-937408, 07303400-13746044, 07336200-592734, 07328500-19958192, 07335790-591268, 07166500-20958460, 07260678-7787336, 07149000-21014649, 07178200-847450, 07331200-429982, 07336820-9544021, 07328100-19958172, 07263012-7835620, 07298500-13928769, 07325850-682419, 07169800-20945608, 07138064-21250364, 07337900-616744, 07099215-948497, 07233500-13837931, 07172000-21782934, 07160350-250456, 07146500-21166925, 07144486-21164973, 07326000-682561, 07312610-13734154, 07329780-19959914, 07143300-21187061, 07144910-21149364, 07249455-1543497, 
07151500-20987715, 07140890-21214285, 07203000-20066137, 07186000-7590909, 07315200-13660367, 07103100-1529669, 07300500-13533405, 07189542-7583213, 07130500-1596199, 07340000-3750770, 07341200-3748876, 07245000-512987, 07247250-6048082, 07103800-1531145, 07100300-1529605, 07105945-1530177, 07327550-687397, 07194500-381670, 07103703-1529691, 07154500-21126643, 382629104493000-950417, 07105800-1529977, 07331000-430522, 07157740-21066626, 07190500-21773053, 07196900-400822, 07157950-21046034, 07185000-24778185, 07263555-22846795, 07103700-1529693, 07249800-1540035, 07194880-397364, 07142300-21195628, 07301300-13754389, 07316500-3141592 )

Did not complete. Reports `PUBLICATION_COMPLETE_REPORTED_SUCCESS` (and the `FeatureReporter` report is tied to publication - need to align that with the messaging), but does not report `CONSUMPTION_COMPLETE_REPORTED_SUCCESS` for each consumer, so there is still some consumption to do. When I kill the graphics client, the evaluation starts to report:

2021-01-26T12:11:45.931+0000 INFO EvaluationStatusTracker While completing evaluation gK27pTbY6nLvNPAgh618qvykjYU, awaiting subscriber bD9H1hyP6rK4C35JRfDV4rwsPuo. There is PT1H56M39.8156405S remaining before the subscriber timeout occurs, unless the subscriber provides an update sooner.

In other words, the consumers were all reporting alive (before the graphics client was forcibly killed) and the evaluation is waiting for them to report done, but this will never happen, probably because some messages were lost in the failover (e.g., the expected message count for a group) or some other orchestration issue occurred during the failover.

Next step is to create a smaller evaluation to test failover and to use debug logging.

edit: it would probably also help to add some debug logging for a consumer to report the status of each open evaluation periodically (i.e., which groups are awaiting completion and why - whether some statistics messages have not arrived or the expected message count has not arrived).

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-26T12:25:24Z


The evaluation had generated this many files by the time of the forcible termination:

$ ls | wc -l
27861

Will compare with a successful run of the same evaluation...

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-26T12:42:58Z


A successful run of the same evaluation from the graphics client reports 12,916 statistics received across one evaluation. For the failing evaluation, only 12,881 were reported, so 35 blobs were seemingly lost during the failover. There are, on average, 36 blobs of statistics per feature in this evaluation, so that would be consistent with one feature receiving only 1 of the 36 blobs it should have received - that feature would never complete.

2021-01-26T12:38:50.722+0000 INFO GraphicsClient Evaluation subscriber 0gSxLyxuc7d-GetHEKdjjSSOKl8 is waiting for work. Until now, received 12916 packets of statistics across 1 evaluations. The most recent evaluation was SyiRzl_vPJ-YZZ1jLw0H3dZOM5k and the most recent statistics were attached to message ID:254d8ffd-8cae-3261-a2c9-b6e3c4884e7c. Evaluation subscriber 0gSxLyxuc7d-GetHEKdjjSSOKl8 completed 1 of the 1 evaluations that were started.
epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-26T12:50:02Z


But comparing the outputs, I see this for the good evaluation:

$ ls | wc -l
28058

In other words, 197 more files than the failed evaluation (27,861).

Precise differences:

$ diff wres_evaluation_gK27pTbY6nLvNPAgh618qvykjYU/dirListing.txt wres_evaluation_SyiRzl_vPJ-YZZ1jLw0H3dZOM5k/dirListing.txt
21810a21811,21888
> 072632982_22845923_NWM_Short_Range_BIAS_FRACTION.csv
> 072632982_22845923_NWM_Short_Range_BIAS_FRACTION.png
> 072632982_22845923_NWM_Short_Range_BOX_PLOT_OF_ERRORS.csv
> 072632982_22845923_NWM_Short_Range_BOX_PLOT_OF_ERRORS.png
> 072632982_22845923_NWM_Short_Range_BOX_PLOT_OF_PERCENTAGE_ERRORS.csv
> 072632982_22845923_NWM_Short_Range_BOX_PLOT_OF_PERCENTAGE_ERRORS.png
> 072632982_22845923_NWM_Short_Range_COEFFICIENT_OF_DETERMINATION.csv
> 072632982_22845923_NWM_Short_Range_COEFFICIENT_OF_DETERMINATION.png
> 072632982_22845923_NWM_Short_Range_INDEX_OF_AGREEMENT.csv
> 072632982_22845923_NWM_Short_Range_INDEX_OF_AGREEMENT.png
> 072632982_22845923_NWM_Short_Range_KLING_GUPTA_EFFICIENCY.csv
> 072632982_22845923_NWM_Short_Range_KLING_GUPTA_EFFICIENCY.png
> 072632982_22845923_NWM_Short_Range_MEAN.csv
> 072632982_22845923_NWM_Short_Range_MEAN_ABSOLUTE_ERROR.csv
> 072632982_22845923_NWM_Short_Range_MEAN_ABSOLUTE_ERROR.png
> 072632982_22845923_NWM_Short_Range_MEAN_ERROR.csv
> 072632982_22845923_NWM_Short_Range_MEAN_ERROR.png
> 072632982_22845923_NWM_Short_Range_MEAN_LEFT.png
> 072632982_22845923_NWM_Short_Range_MEAN_RIGHT.png
> 072632982_22845923_NWM_Short_Range_MEAN_SQUARE_ERROR.csv
> 072632982_22845923_NWM_Short_Range_MEAN_SQUARE_ERROR.png
> 072632982_22845923_NWM_Short_Range_MEAN_SQUARE_ERROR_SKILL_SCORE.csv
> 072632982_22845923_NWM_Short_Range_MEAN_SQUARE_ERROR_SKILL_SCORE.png
> 072632982_22845923_NWM_Short_Range_MEAN_SQUARE_ERROR_SKILL_SCORE_NORMALIZED.csv
> 072632982_22845923_NWM_Short_Range_MEAN_SQUARE_ERROR_SKILL_SCORE_NORMALIZED.png
> 072632982_22845923_NWM_Short_Range_MEDIAN_ERROR.csv
> 072632982_22845923_NWM_Short_Range_MEDIAN_ERROR.png
> 072632982_22845923_NWM_Short_Range_PEARSON_CORRELATION_COEFFICIENT.csv
> 072632982_22845923_NWM_Short_Range_PEARSON_CORRELATION_COEFFICIENT.png
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_10800_SECONDS.csv
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_10800_SECONDS.png
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_14400_SECONDS.csv
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_14400_SECONDS.png
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_18000_SECONDS.csv
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_18000_SECONDS.png
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_21600_SECONDS.csv
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_21600_SECONDS.png
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_25200_SECONDS.csv
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_25200_SECONDS.png
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_28800_SECONDS.csv
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_28800_SECONDS.png
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_32400_SECONDS.csv
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_32400_SECONDS.png
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_3600_SECONDS.csv
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_3600_SECONDS.png
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_36000_SECONDS.csv
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_36000_SECONDS.png
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_39600_SECONDS.csv
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_39600_SECONDS.png
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_43200_SECONDS.csv
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_43200_SECONDS.png
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_46800_SECONDS.csv
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_46800_SECONDS.png
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_50400_SECONDS.csv
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_50400_SECONDS.png
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_54000_SECONDS.csv
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_54000_SECONDS.png
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_57600_SECONDS.csv
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_57600_SECONDS.png
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_61200_SECONDS.csv
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_61200_SECONDS.png
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_64800_SECONDS.csv
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_64800_SECONDS.png
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_7200_SECONDS.csv
> 072632982_22845923_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_7200_SECONDS.png
> 072632982_22845923_NWM_Short_Range_ROOT_MEAN_SQUARE_ERROR.csv
> 072632982_22845923_NWM_Short_Range_ROOT_MEAN_SQUARE_ERROR.png
> 072632982_22845923_NWM_Short_Range_ROOT_MEAN_SQUARE_ERROR_NORMALIZED.csv
> 072632982_22845923_NWM_Short_Range_ROOT_MEAN_SQUARE_ERROR_NORMALIZED.png
> 072632982_22845923_NWM_Short_Range_SAMPLE_SIZE.csv
> 072632982_22845923_NWM_Short_Range_SAMPLE_SIZE.png
> 072632982_22845923_NWM_Short_Range_STANDARD_DEVIATION.csv
> 072632982_22845923_NWM_Short_Range_STANDARD_DEVIATION_LEFT.png
> 072632982_22845923_NWM_Short_Range_STANDARD_DEVIATION_RIGHT.png
> 072632982_22845923_NWM_Short_Range_SUM_OF_SQUARE_ERROR.csv
> 072632982_22845923_NWM_Short_Range_SUM_OF_SQUARE_ERROR.png
> 072632982_22845923_NWM_Short_Range_VOLUMETRIC_EFFICIENCY.csv
> 072632982_22845923_NWM_Short_Range_VOLUMETRIC_EFFICIENCY.png
22811a22890
> 07301500_13756121_NWM_Short_Range_BIAS_FRACTION.png
22812a22892
> 07301500_13756121_NWM_Short_Range_BOX_PLOT_OF_ERRORS.png
22813a22894
> 07301500_13756121_NWM_Short_Range_BOX_PLOT_OF_PERCENTAGE_ERRORS.png
22814a22896
> 07301500_13756121_NWM_Short_Range_COEFFICIENT_OF_DETERMINATION.png
22815a22898
> 07301500_13756121_NWM_Short_Range_INDEX_OF_AGREEMENT.png
22816a22900
> 07301500_13756121_NWM_Short_Range_KLING_GUPTA_EFFICIENCY.png
22818a22903
> 07301500_13756121_NWM_Short_Range_MEAN_ABSOLUTE_ERROR.png
22819a22905,22907
> 07301500_13756121_NWM_Short_Range_MEAN_ERROR.png
> 07301500_13756121_NWM_Short_Range_MEAN_LEFT.png
> 07301500_13756121_NWM_Short_Range_MEAN_RIGHT.png
22820a22909
> 07301500_13756121_NWM_Short_Range_MEAN_SQUARE_ERROR.png
22821a22911
> 07301500_13756121_NWM_Short_Range_MEAN_SQUARE_ERROR_SKILL_SCORE.png
22822a22913
> 07301500_13756121_NWM_Short_Range_MEAN_SQUARE_ERROR_SKILL_SCORE_NORMALIZED.png
22823a22915
> 07301500_13756121_NWM_Short_Range_MEDIAN_ERROR.png
22824a22917
> 07301500_13756121_NWM_Short_Range_PEARSON_CORRELATION_COEFFICIENT.png
22825a22919
> 07301500_13756121_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_10800_SECONDS.png
22826a22921
> 07301500_13756121_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_14400_SECONDS.png
22827a22923
> 07301500_13756121_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_18000_SECONDS.png
22828a22925
> 07301500_13756121_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_21600_SECONDS.png
22829a22927
> 07301500_13756121_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_25200_SECONDS.png
22830a22929
> 07301500_13756121_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_28800_SECONDS.png
22831a22931
> 07301500_13756121_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_32400_SECONDS.png
22832a22933
> 07301500_13756121_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_3600_SECONDS.png
22833a22935
> 07301500_13756121_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_36000_SECONDS.png
22834a22937
> 07301500_13756121_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_39600_SECONDS.png
22835a22939
> 07301500_13756121_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_43200_SECONDS.png
22836a22941
> 07301500_13756121_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_46800_SECONDS.png
22837a22943
> 07301500_13756121_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_50400_SECONDS.png
22838a22945
> 07301500_13756121_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_54000_SECONDS.png
22839a22947
> 07301500_13756121_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_57600_SECONDS.png
22840a22949
> 07301500_13756121_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_61200_SECONDS.png
22841a22951
> 07301500_13756121_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_64800_SECONDS.png
22842a22953
> 07301500_13756121_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_7200_SECONDS.png
22843a22955
> 07301500_13756121_NWM_Short_Range_ROOT_MEAN_SQUARE_ERROR.png
22844a22957
> 07301500_13756121_NWM_Short_Range_ROOT_MEAN_SQUARE_ERROR_NORMALIZED.png
22845a22959
> 07301500_13756121_NWM_Short_Range_SAMPLE_SIZE.png
22846a22961,22962
> 07301500_13756121_NWM_Short_Range_STANDARD_DEVIATION_LEFT.png
> 07301500_13756121_NWM_Short_Range_STANDARD_DEVIATION_RIGHT.png
22847a22964
> 07301500_13756121_NWM_Short_Range_SUM_OF_SQUARE_ERROR.png
22848a22966
> 07301500_13756121_NWM_Short_Range_VOLUMETRIC_EFFICIENCY.png
27606a27725,27802
> 07340300_3746094_NWM_Short_Range_BIAS_FRACTION.csv
> 07340300_3746094_NWM_Short_Range_BIAS_FRACTION.png
> 07340300_3746094_NWM_Short_Range_BOX_PLOT_OF_ERRORS.csv
> 07340300_3746094_NWM_Short_Range_BOX_PLOT_OF_ERRORS.png
> 07340300_3746094_NWM_Short_Range_BOX_PLOT_OF_PERCENTAGE_ERRORS.csv
> 07340300_3746094_NWM_Short_Range_BOX_PLOT_OF_PERCENTAGE_ERRORS.png
> 07340300_3746094_NWM_Short_Range_COEFFICIENT_OF_DETERMINATION.csv
> 07340300_3746094_NWM_Short_Range_COEFFICIENT_OF_DETERMINATION.png
> 07340300_3746094_NWM_Short_Range_INDEX_OF_AGREEMENT.csv
> 07340300_3746094_NWM_Short_Range_INDEX_OF_AGREEMENT.png
> 07340300_3746094_NWM_Short_Range_KLING_GUPTA_EFFICIENCY.csv
> 07340300_3746094_NWM_Short_Range_KLING_GUPTA_EFFICIENCY.png
> 07340300_3746094_NWM_Short_Range_MEAN.csv
> 07340300_3746094_NWM_Short_Range_MEAN_ABSOLUTE_ERROR.csv
> 07340300_3746094_NWM_Short_Range_MEAN_ABSOLUTE_ERROR.png
> 07340300_3746094_NWM_Short_Range_MEAN_ERROR.csv
> 07340300_3746094_NWM_Short_Range_MEAN_ERROR.png
> 07340300_3746094_NWM_Short_Range_MEAN_LEFT.png
> 07340300_3746094_NWM_Short_Range_MEAN_RIGHT.png
> 07340300_3746094_NWM_Short_Range_MEAN_SQUARE_ERROR.csv
> 07340300_3746094_NWM_Short_Range_MEAN_SQUARE_ERROR.png
> 07340300_3746094_NWM_Short_Range_MEAN_SQUARE_ERROR_SKILL_SCORE.csv
> 07340300_3746094_NWM_Short_Range_MEAN_SQUARE_ERROR_SKILL_SCORE.png
> 07340300_3746094_NWM_Short_Range_MEAN_SQUARE_ERROR_SKILL_SCORE_NORMALIZED.csv
> 07340300_3746094_NWM_Short_Range_MEAN_SQUARE_ERROR_SKILL_SCORE_NORMALIZED.png
> 07340300_3746094_NWM_Short_Range_MEDIAN_ERROR.csv
> 07340300_3746094_NWM_Short_Range_MEDIAN_ERROR.png
> 07340300_3746094_NWM_Short_Range_PEARSON_CORRELATION_COEFFICIENT.csv
> 07340300_3746094_NWM_Short_Range_PEARSON_CORRELATION_COEFFICIENT.png
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_10800_SECONDS.csv
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_10800_SECONDS.png
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_14400_SECONDS.csv
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_14400_SECONDS.png
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_18000_SECONDS.csv
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_18000_SECONDS.png
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_21600_SECONDS.csv
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_21600_SECONDS.png
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_25200_SECONDS.csv
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_25200_SECONDS.png
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_28800_SECONDS.csv
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_28800_SECONDS.png
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_32400_SECONDS.csv
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_32400_SECONDS.png
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_3600_SECONDS.csv
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_3600_SECONDS.png
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_36000_SECONDS.csv
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_36000_SECONDS.png
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_39600_SECONDS.csv
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_39600_SECONDS.png
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_43200_SECONDS.csv
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_43200_SECONDS.png
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_46800_SECONDS.csv
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_46800_SECONDS.png
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_50400_SECONDS.csv
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_50400_SECONDS.png
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_54000_SECONDS.csv
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_54000_SECONDS.png
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_57600_SECONDS.csv
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_57600_SECONDS.png
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_61200_SECONDS.csv
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_61200_SECONDS.png
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_64800_SECONDS.csv
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_64800_SECONDS.png
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_7200_SECONDS.csv
> 07340300_3746094_NWM_Short_Range_QUANTILE_QUANTILE_DIAGRAM_7200_SECONDS.png
> 07340300_3746094_NWM_Short_Range_ROOT_MEAN_SQUARE_ERROR.csv
> 07340300_3746094_NWM_Short_Range_ROOT_MEAN_SQUARE_ERROR.png
> 07340300_3746094_NWM_Short_Range_ROOT_MEAN_SQUARE_ERROR_NORMALIZED.csv
> 07340300_3746094_NWM_Short_Range_ROOT_MEAN_SQUARE_ERROR_NORMALIZED.png
> 07340300_3746094_NWM_Short_Range_SAMPLE_SIZE.csv
> 07340300_3746094_NWM_Short_Range_SAMPLE_SIZE.png
> 07340300_3746094_NWM_Short_Range_STANDARD_DEVIATION.csv
> 07340300_3746094_NWM_Short_Range_STANDARD_DEVIATION_LEFT.png
> 07340300_3746094_NWM_Short_Range_STANDARD_DEVIATION_RIGHT.png
> 07340300_3746094_NWM_Short_Range_SUM_OF_SQUARE_ERROR.csv
> 07340300_3746094_NWM_Short_Range_SUM_OF_SQUARE_ERROR.png
> 07340300_3746094_NWM_Short_Range_VOLUMETRIC_EFFICIENCY.csv
> 07340300_3746094_NWM_Short_Range_VOLUMETRIC_EFFICIENCY.png

So more than one feature failed to complete - three, in fact - although one of the three completed in part (the products that do not wait for the group to complete).

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-26T12:51:17Z


Bottom line: statistics were lost during the failover, and the remaining features completed very slowly afterwards.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-26T13:00:36Z


James wrote:

Bottom line: statistics were lost during the failover, and the remaining features completed very slowly afterwards.

The following is speculation, but the two things probably have the same origin.

See #87105-23, which notes a cycle of completion times (PT30S, PT30S, PT150S). I suspect this cycle is somehow related to the naïve/blocking flow control. A producer thread is only released to produce/publish a message group (feature) when one of two things happens:

  1. All consumers report that another feature has completed; or
  2. A lock to produce could not be acquired within PT10S.

The latter is a safety measure to avoid an insurmountable block. If three features are blocked, then this would, on average, leave three locks on production that must be overcome by the latter condition, amounting to PT30S in total. The reason is that the flow control works by a system of credit: it does not wait for a specific feature (one of the ones currently undergoing production) to complete consumption, only for any feature to complete. Thus, the flow control would be in a permanent credit deficit and would instead keep hitting the timeout condition.

This is speculation, but I think it will be something like that. The main thing to solve is the data loss, which should not happen. I predict that, once solved, the slowness on recovery will disappear too. Ultimately, this naïve/blocking flow control needs to die and I am loath to make it more sophisticated (such as tracking features 1:1).
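The credit system described above can be pictured with a small sketch. This is a simplified illustration, not the actual wres implementation; the class and method names are hypothetical:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/**
 * Simplified sketch of credit-based, blocking flow control. A producer must
 * acquire a credit before publishing a message group (feature). Credits are
 * returned when ANY group completes consumption, not the specific group that
 * was published, so a failover that loses a group's statistics leaves the
 * semaphore in a permanent credit deficit and producers fall back on the
 * PT10S timeout. All names here are hypothetical.
 */
public class NaiveFlowControl
{
    private final Semaphore credits;

    public NaiveFlowControl( int initialCredits )
    {
        this.credits = new Semaphore( initialCredits );
    }

    /**
     * Blocks until a credit is available or the PT10S safety valve expires.
     * @return true if a credit was acquired, false if the timeout was hit
     */
    public boolean awaitPermissionToPublish()
    {
        try
        {
            // The safety measure: publish anyway if no credit arrives in time
            return this.credits.tryAcquire( 10, TimeUnit.SECONDS );
        }
        catch ( InterruptedException e )
        {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Called when any message group reports consumption complete. */
    public void onGroupConsumptionComplete()
    {
        this.credits.release();
    }
}
```

If N groups never complete (e.g., because their statistics were lost in a failover), every subsequent publish waits out the full timeout, which would produce roughly the PT30S cycles observed with three stuck features.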

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-26T16:40:10Z


Having gone back to the test harness to experiment again with why messages may be lost, there's still a distinct smell emanating either from qpid or my understanding of it.

Using basic qpid jms client structures together with the qpid broker and our wres broker configuration (i.e., very little of wres), I can get the broker restarts to happen smoothly only up to a certain flow rate.

If I insert a @Thread.sleep(10)@ into the publication cycle, then a broker restart is seamless and the consumer restarts exactly where it ended, no message loss, regardless of how long the broker is down.

But once I remove that @Thread.sleep@ so that the publication rate is uncorked, then I start to see internal qpid exceptions on failover and message loss on restart, even very early on in the messaging (when the volume of messages in total is just a few MiB or less).

At the moment, I am calling this a qpid smell. It chimes somewhat with a comment here (edit: from 2018), although this user was using a transacted session (we are not):

http://qpid.2158936.n2.nabble.com/JMS-Client-failover-error-stops-message-consumption-td7672116.html

I noticed that if I leave the try/catch around the session.commit() call but
add a sleep of 200 milliseconds after committing each message in the loop
failover is always successful and a JMSException is never raised when doing
a session.commit() call.  It almost seems like a threading/race condition
issue but that's just a complete guess.

The problem is that this qpid stack is legacy, so it's pointless to submit bug reports. Hmmm. Still looking though - the real smell with qpid is the documentation.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-26T16:48:32Z


When uncorked, this is the first exception I see:

java.lang.IllegalStateException: Fail-over interrupted no-op failover support. No-op support should only be used where the caller is certain fail-over cannot occur.
    at org.apache.qpid.client.failover.FailoverNoopSupport.execute(FailoverNoopSupport.java:71)
    at org.apache.qpid.client.AMQSession_0_8.isBound(AMQSession_0_8.java:467)
    at org.apache.qpid.client.AMQSession_0_8.sync(AMQSession_0_8.java:1010)
    at org.apache.qpid.client.AMQSession_0_8.acknowledgeImpl(AMQSession_0_8.java:152)
    at org.apache.qpid.client.AMQSession.acknowledge(AMQSession.java:539)
    at org.apache.qpid.client.message.AbstractAMQMessageDelegate.acknowledge(AbstractAMQMessageDelegate.java:219)
    at org.apache.qpid.client.message.AbstractJMSMessage.acknowledge(AbstractJMSMessage.java:311)
    at wres.events.publish.MessagePublisherTest$2.onMessage(MessagePublisherTest.java:211)
    at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:758)
    at org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:732)
    at org.apache.qpid.client.AMQSession$Dispatcher.notifyConsumer(AMQSession.java:3705)
    at org.apache.qpid.client.AMQSession$Dispatcher.dispatchMessage(AMQSession.java:3637)
    at org.apache.qpid.client.AMQSession$Dispatcher.access$1300(AMQSession.java:3406)
    at org.apache.qpid.client.AMQSession.dispatch(AMQSession.java:3399)
    at org.apache.qpid.client.message.UnprocessedMessage.dispatch(UnprocessedMessage.java:54)
    at org.apache.qpid.client.AMQSession$Dispatcher.run(AMQSession.java:3547)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.qpid.client.failover.FailoverException: Failing over about to start
    at org.apache.qpid.client.AMQProtocolHandler.notifyFailoverStarting(AMQProtocolHandler.java:437)
    at org.apache.qpid.client.AMQProtocolHandler$1.run(AMQProtocolHandler.java:286)
    ... 1 common frames omitted
epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-26T16:50:41Z


https://issues.apache.org/jira/browse/QPID-855

Hmmm, not encouraging.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-26T18:18:03Z


James wrote:

https://issues.apache.org/jira/browse/QPID-855

Hmmm, not encouraging.

Have been experimenting with broker configuration options.

Found this bug when declaring the @wres-statistics@ virtual host node with this stringified json (broker starts, but crashes on node creation when a client requests a subscription):

    "virtualHostInitialConfiguration": "{\"type\": \"DERBY\", \"connectionThreadPoolSize\": \"8\", \"numberOfSelectors\": \"1\", \"nodeAutoCreationPolicies\": [{\"pattern\":\".*\",\"createdOnPublish\":\"true\",\"createdOnConsume\":\"true\",\"nodeType\":\"queue\"}] }"
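For readability, the same stringified configuration, unescaped:

```json
{
  "type": "DERBY",
  "connectionThreadPoolSize": "8",
  "numberOfSelectors": "1",
  "nodeAutoCreationPolicies": [
    {
      "pattern": ".*",
      "createdOnPublish": "true",
      "createdOnConsume": "true",
      "nodeType": "queue"
    }
  ]
}
```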
########################################################################
#
# Unhandled Exception java.lang.NullPointerException in Thread IO-/127.0.0.1:55069
#
# Exiting
#
########################################################################
java.lang.NullPointerException
        at java.base/java.util.HashMap.putMapEntries(HashMap.java:496)
        at java.base/java.util.HashMap.<init>(HashMap.java:485)
        at org.apache.qpid.server.virtualhost.AbstractVirtualHost.autoCreateNode(AbstractVirtualHost.java:1331)
        at org.apache.qpid.server.virtualhost.AbstractVirtualHost.getAttainedMessageSource(AbstractVirtualHost.java:1305)
        at org.apache.qpid.server.protocol.v0_8.AMQChannel.getQueue(AMQChannel.java:3561)
        at org.apache.qpid.server.protocol.v0_8.AMQChannel.receiveExchangeBound(AMQChannel.java:2475)
        at org.apache.qpid.server.protocol.v0_8.transport.ExchangeBoundBody.process(ExchangeBoundBody.java:130)
        at org.apache.qpid.server.protocol.v0_8.ServerDecoder.processMethod(ServerDecoder.java:153)
        at org.apache.qpid.server.protocol.v0_8.AMQDecoder.processFrame(AMQDecoder.java:203)
        at org.apache.qpid.server.protocol.v0_8.BrokerDecoder.doProcessFrame(BrokerDecoder.java:141)
        at org.apache.qpid.server.protocol.v0_8.BrokerDecoder.processFrame(BrokerDecoder.java:65)
        at org.apache.qpid.server.protocol.v0_8.AMQDecoder.processInput(AMQDecoder.java:185)
        at org.apache.qpid.server.protocol.v0_8.BrokerDecoder$1.run(BrokerDecoder.java:104)
        at org.apache.qpid.server.protocol.v0_8.BrokerDecoder$1.run(BrokerDecoder.java:97)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at org.apache.qpid.server.protocol.v0_8.BrokerDecoder.processAMQPFrames(BrokerDecoder.java:96)
        at org.apache.qpid.server.protocol.v0_8.AMQDecoder.decode(AMQDecoder.java:118)
        at org.apache.qpid.server.protocol.v0_8.ServerDecoder.decodeBuffer(ServerDecoder.java:44)
        at org.apache.qpid.server.protocol.v0_8.AMQPConnection_0_8Impl.onReceive(AMQPConnection_0_8Impl.java:250)
        at org.apache.qpid.server.transport.AbstractAMQPConnection.lambda$received$2(AbstractAMQPConnection.java:576)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at org.apache.qpid.server.transport.AbstractAMQPConnection.received(AbstractAMQPConnection.java:571)
        at org.apache.qpid.server.transport.MultiVersionProtocolEngine.received(MultiVersionProtocolEngine.java:135)
        at org.apache.qpid.server.transport.NonBlockingConnection.processAmqpData(NonBlockingConnection.java:611)
        at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.processData(NonBlockingConnectionPlainDelegate.java:58)
        at org.apache.qpid.server.transport.NonBlockingConnection.doRead(NonBlockingConnection.java:496)
        at org.apache.qpid.server.transport.NonBlockingConnection.doWork(NonBlockingConnection.java:270)
        at org.apache.qpid.server.transport.NetworkConnectionScheduler.processConnection(NetworkConnectionScheduler.java:134)
        at org.apache.qpid.server.transport.SelectorThread$ConnectionProcessor.processConnection(SelectorThread.java:575)
        at org.apache.qpid.server.transport.SelectorThread$SelectionTask.performSelect(SelectorThread.java:366)
        at org.apache.qpid.server.transport.SelectorThread$SelectionTask.run(SelectorThread.java:97)
        at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:533)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at org.apache.qpid.server.bytebuffer.QpidByteBufferFactory.lambda$null$0(QpidByteBufferFactory.java:464)
        at java.base/java.lang.Thread.run(Thread.java:834)
epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-26T20:19:16Z


No joy. Haven't managed to find a workaround for the qpid exception in #87105-32.

It may or may not happen in wres during a particular evaluation, but the errors that happened in wres in #87105-22 also caused unrecoverable behavior and I have witnessed those in the much simpler test harness too.

Actual behavior is that publication stops when the broker/connection dies (expected) and then restarts when it comes back (expected). Meanwhile, the consumer stops consuming (expected) but it also throws an internal exception (unexpected; #87105-33) and, when the broker restarts and a connection is reestablished by the core client and graphics client (expected), then the evaluation is seemingly unrecoverable (unexpected).

Bottom line, the time expended debugging qpid is probably better spent on moving to a fully supported/non-legacy amqp 1.0 broker and jms 2.0 client stack. Yes, it has much bigger scope than this ticket, but this is starting to consume significant effort and is probably not worthwhile as I have observed several qpid bugs in the course of this ticket that are not likely to be fixed.

Currently, the wres essentially has +no resilience+ if a messaging client (edit: or broker) fails once the messaging is underway; it only has resilience when a format writing client or an eventsbroker fails +before+ the messaging starts (e.g., prior to negotiation of a format writer). We also have the overall resilience from the tasker, whereby whole evaluations are retried under limited circumstances.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-01-26T20:46:04Z


Suppose we expanded our rabbitmq instance and made it into a cluster of say 3-5 nodes, would that help?

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-26T21:48:56Z


Jesse wrote:

Suppose we expanded our rabbitmq instance and made it into a cluster of say 3-5 nodes, would that help?

It might, in the sense that there's (much) less overhead in moving from qpid to rabbitmq than from qpid to an amqp 1.0 broker, but there's still overhead. Additionally, since there is no embedding of a rabbitmq broker within a java process, we add a degree of separation (and more configuration to maintain) between our unit/integration and maybe system testing (embedded qpid), laptop use (embedded qpid) and cowres use (rabbitmq). Also, I think the jms client for rabbitmq has several limitations that they document themselves.

Overall, I am not inclined towards that pathway, although I appreciate the suggestion and it might get us to something better sooner (at the expense of more work overall). I want to move to an amqp 1.0 stack and, specifically, to an amqp 1.0 broker that supports both within-process and out-of-process operation using a consistent configuration (other than the embedding itself) and a jms 2.0 messaging client. I've been looking at the activemq stack for a while (specifically, artemis) and I haven't seen anything yet to dissuade me from that pathway.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-26T23:48:08Z


Further evidence that the cause is probably a race condition within the broker.

If I insert a 10ms delay on publication of all messages into the wres @MessagePublisher@, then I don't see failover errors from the broker on restarting the broker and the evaluation completes the failover successfully without any duplication or loss of messages. In other words, the strategy that works in the toy test harness also works in the wres proper.

( For the avoidance of doubt, there is obviously no way that I am adding an artificial latency into the software to circumvent this probable broker bug, but it's another data point. )

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-27T19:21:27Z


James wrote:

Improvements include retries on publishing and exponential back-off on both publishing and consuming, as well as a reduced list of exceptions that trigger retries on consumption (not all @RuntimeException@).

Will push this stuff shortly. Just running through a bunch of manual tests, alongside the automated ones.

One annoyance related to durable queues is that they stick around on the broker if the application doesn't get a chance to clean them up. Main example would be a sigkill on a posix platform or any shutdown on a windows platform that is not via posix emulation, like cygwin. In short, for a long-running messaging client that creates long-running subscribers with durable queues (like a long-running graphics client), these queues will stick around unless the client is started in an environment where processes receive termination signals and the process receives an orderly shutdown signal (that triggers the java shutdown hook).

I don't believe there are any aggregate bad effects, other than the annoyance and whatever small memory overhead is allocated to the queues, but it's still irritating.

Arguably, there is no point to these durable queues until we can survive a broker restart, so they could be replaced with temporary, auto-deleting ones (a one-liner). But the goal is to survive a broker restart, so I am leaving them.

They can be removed manually via the management interface, for example, and they will die when a new container is created (to be clear, because the clients will receive an orderly shutdown signal from the linux os in the container).

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-01-29T10:35:27Z


Will not do more than commit:wres|7b161efc3fff3bfe8377b2f291651c761b564d79 for now.

It has retries on publishing and consuming, with exponential backoff. The number of retries is configurable via the binding URL.

More work will be needed on this, but it isn't worth the effort until we have a broker that is less buggy on restarts. Surviving broker restarts should be an easily achievable goal. Surviving core client or graphics client restarts will be harder, because the client's internal state will need to be restored to the state at the time of failure, whilst also avoiding the restoration/cycling of a guaranteed failing state.
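For illustration, the qpid jms (amqp 0-x) binding url exposes connection retry behaviour as brokerlist options. The host, port, credentials and values below are placeholders, and the option names should be checked against the qpid client documentation for the version in use:

```
amqp://guest:guest@clientid/vhost?brokerlist='tcp://localhost:5672?retries='5'&connectdelay='2000''
```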

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T13:06:10Z


The error in #92536 was dropped connections leading to the temporary queues associated with graphics clients being deleted (lifetime policy of @DELETE_ON_NO_OUTBOUND_LINKS@). This was not transparent to the graphics clients (no exception on the amqp connection), which led to them continuing as though nothing had happened, but no longer accepting work, which in turn led to evaluations (that require graphics) timing out.

There are probably a few possibilities to mitigate this. Connection heartbeating may help to clarify (to the graphics client) when a connection that hasn't seen an exception is no longer active.

https://qpid.apache.org/releases/qpid-jms-amqp-0-x-6.4.0/jms-amqp-0-8-book/JMS-Client-0-8-Client-Understanding-Connection.html

This is a potentially simple thing to add.
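Per the qpid jms 0-x documentation linked above, a heartbeat can be requested via the connection url. The fragment below is illustrative only (placeholders throughout; the value is in seconds):

```
amqp://guest:guest@clientid/vhost?brokerlist='tcp://localhost:5672?heartbeat='60''
```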

Alternatively, we make durable queues work properly (a lot more work).

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T13:24:23Z


Note #92536-52. It would be nice to reproduce the scenario whereby a problem occurs on an amqp connection that does not result in an exception being thrown against it (i.e., one that is invisible to the format client). For example, if an amqp connection is dropped by the broker, a connection closed error will be thrown (302, forced) and that is visible to the format client. In the situation in #92536, no such connection error was seen by the format client, yet the broker encountered an exception.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-05-28T13:25:43Z


There must be something between "fully persisted durable queue" and "always deleted when connections fail temporary queue."

The broker has the queues. Obviously we don't want abandoned queues to live on forever, but I doubt it has to become durable and also doubt it must be deleted on connection drop.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-05-28T13:26:19Z


@tcpkill@ is a great tool to simulate these kinds of situations.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-05-28T13:31:26Z


Truth be told, I am not certain what happens on the tasker/broker/worker-shim side when a connection fails from the worker-shim to the broker. Then again, the worker-shim is a net publisher; it's the tasker doing the consumption. Maybe because the tasker and broker always run on the same machine we get lucky. But then again, the same situation holds for the worker, the eventsbroker and the associated graphics instances (same machine).

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T13:54:10Z


Jesse wrote:

There must be something between "fully persisted durable queue" and "always deleted when connections fail temporary queue."

The broker has the queues. Obviously we don't want abandoned queues to live on forever, but I doubt it has to become durable and also doubt it must be deleted on connection drop.

I think these things are defined at the protocol level, perhaps with bonus stuff in particular middleware, like the rabbitmq extension for publisher confirms.

If I recall correctly, amqp 0.9.1 has a concept of durable, temporary and auto-delete (edit: which applies to all node types).

In contrast, amqp 1.0 has this:

http://docs.oasis-open.org/amqp/core/v1.0/amqp-core-messaging-v1.0.html

( Scroll down to 3.5.9 Node Properties. edit: direct link: http://docs.oasis-open.org/amqp/core/v1.0/os/amqp-core-messaging-v1.0-os.html#type-node-properties)

We don't need to use durable queues for all messaging components. As long as the core client is short-lived, it doesn't make much sense for format writing queues created by the core clients to persist. The external format writing clients (currently only graphics) are long-lived, and it makes perfect sense for those clients to create durable queues that survive broker/server restarts. Currently, the queue durability (on/off) is an implementation detail of the @EvaluationSubscriber@, but it could be exposed (to allow it to be set differently by different clients).

Another thing, bigger picture: qpid is pretty horrible, and the qpid documentation is worse. It's a legacy stack now, not actively developed. In other words, I am not sure how much effort to invest in fixing problems with this stack versus moving across to the amqp 1.0 protocol (and associated middleware, like activemq or similar). Obviously, protocols and messaging middlewares are two different things, but they are connected with respect to where we should go next.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T13:59:36Z


Jesse wrote:

@tcpkill@ is a great tool to simulate these kinds of situations.

Right, but I'm not sure that is currently available in our rhel ubi, although it could be added, of course. I do see iproute, which contains @ss@, and I think that can do pretty much the same thing. But, yes, I agree, we need something like this to simulate and reproduce. However, reproduce what? I'm not entirely sure what happened at the network level in #92536 or how I would find out.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T14:59:27Z


I might have a go at simulating this with @ss@ or @tcpkill@ because it would be nice to reproduce the failure, on the one hand, and to confirm whether a heartbeat mitigates it by ensuring that a connection exception occurs and is propagated to the graphics client, on the other. Adding the heartbeat is literally no more effort than adding an extra parameter to the connection url. All the other options are a lot more work (in varying degrees).

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T15:16:16Z


I can see the tcp connections for the jms/graphics client (edit: only one client, three queues, three amqp over tcp connections) from within the eventsbroker container.

sh-4.4$ ss state connected -t | grep jms
ESTAB 0      0      172.19.254.196:jms  172.19.254.198:43318
ESTAB 0      0      172.19.254.196:jms  172.19.254.198:43322
ESTAB 0      0      172.19.254.196:jms  172.19.254.198:43326
epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T15:20:37Z


sh-4.4$ ss --help | grep 'kill'
   -K, --kill          forcibly close sockets, display what was closed
epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T15:32:40Z


Hmmmm, nothing.

sh-4.4$ ss -K dst 172.19.254.198 dport = 43318
Netid      State      Recv-Q      Send-Q             Local Address:Port             Peer Address:Port      Process

So I guess it cannot close this one; the kernel won't allow it.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T15:45:28Z


Ah.

sh-4.4$ zcat config.gz | grep CONFIG_INET_DIAG_DESTROY
# CONFIG_INET_DIAG_DESTROY is not set
epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-05-28T16:16:33Z


It seems it's not easy to get @tcpkill@, which is part of the @dsniff@ package.

However, it does appear to take a different approach that would bypass the kernel (i.e., @CONFIG_INET_DIAG_DESTROY@), namely to sniff for tcp traffic and then formulate a correctly configured tcp reset packet. Obviously, this assumes that there is traffic, though, so it would hang indefinitely on an idle connection. I guess the tcp heartbeat would solve that too, but then it would not allow us to formulate a test case with/without the heartbeat.

Hmmmm, seems like a tough problem to simulate (even ignoring that we don't know exactly what happened in #92536). Open to ideas.