EnMasseProject / enmasse

EnMasse - Self-service messaging on Kubernetes and OpenShift
https://enmasseproject.github.io
Apache License 2.0

Messages lost from queue when consuming application using prefetch > 10 repeatedly disconnects abruptly #2925

Open k-wall opened 5 years ago

k-wall commented 5 years ago

Describe the bug

A consuming application takes 1 message from a queue address in the standard address space and then disconnects abruptly. If the application uses a prefetch value > 10, message loss may occur after the tenth invocation of the program.

I demonstrate the issue using the tools used by the system test suite, which default to link credit 1000.

Queue addresses in the brokered address space do not suffer this issue.

To Reproduce

Steps to reproduce the behavior:

  1. Create a standard address space
  2. Create a queue address of type standard-small-queue.
  3. Run the following command to place a backlog of 250 messages on the queue:

cli-rhea-sender --count 250 --msg-durable true --msg-content pepa --log-lib TRANSPORT_FRM --conn-ssl true --log-msgs json --broker guest:guest@$(oc get addressspace standard -o 'jsonpath={.status.endpointStatuses[?(@.name=="messaging")].externalHost}'):443 --address myqueue1

  4. Run the following command to take a single message from the queue. (Owing to the link credit, the client may receive more messages internally, but these are not handled with a disposition. You'll see the additional transfers in the client log written to stdout.)

cli-rhea-receiver --count 1 --log-lib TRANSPORT_FRM --conn-ssl true --log-msgs json --broker guest:guest@$(oc get addressspace standard -o 'jsonpath={.status.endpointStatuses[?(@.name=="messaging")].externalHost}'):443 --address myqueue1

  5. Repeat step 4 at least 10 times.

  6. Restart the broker pod (to work around ARTEMIS-1982) and check the queue depth using the console. The queue depth is likely to be much less than 240, which demonstrates the defect.

  7. Check the previous broker log, where messages such as the following will be seen:

2019-06-20T02:23:58.693Z WARN [server] AMQ222148: Message Reference[28586]:NON-RELIABLE:AMQPMessage [durable=false, messageID=28586, address=myqueue1, size=28, applicationProperties=null, properties=null, extraProperties = TypedProperties[_AMQ_AD=myqueue1]] has exceeded max delivery attempts. No bindings for Dead Letter Address DLQ so dropping it

Expected behavior

Queue depth 240

k-wall commented 5 years ago

EnMasse does not set Artemis's max-delivery-attempts in its configured address settings, so the default of 10 applies (org.apache.activemq.artemis.core.settings.impl.AddressSettings#DEFAULT_MAX_DELIVERY_ATTEMPTS).
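For reference, Artemis allows this default to be overridden per address-settings match in broker.xml. A sketch of what such an override could look like (element names are from Artemis's address-settings schema; the `match="#"` wildcard and the `-1` value, which Artemis treats as unlimited attempts, are illustrative choices, not anything EnMasse currently emits):

```xml
<address-settings>
  <!-- hypothetical override: lift the per-message delivery-attempt cap
       (-1 = unlimited) so repeatedly failed prefetched deliveries are
       never dead-lettered or dropped -->
  <address-setting match="#">
    <max-delivery-attempts>-1</max-delivery-attempts>
  </address-setting>
</address-settings>
```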

From looking at the protocol trace between broker and router, I see

  1. one Accepted disposition for the consumed message,
  2. n Modified deliveryFailed=true dispositions, where n tallies with the number of additional message transfers that were received by the client but not processed before it disconnected,
  3. sometimes a Released disposition expressed using a delivery range. I think this corresponds to messages that made it as far as the router but weren't forwarded to the client.

The Broker's handling of the delivery counts for the Modified deliveryFailed=true and Released dispositions is consistent with the AMQP specification. For the messages that are deliveryFailed=true, this causes their delivery count to be incremented, so after the tenth attempt the message is dead-lettered or dropped.
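The bookkeeping described above can be replayed with a toy model (plain Python, not EnMasse or Artemis code; the batch size of 20, the 12 reconnects, and the "redeliver near the head" ordering are assumptions made to mirror the reproducer):

```python
# Toy model of the broker-side bookkeeping described above: Accepted
# settles a message for good, Modified{deliveryFailed=true} increments
# its delivery count, Released requeues it unchanged, and once the
# count reaches max-delivery-attempts (Artemis default: 10) the message
# is dropped because no DLQ binding exists.
from collections import deque

MAX_DELIVERY_ATTEMPTS = 10  # Artemis's DEFAULT_MAX_DELIVERY_ATTEMPTS

class ToyQueue:
    def __init__(self):
        self.messages = deque()   # entries are (msg_id, delivery_count)
        self.dropped = []         # ids dropped after too many attempts

    def enqueue(self, msg_id):
        self.messages.append((msg_id, 0))

    def dispatch(self, n):
        """Hand up to n messages to a consumer (the prefetch)."""
        batch = []
        while self.messages and len(batch) < n:
            batch.append(self.messages.popleft())
        return batch

    def settle(self, delivery, outcome):
        msg_id, count = delivery
        if outcome == "accepted":
            return                       # consumed; gone for good
        if outcome == "modified-delivery-failed":
            count += 1                   # a failed delivery was reported
        # "released" falls through without incrementing the count
        if count >= MAX_DELIVERY_ATTEMPTS:
            self.dropped.append(msg_id)  # no DLQ binding, so dropped
        else:
            # requeue near the head for redelivery (assumption: this
            # roughly mirrors how cancelled deliveries return to a queue)
            self.messages.appendleft((msg_id, count))

q = ToyQueue()
for i in range(250):
    q.enqueue(i)

# Replay the reproducer: prefetch a batch, accept exactly one message,
# then "disconnect" so the rest are settled Modified{deliveryFailed=true}.
for _ in range(12):
    batch = q.dispatch(20)
    q.settle(batch[0], "accepted")
    for d in batch[1:]:
        q.settle(d, "modified-delivery-failed")

print("remaining:", len(q.messages), "dropped ids:", sorted(q.dropped))
```

Because the same prefetched-but-unacknowledged messages sit at the head of the queue on every reconnect, their counts climb in lockstep and a whole batch of them hits the cap on the tenth run, which matches the sudden drop in queue depth seen in the reproducer.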

I think the Router is conservatively sending Modified deliveryFailed=true for the messages it knows it sent to the client but for which it has not received explicit dispositions. @grs is my understanding correct? If so, this behaviour does not seem ideal in a world where client applications do fail, and often fail repeatedly. Is it possible to configure the router to simply Release the message instead?

grs commented 5 years ago

@k-wall your understanding is indeed correct. It is not currently possible to configure alternative behaviour. I would argue that it is correct behaviour. The spec says 'the released outcome is used to indicate that a given transfer was not and will not be acted upon'; when the client connection has failed without indicating a status on the delivery, the router cannot make that guarantee. The delivery-count is there to allow a receiver to decide whether it is possibly a redelivery.

I suspect that the broker would also mark such messages as redelivered if the router was not between the client and the broker (I certainly think it should).

My view is that in the case where a client with a prefetch is being repeatedly disconnected, high delivery counts are to be expected.

k-wall commented 5 years ago

@grs I don't disagree with your interpretation of the AMQP specification, and I agree that Qpid Dispatch Router is faithful to it. However, when I look at this from an end-user perspective I find it troubling. I think most messaging users expect reliable messaging, so out-of-the-box defaults should cater for that.

If I were to build an application with the Apache Qpid JMS client (which prefetches by default) and use it with EnMasse, then under a failure scenario where the application fails/restarts after establishing its consumer, this default behaviour would proceed to eat its way through the queue's backlog. I think expecting the application developer to know that they need to tune a parameter to avoid this behaviour is expecting too much.
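For what it's worth, Qpid JMS does expose the relevant knob via connection URI options (the prefetchPolicy options are real Qpid JMS configuration; the host and port here are placeholders, and whether lowering the prefetch is an acceptable throughput trade-off is application-specific):

```
amqps://messaging.example.com:443?jms.prefetchPolicy.queuePrefetch=1
```

But that only underlines the point: the developer has to know to set it.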

I do think that Router should make this behaviour tuneable.

I tested the same scenario without the router (i.e. brokered). The delivery count is not incremented in this case.

grs commented 5 years ago

@k-wall I don't see how the current behaviour of the router in any way jeopardises reliable messaging. If anything I think it supports it. What it seems to me to boil down to is whether the delivery count should be incremented on deliveries that are sent to a client (i.e. delivered) if that client fails before acknowledging them. To me it seems obvious that it should.

I also did an experiment with the broker and came to a different conclusion from you. It seems to me that the broker does increment the delivery count for messages sent out but not acknowledged (which I believe is correct behaviour). Perhaps we are testing different things?

I agree btw that the issue as reported against EnMasse is far from ideal. I just disagree that the source of the issue is the incrementing of the delivery count for messages delivered to clients but unacknowledged at the point the client is disconnected.