stliu opened this issue 1 year ago
According to this doc, https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-amqp-troubleshoot#connection-is-closed, an AMQP link will be closed after 10 minutes of idle time:
You see the following error when the AMQP connection and link are active but no calls (for example, send or receive) are made using the link for 10 minutes. So, the link is closed. The connection is still open.
amqp:link:detach-forced:The link 'G2:7223832:user.tenant0.cud_00000000000-0000-0000-0000-00000000000000' is force detached by the broker due to errors occurred in publisher(link164614). Detach origin: AmqpMessagePublisher.IdleTimerExpired: Idle timeout: 00:10:00. TrackingId:00000000000000000000000000000000000000_G2_B3, SystemTracker:mynamespace:Topic:MyTopic, Timestamp:2/16/2018 11:10:40 PM
An AMQP connection will be closed after all links are closed:
You see the following error on the AMQP connection when all links in the connection have been closed because there was no activity (idle) and a new link has not been created in 5 minutes.
Error{condition=amqp:connection:forced, description='The connection was inactive for more than the allowed 300000 milliseconds and is closed by container 'LinkTracker'. TrackingId:00000000000000000000000000000000000_G21, SystemTracker:gateway5, Timestamp:2019-03-06T17:32:00', info=null}
And the last one
You see this error when a new AMQP connection is created but a link is not created within 1 minute of the creation of the AMQP Connection.
Error{condition=amqp:connection:forced, description='The connection was inactive for more than the allowed 60000 milliseconds and is closed by container 'LinkTracker'. TrackingId:0000000000000000000000000000000000000_G21, SystemTracker:gateway5, Timestamp:2019-03-06T18:41:51', info=null}
So, for a long-running application that is very busy during the daytime (1,000 messages per second) and has very low traffic at night (1 message per hour), what is the strategy here?
First and foremost, whenever a message comes, template.send(message) must succeed and the message must be delivered to Azure Service Bus.
So that leaves us with two options:
2022-11-06 10:57:03.609 SENT: SASL
2022-11-06 10:57:03.875 RECV: SASL
2022-11-06 10:57:03.877 RECV: SaslMechanisms{saslServerMechanisms=[MSSBCBS, PLAIN, ANONYMOUS, EXTERNAL]}
2022-11-06 10:57:03.897 SENT: SaslInit{mechanism=PLAIN, initialResponse=\x00RootManageSharedAccessKey\x00+GtCmBe/x4v6I5/mxEHKv6JxVrRu0PrKo3UjN1csjYg=, hostname='shaozliu.servicebus.windows.net'}
2022-11-06 10:57:04.165 RECV: SaslOutcome{_code=OK, _additionalData=Welcome!}
2022-11-06 10:57:04.167 SENT: AMQP
2022-11-06 10:57:04.190 SENT: Open{ containerId='azure:2657a071-8d5d-427a-86f9-d6e76ffa4a5e:1', hostname='shaozliu.servicebus.windows.net', maxFrameSize=1048576, channelMax=32767, idleTimeOut=50000, outgoingLocales=null, incomingLocales=null, offeredCapabilities=null, desiredCapabilities=[sole-connection-for-container, DELAYED_DELIVERY, ANONYMOUS-RELAY, SHARED-SUBS], properties={com.microsoft:is-client-provider=true, product=QpidJMS, version=0.53.0, platform=JVM: 17.0.4.1, 17.0.4.1+1-LTS, Microsoft, OS: Mac OS X, 12.6.1, x86_64}}
2022-11-06 10:57:04.434 RECV: AMQP
2022-11-06 10:57:04.459 RECV: Open{ containerId='07bc72c10fde44afb67c5f833eeb5e41_G0', hostname='null', maxFrameSize=65536, channelMax=4999, idleTimeOut=120000, outgoingLocales=null, incomingLocales=null, offeredCapabilities=null, desiredCapabilities=null, properties=null}
2022-11-06 10:57:04.468 SENT: Begin{remoteChannel=null, nextOutgoingId=1, incomingWindow=2047, outgoingWindow=2147483647, handleMax=65535, offeredCapabilities=null, desiredCapabilities=null, properties=null}
2022-11-06 10:57:04.735 RECV: Begin{remoteChannel=0, nextOutgoingId=1, incomingWindow=5000, outgoingWindow=2047, handleMax=255, offeredCapabilities=null, desiredCapabilities=null, properties=null}
2022-11-06 10:57:04.751 [492364852:1] SENT: Begin{remoteChannel=null, nextOutgoingId=1, incomingWindow=2047, outgoingWindow=2147483647, handleMax=65535, offeredCapabilities=null, desiredCapabilities=null, properties=null}
2022-11-06 10:57:05.064 [492364852:1] RECV: Begin{remoteChannel=1, nextOutgoingId=1, incomingWindow=5000, outgoingWindow=2047, handleMax=255, offeredCapabilities=null, desiredCapabilities=null, properties=null}
2022-11-06 10:57:05.085 [492364852:1] SENT: Attach{name='qpid-jms:sender:azure:5caf3ef4-9602-413c-964d-cf1292d6e1f5:1:1:1:t4', handle=0, role=SENDER, sndSettleMode=UNSETTLED, rcvSettleMode=FIRST, source=Source{address='azure:5caf3ef4-9602-413c-964d-cf1292d6e1f5:1:1:1', durable=NONE, expiryPolicy=SESSION_END, timeout=0, dynamic=false, dynamicNodeProperties=null, distributionMode=null, filter=null, defaultOutcome=null, outcomes=[amqp:accepted:list, amqp:rejected:list, amqp:released:list, amqp:modified:list], capabilities=null}, target=Target{address='t4', durable=NONE, expiryPolicy=SESSION_END, timeout=0, dynamic=false, dynamicNodeProperties=null, capabilities=[queue]}, unsettled=null, incompleteUnsettled=false, initialDeliveryCount=0, maxMessageSize=null, offeredCapabilities=null, desiredCapabilities=[DELAYED_DELIVERY], properties=null}
2022-11-06 10:57:05.377 [492364852:1] RECV: Attach{name='qpid-jms:sender:azure:5caf3ef4-9602-413c-964d-cf1292d6e1f5:1:1:1:t4', handle=0, role=RECEIVER, sndSettleMode=UNSETTLED, rcvSettleMode=FIRST, source=Source{address='azure:5caf3ef4-9602-413c-964d-cf1292d6e1f5:1:1:1', durable=NONE, expiryPolicy=SESSION_END, timeout=0, dynamic=false, dynamicNodeProperties=null, distributionMode=null, filter=null, defaultOutcome=null, outcomes=[amqp:accepted:list, amqp:rejected:list, amqp:released:list, amqp:modified:list], capabilities=null}, target=Target{address='t4', durable=NONE, expiryPolicy=SESSION_END, timeout=0, dynamic=false, dynamicNodeProperties=null, capabilities=[queue]}, unsettled=null, incompleteUnsettled=false, initialDeliveryCount=null, maxMessageSize=1048576, offeredCapabilities=[DELAYED_DELIVERY], desiredCapabilities=null, properties=null}
2022-11-06 10:57:05.456 [492364852:1] SENT: Transfer{handle=0, deliveryId=0, deliveryTag=\x00, messageFormat=0, settled=false, more=false, rcvSettleMode=null, state=null, resume=false, aborted=false, batchable=false} (228) "\x00Sp\xc0\x02\x01A\x00Sr\xc1)\x04\xa3\x0ex-opt-jms-destQ\x00\xa3\x12x-opt-jms-msg-typeQ\x05\x00Ss\xd0\x00\x00\x00O\x00\x00\x00\x0a\xa15ID:azure:5caf3ef4-9602-413c-964d-cf1292d6e1f5:1:1:1-1@\xa1\x02t4@@@@@@\x83\x00\x00\x01\x84J\xde\xc9\xa2\x00St\xc1\x1f\x02\xa1\x05_type\xa1\x15com.example.demo.User\x00Sw\xa1/{"name":"2022-11-06T10:57:01.630090 message 0"}"
2022-11-06 10:57:05.788 [492364852:1] RECV: Disposition{role=RECEIVER, first=0, last=null, settled=true, state=Accepted{}, batchable=false}
After about 10 minutes of idle time since the first message was sent, we now get the detach command from the server to close the link, as described here: https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-amqp-troubleshoot#link-is-closed
2022-11-06 11:07:05.760 [492364852:1] RECV: Detach{handle=0, closed=true, error=Error{condition=amqp:link:detach-forced, description='The link 'G0:36906660:qpid-jms:sender:azure:5caf3ef4-9602-413c-964d-cf1292d6e1f5:1:1:1:t4' is force detached. Code: publisher(link376). Details: AmqpMessagePublisher.IdleTimerExpired: Idle timeout: 00:10:00.', info=null}}
2022-11-06 11:07:05.769 [492364852:1] SENT: Detach{handle=0, closed=true, error=null}
Since we only had one link and it was closed 5 minutes earlier, we then get another command to close the connection, as described here: https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-amqp-troubleshoot#connection-is-closed
2022-11-06 11:12:06.077 [492364852:1] RECV: End{error=null}
2022-11-06 11:12:06.077 RECV: Close{error=Error{condition=amqp:connection:forced, description='The connection was inactive for more than the allowed 300000 milliseconds and is closed by container 'LinkTracker'. TrackingId:07bc72c10fde44afb67c5f833eeb5e41_G0, SystemTracker:gateway5, Timestamp:2022-11-06T03:12:06', info=null}}
2022-11-06 11:12:06.078 [492364852:1] SENT: End{error=null}
2022-11-06 11:12:06.081 SENT: Close{error=null}
Full AMQP tracing can be found here 8e31f0b3a82e20f391312a622083d9d5
After some internal discussion, @shankarsama will consider adding an AMQP property to make the link timeout configurable, which we can leverage to fix this issue. @shankarsama, would you help update this thread when that configuration is released on the service side?
But @yiliuTo, your proposal will NOT FIX the issue.
In https://github.com/microsoft/azure-spring-boot/issues/817#issuecomment-1306509890 you've clearly stated that you're going to provide a fix here, not a workaround.
While the option to configure an EFFECTIVE timeout (which Service Bus will respect, for a change) is a nice feature, prolonging or disabling the timeout WILL NOT FIX the underlying issue: the JmsTemplate provided by your starter cannot recover from obviously recoverable exceptions.
Suggestion: leave the timeouts be, just make the JmsTemplate able to recover internally when a known situation like this happens. I will repeat myself, but your other library, azure-servicebus, has no problem recovering from timeouts: the client instance is created once and just works when needed. This is really all that everyone expects and wants... OK, maybe 5 s on the initial send is also a bit too much.
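To make the suggestion concrete, here is a rough sketch of the kind of recovery meant, done at the application level (assuming the starter's auto-configured JmsTemplate sits on top of a CachingConnectionFactory, as discussed below; the single retry is purely illustrative, not how the starter behaves today):

```java
import org.springframework.jms.JmsException;
import org.springframework.jms.connection.CachingConnectionFactory;
import org.springframework.jms.core.JmsTemplate;
import org.springframework.stereotype.Component;

@Component
public class RecoveringSender {

    private final JmsTemplate jmsTemplate;

    public RecoveringSender(JmsTemplate jmsTemplate) {
        this.jmsTemplate = jmsTemplate;
    }

    public void send(String destination, Object payload) {
        try {
            jmsTemplate.convertAndSend(destination, payload);
        } catch (JmsException e) {
            // The broker force-detached the idle link, so the cached session is dead.
            // Drop the cached connection/sessions and retry once on a fresh connection.
            if (jmsTemplate.getConnectionFactory() instanceof CachingConnectionFactory) {
                ((CachingConnectionFactory) jmsTemplate.getConnectionFactory()).resetConnection();
            }
            jmsTemplate.convertAndSend(destination, payload);
        }
    }
}
```

This is exactly the sort of thing a sender should not have to write itself.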
@zoladkow, thanks for your suggestion, we will consider it.
any update on this issue?
No updates at the moment. Rest assured, we are actively trying to resolve it and will keep you informed if there are any developments.
@Netyyyy @saragluna can you update this issue? How is it going? The only way to get around it at the moment is the "connection pool" workaround, right? Why is this such a big issue?
Hi @neffsvg, we are still working on this issue. @vinaysurya, please help take a look.
@Netyyyy @vinaysurya we have a go-live deadline in a few weeks, and QA needs to check everything first, so there is high pressure on us.
Is there a PR or anything where we can track the progress, given that the last updates were May 26 and 2 days ago? Or is there something we can potentially help with? I still do not understand why this would take so long; maybe I am missing something.
@zoladkow are you still using this? Maybe we can join forces with them to solve this.
@neffsvg Not anymore, we spotted the issue very early (and long ago) and decided instead to go with the plain azure-servicebus library. Granted, it did not provide all the @JmsListener or JmsTemplate Spring stuff, but that's absolutely not a problem. The most important thing was that Queue/TopicClient would not go into an invalid state after an idle disconnect; it just handles this internally, like common sense would dictate.
Oh, and this issue is by far not new. You can check the previous one here: https://github.com/microsoft/azure-spring-boot/issues/817
Given that the only real solution would require redoing this starter on top of azure-servicebus (or whatever they changed the name to - yeah, naming convention changes, top priority, eh...), which would require actually providing adapters to work with Spring JMS, I can see why they are so reluctant - the gain seems awfully small to justify such an effort, a volunteer effort especially. Also, most likely this is never an issue for serverless scenarios, since the message is sent and the function ends (unless they "optimize" by keeping the function instance alive long enough for the timeout to become a problem there too...)
@neffsvg I would recommend staying away from Azure Service Bus, especially in combination with JMS. It is a real nightmare.
We have other issues with the azure-servicebus library which have not been resolved for a long time. For example, #33688.
This issue should be addressed for now. We have made a change on the Service Bus service side to not aggressively enforce the expiration of idle links for JMS customers who come through the azure-servicebus-jms library (which is the dependency used by spring-cloud-azure-starter-servicebus-jms, version 4.x). This is only available for premium messaging namespaces. There are still quota enforcements on the number of active links a namespace can have at any given point in time.
The real long-term fix will likely involve the Qpid JMS library, where a producer object on the client should not immediately be considered closed upon receiving a link close. Instead, when the JMS producer object is used, the Qpid JMS client library has to check whether the underlying link is in a closed state and, if so, re-create the underlying AMQP link.
Please note that the fix on the Service Bus service side is in the process of deployment. We expect it to be deployed across all Service Bus clusters in about 3 weeks from today.
I had the same issue and I fixed it by enabling the connection pool in Spring Boot:
spring:
  jms:
    servicebus:
      connection-string: Endpoint=sb://*******w=
      pricing-tier: standard
      pool:
        enabled: true
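For what it's worth, the pool.* properties are backed by the pooled-jms library (org.messaginghub:pooled-jms); if the starter does not already pull it in transitively, that dependency has to be on the classpath for the pool settings to take effect. The reason this works around the idle detach is that a pooled connection factory manages a set of connections and sessions and discards ones that have failed, handing out a working one instead, so JmsTemplate is less likely to end up reusing a dead cached session. (That is the general behavior of this kind of pool, not a statement about this starter's exact wiring.)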
More information
Describe the bug
When using JMSTemplate to send a message to Service Bus, it throws an exception if the connection has been idle for more than 10 minutes. Another issue is performance: it takes 5 seconds to send the first message; see the log below for more details.
Exception or Stack Trace
To Reproduce
Add the code snippet that causes the issue.
It can be reproduced by this code.
I'm using
Expected behavior
There are two problems with this issue:
1. The underlying connection is not kept alive.
With the tracing log enabled, we can see there is an IdleTimeoutCheck running every 60 seconds that tries to keep the connection alive by sending an empty frame, but this doesn't seem to keep the connection alive. So how do we keep the connection alive? Is it related to https://github.com/Azure/azure-sdk-for-python/pull/10209?
2. org.springframework.jms.connection.SingleConnectionFactory#reconnectOnException is not honoured.
Spring Cloud Azure uses CachingConnectionFactory by default, according to its javadoc. From the current behavior of this exception, it does not automatically recover the underlying connection.
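For anyone who wants to experiment with the reconnectOnException path, here is a minimal sketch of how it is normally wired in plain Spring JMS, assuming you build the factories yourself rather than relying on the starter's auto-configuration (the "nativeConnectionFactory" bean name is a placeholder for whatever Service Bus ConnectionFactory bean is in play, and replacing the auto-configured bean this way may need extra care):

```java
import javax.jms.ConnectionFactory;

import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jms.connection.CachingConnectionFactory;

@Configuration
public class JmsRecoveryConfig {

    @Bean
    public CachingConnectionFactory cachingConnectionFactory(
            @Qualifier("nativeConnectionFactory") ConnectionFactory nativeConnectionFactory) {
        CachingConnectionFactory caching = new CachingConnectionFactory(nativeConnectionFactory);
        // Re-establish the shared connection when the provider reports a failure
        // through the JMS ExceptionListener; note that an idle link detach may not
        // be reported that way, which is what this issue is about.
        caching.setReconnectOnException(true);
        caching.setSessionCacheSize(10);
        return caching;
    }
}
```

Per the CachingConnectionFactory javadoc, reconnectOnException is already switched on by default, so if the starter's default setup really is a CachingConnectionFactory, the interesting question is why the exception-listener path never fires for the idle detach.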
I have the full log attached below
First message delivered! Why does it take 4 seconds to send a single message??? Can we do something to speed this up? Maybe warm up during the bootstrap process? (See the warm-up sketch at the end of this post.)
--- seems the server sends this message every 3 minutes
---> looks like this IdleTimeoutCheck doesn't work; maybe Azure Service Bus doesn't expect an empty frame?
--> Now in LINK_FINAL state; it takes 10 minutes. Is this a hard limit from Service Bus? Is there a doc saying that? https://github.com/Azure/azure-sdk-for-python/issues/10127 https://github.com/Azure/azure-sdk-for-python/pull/10209
--> About to send a new message; the first one was sent at 2022-11-06 10:57:01.630, and we have a 15-minute scheduler
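Regarding the warm-up question in the notes above: a minimal sketch of forcing the SASL/AMQP handshake during startup instead of on the first real send, assuming the auto-configured JmsTemplate is available (this is just an illustration, not something the starter provides):

```java
import org.springframework.boot.ApplicationRunner;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jms.core.JmsTemplate;
import org.springframework.jms.core.SessionCallback;

@Configuration
public class JmsWarmupConfig {

    // Opening one session at startup pays the connection/handshake cost up front,
    // so the first real template.send() does not take several seconds.
    @Bean
    public ApplicationRunner jmsWarmup(JmsTemplate jmsTemplate) {
        return args -> jmsTemplate.execute((SessionCallback<Void>) session -> null);
    }
}
```

Of course this only helps with the slow first send; it does nothing about the idle timeout discussed above.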