NCEAS / metadig-engine

MetaDig Engine: multi-dialect metadata assessment engine
7 stars 5 forks source link

rabbitmq sometimes cannot recover the connection #394

Closed jeanetteclark closed 5 months ago

jeanetteclark commented 7 months ago

RabbitMQ is set up to auto-recover connections that fail for network reasons, but sometimes this auto-recovery seems to fail.

Here are the logs showing the connection reset by peer error, an apparent channel recovery, then the exception

20240122-22:00:12: [WARN]: An unexpected connection driver error occurred (Exception message: Connection reset by peer) [com.rabbitmq.client.impl.ForgivingExceptionHandler:115]
20240122-22:00:17: [DEBUG]: Creating executor service with 256 thread(s) for consumer work service [com.rabbitmq.client.impl.ConsumerWorkService:40]
20240122-22:00:17: [DEBUG]: Connection amqp://metadig@10.106.39.185:5672/ has recovered [com.rabbitmq.client.impl.recovery.AutorecoveringConnection:588]
20240122-22:00:17: [DEBUG]: Channel AMQChannel(amqp://metadig@10.106.39.185:5672/,1) has recovered [com.rabbitmq.client.impl.recovery.AutorecoveringConnection:648]
20240122-22:00:17: [DEBUG]: RecordedExchange[name=metadig, type=direct, durable=true, autoDelete=false, arguments=null, channel=AMQChannel(amqp://metadig@10.106.39.185:5672/,1)] has recovered [com.rabbitmq.client.impl.recovery.AutorecoveringConnection:760]
20240122-22:00:17: [DEBUG]: Recovering RecordedQueue[name=quality, durable=true, autoDelete=false, exclusive=false, arguments=nullserverNamed=false, channel=AMQChannel(amqp://metadig@10.106.39.185:5672/,1)] [com.rabbitmq.client.impl.recovery.AutorecoveringConnection:799]
20240122-22:00:17: [ERROR]: Caught an exception when recovering topology Caught an exception while recovering queue quality: null [com.rabbitmq.client.impl.ForgivingExceptionHandler:119]
com.rabbitmq.client.TopologyRecoveryException: Caught an exception while recovering queue quality: null
    at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.recoverQueue(AutorecoveringConnection.java:782)
    at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.recoverTopology(AutorecoveringConnection.java:721)
    at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.beginAutomaticRecovery(AutorecoveringConnection.java:597)
    at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.lambda$addAutomaticRecoveryListener$3(AutorecoveringConnection.java:519)
    at com.rabbitmq.client.impl.AMQConnection.notifyRecoveryCanBeginListeners(AMQConnection.java:824)
    at com.rabbitmq.client.impl.AMQConnection.doFinalShutdown(AMQConnection.java:801)
    at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:685)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.io.IOException
    at com.rabbitmq.client.impl.AMQChannel.wrap(AMQChannel.java:132)
    at com.rabbitmq.client.impl.AMQChannel.wrap(AMQChannel.java:128)
    at com.rabbitmq.client.impl.AMQChannel.exnWrappingRpc(AMQChannel.java:150)
    at com.rabbitmq.client.impl.ChannelN.queueDeclare(ChannelN.java:971)
    at com.rabbitmq.client.impl.ChannelN.queueDeclare(ChannelN.java:46)
    at com.rabbitmq.client.impl.recovery.RecordedQueue.recover(RecordedQueue.java:60)
    at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.lambda$internalRecoverQueue$13(AutorecoveringConnection.java:803)
    at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.wrapRetryIfNecessary(AutorecoveringConnection.java:909)
    at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.internalRecoverQueue(AutorecoveringConnection.java:802)
    at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.recoverQueue(AutorecoveringConnection.java:778)
    ... 7 more
Caused by: com.rabbitmq.client.ShutdownSignalException: channel error; protocol method: #method<channel.close>(reply-code=404, reply-text=NOT_FOUND - home node 'rabbit@metadig-rabbitmq-0.metadig-rabbitmq-headless.metadig.svc.cluster.local' of durable queue 'quality' in vhost '/' is down or inaccessible, class-id=50, method-id=10)
    at com.rabbitmq.utility.ValueOrException.getValue(ValueOrException.java:66)
    at com.rabbitmq.utility.BlockingValueOrException.uninterruptibleGetValue(BlockingValueOrException.java:36)
    at com.rabbitmq.client.impl.AMQChannel$BlockingRpcContinuation.getReply(AMQChannel.java:505)
    at com.rabbitmq.client.impl.AMQChannel.privateRpc(AMQChannel.java:296)
    at com.rabbitmq.client.impl.AMQChannel.exnWrappingRpc(AMQChannel.java:144)
    ... 14 more

NB: The application is currently running the bitnami chart 10.3.9 for rabbitmq which has the 3.10.8 release of rabbitmq. this release reached EOL on Dec 31 so we should probably upgrade rabbitmq to the latest version which is 12.8.0

Edit: more log messages from RMQ that might be important

2024-01-22 22:00:12.356407+00:00 [error] <0.1248.0> Failed to fetch a list of nodes from Kubernetes API: timeout
2024-01-22 22:00:12.358142+00:00 [warning] <0.2718.43> Description: "Authenticity is not established by certificate path validation"
2024-01-22 22:00:12.358142+00:00 [warning] <0.2718.43>      Reason: "Option {verify, verify_peer} and cacertfile/cacerts is missing"
2024-01-22 22:00:12.358142+00:00 [warning] <0.2718.43> 
2024-01-22 22:00:14.429285+00:00 [warning] <0.2724.43> Description: "Authenticity is not established by certificate path validation"
2024-01-22 22:00:14.429285+00:00 [warning] <0.2724.43>      Reason: "Option {verify, verify_peer} and cacertfile/cacerts is missing"
jeanetteclark commented 7 months ago

from metadig-rabbitmq-2 on metadig-prod

2024-01-27 01:17:15.503584+00:00 [info] <0.29211.57> Node rabbit@metadig-rabbitmq-1.metadig-rabbitmq-headless.metadig.svc.cluster.local is down, deleting its listeners
2024-01-27 01:17:15.506133+00:00 [info] <0.29211.57> 1 transient queues from an old incarnation of node 'rabbit@metadig-rabbitmq-1.metadig-rabbitmq-headless.metadig.svc.cluster.local' deleted in 0.002372s
2024-01-27 01:17:15.506828+00:00 [info] <0.29211.57> node 'rabbit@metadig-rabbitmq-1.metadig-rabbitmq-headless.metadig.svc.cluster.local' down: net_tick_timeout
2024-01-27 01:17:15.508839+00:00 [error] <0.29211.57> Partial partition detected:
2024-01-27 01:17:15.508839+00:00 [error] <0.29211.57>  * We saw DOWN from rabbit@metadig-rabbitmq-1.metadig-rabbitmq-headless.metadig.svc.cluster.local
2024-01-27 01:17:15.508839+00:00 [error] <0.29211.57>  * We can still see rabbit@metadig-rabbitmq-0.metadig-rabbitmq-headless.metadig.svc.cluster.local which can see rabbit@metadig-rabbitmq-1.metadig-rabbitmq-headless.metadig.svc.cluster.local
2024-01-27 01:17:15.508839+00:00 [error] <0.29211.57> We will therefore intentionally disconnect from rabbit@metadig-rabbitmq-0.metadig-rabbitmq-headless.metadig.svc.cluster.local
2024-01-27 01:17:16.509849+00:00 [info] <0.29211.57> rabbit on node 'rabbit@metadig-rabbitmq-0.metadig-rabbitmq-headless.metadig.svc.cluster.local' down
2024-01-27 01:17:16.689184+00:00 [info] <0.29211.57> Node rabbit@metadig-rabbitmq-0.metadig-rabbitmq-headless.metadig.svc.cluster.local is down, deleting its listeners
2024-01-27 01:17:16.690188+00:00 [info] <0.29211.57> node 'rabbit@metadig-rabbitmq-0.metadig-rabbitmq-headless.metadig.svc.cluster.local' down: disconnect
2024-01-27 01:17:17.564922+00:00 [warning] <0.5490.71> Description: "Authenticity is not established by certificate path validation"
2024-01-27 01:17:17.564922+00:00 [warning] <0.5490.71>      Reason: "Option {verify, verify_peer} and cacertfile/cacerts is missing"
2024-01-27 01:17:17.564922+00:00 [warning] <0.5490.71> 
2024-01-27 01:17:23.707971+00:00 [info] <0.29211.57> node 'rabbit@metadig-rabbitmq-0.metadig-rabbitmq-headless.metadig.svc.cluster.local' up
2024-01-27 01:17:23.709438+00:00 [error] <0.28697.57> Mnesia('rabbit@metadig-rabbitmq-2.metadig-rabbitmq-headless.metadig.svc.cluster.local'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@metadig-rabbitmq-0.metadig-rabbitmq-headless.metadig.svc.cluster.local'}
2024-01-27 01:17:23.709438+00:00 [error] <0.28697.57> 
2024-01-27 01:17:23.711015+00:00 [warning] <0.5504.71> Description: "Authenticity is not established by certificate path validation"
2024-01-27 01:17:23.711015+00:00 [warning] <0.5504.71>      Reason: "Option {verify, verify_peer} and cacertfile/cacerts is missing"
2024-01-27 01:17:23.711015+00:00 [warning] <0.5504.71> 
2024-01-27 01:17:23.900701+00:00 [warning] <0.5512.71> Description: "Authenticity is not established by certificate path validation"
2024-01-27 01:17:23.900701+00:00 [warning] <0.5512.71>      Reason: "Option {verify, verify_peer} and cacertfile/cacerts is missing"
2024-01-27 01:17:23.900701+00:00 [warning] <0.5512.71> 
2024-01-27 01:17:26.163949+00:00 [error] <0.29652.57> Failed to fetch a list of nodes from Kubernetes API: timeout
2024-01-27 01:17:26.165908+00:00 [warning] <0.5520.71> Description: "Authenticity is not established by certificate path validation"
2024-01-27 01:17:26.165908+00:00 [warning] <0.5520.71>      Reason: "Option {verify, verify_peer} and cacertfile/cacerts is missing"
2024-01-27 01:17:26.165908+00:00 [warning] <0.5520.71> 
2024-01-27 01:17:28.427964+00:00 [warning] <0.29652.57> Peer discovery: node rabbit@metadig-rabbitmq-0.metadig-rabbitmq-headless.metadig.svc.cluster.local is unreachable
2024-01-27 01:17:28.428091+00:00 [warning] <0.29652.57> Peer discovery: node rabbit@metadig-rabbitmq-1.metadig-rabbitmq-headless.metadig.svc.cluster.local is unreachable
2024-01-27 01:17:30.913963+00:00 [warning] <0.5515.71> global: 'rabbit@metadig-rabbitmq-2.metadig-rabbitmq-headless.metadig.svc.cluster.local' failed to connect to 'rabbit@metadig-rabbitmq-1.metadig-rabbitmq-headless.metadig.svc.cluster.local'
2024-01-27 01:17:30.913963+00:00 [warning] <0.5515.71> 
2024-01-27 01:17:33.901482+00:00 [warning] <0.5533.71> Description: "Authenticity is not established by certificate path validation"
2024-01-27 01:17:33.901482+00:00 [warning] <0.5533.71>      Reason: "Option {verify, verify_peer} and cacertfile/cacerts is missing"
2024-01-27 01:17:33.901482+00:00 [warning] <0.5533.71> 
2024-01-27 01:17:35.984562+00:00 [info] <0.29211.57> node 'rabbit@metadig-rabbitmq-1.metadig-rabbitmq-headless.metadig.svc.cluster.local' up
2024-01-27 01:17:35.988391+00:00 [warning] <0.5537.71> Description: "Authenticity is not established by certificate path validation"
2024-01-27 01:17:35.988391+00:00 [warning] <0.5537.71>      Reason: "Option {verify, verify_peer} and cacertfile/cacerts is missing"
2024-01-27 01:17:35.988391+00:00 [warning] <0.5537.71> 
2024-01-27 01:17:36.165953+00:00 [error] <0.29652.57> Failed to fetch a list of nodes from Kubernetes API: timeout
2024-01-27 01:17:36.167540+00:00 [warning] <0.5543.71> Description: "Authenticity is not established by certificate path validation"
2024-01-27 01:17:36.167540+00:00 [warning] <0.5543.71>      Reason: "Option {verify, verify_peer} and cacertfile/cacerts is missing"
2024-01-27 01:17:36.167540+00:00 [warning] <0.5543.71> 
2024-01-27 01:17:38.430971+00:00 [warning] <0.29652.57> Peer discovery: node rabbit@metadig-rabbitmq-0.metadig-rabbitmq-headless.metadig.svc.cluster.local is unreachable
2024-01-27 01:17:38.431106+00:00 [warning] <0.29652.57> Peer discovery: node rabbit@metadig-rabbitmq-1.metadig-rabbitmq-headless.metadig.svc.cluster.local is unreachable
2024-01-27 01:17:43.901316+00:00 [warning] <0.5568.71> Description: "Authenticity is not established by certificate path validation"
2024-01-27 01:17:43.901316+00:00 [warning] <0.5568.71>      Reason: "Option {verify, verify_peer} and cacertfile/cacerts is missing"
2024-01-27 01:17:43.901316+00:00 [warning] <0.5568.71> 
2024-01-27 01:17:44.324627+00:00 [warning] <0.29211.57> Autoheal: we were selected to restart; winner is 'rabbit@metadig-rabbitmq-1.metadig-rabbitmq-headless.metadig.svc.cluster.local'
2024-01-27 01:17:44.324726+00:00 [info] <0.5576.71> RabbitMQ is asked to stop...
2024-01-27 01:17:44.364111+00:00 [info] <0.5576.71> Stopping RabbitMQ applications and their dependencies in the following order:
jeanetteclark commented 7 months ago

in an effort to fix the above, I upgraded rabbitmq on dev-metadig and have been putting it through its paces by sending the test.arctic node a whole slew of documents (around 400 in total). I have seen zero connection errors so far. the process for upgrading rabbitmq was:

  1. helm upgrade to chart version 11.10.1
  2. check in RMQ dashboard that all feature flags are enabled
  3. delete the two pods that didn't get a rolling upgrade (probably 0 and 1) so they can roll forward
  4. helm upgrade to chart version 12.8.2
  5. check feature flags again
  6. delete two pods to roll them forward

edit: nevermind see a few dropped connections now

jeanetteclark commented 5 months ago

updated rabbitmq with deployment of 3.0.0 and so far so good...

mbjones commented 5 months ago

Awesome.