NCEAS / metadig-engine

MetaDig Engine: multi-dialect metadata assessment engine
7 stars 5 forks source link

Recover from RabbitMQ lost connection #291

Closed gothub closed 2 years ago

gothub commented 2 years ago

The k8s restarts mentioned in this k8s issue appear to be affecting the communication between RabbitMQ and metadig-worker. I'm not sure if this is the direct cause, but in any case, metadig-worker needs to be able to re-establish connection to RabbitMQ when it is lost. Metadig-engine prints out the following message to it's log, then it becomes unresponsive and will not handle new requests, as it appears RabbitMQ isn't talking to it any longer:

20210920-22:49:18: [INFO]: Total elapsed time (seconds): 1876 for metadataPid: urn:uuid:566a5ac2-4a01-4e6a-866e-986e13dc633b, suiteId: FAIR-suite-0.3.1
 [edu.ucsb.nceas.mdqengine.Worker:346]
20210920-22:49:18: [INFO]:  [x] Done [edu.ucsb.nceas.mdqengine.Worker:356]
20210920-22:49:18: [INFO]:  [x] Sent completed report for pid: 'urn:uuid:566a5ac2-4a01-4e6a-866e-986e13dc633b' [edu.ucsb.nceas.mdqengine.Worker:358]
ERROR ForgivingExceptionHandler Consumer edu.ucsb.nceas.mdqengine.Worker$1@53487557 (amq.ctag-R4BxQVhnclRPEOOGHtvDJw) method handleDelivery for channel AMQChannel(amqp://guest@10.97.176.139:5672/,1) threw an exception for channel AMQChannel(amqp://guest@10.97.176.139:5672/,1)
 com.rabbitmq.client.AlreadyClosedException: channel is already closed due to channel error; protocol method: #method<channel.close>(reply-code=406, reply-text=PRECONDITION_FAILED - consumer ack timed out on channel 1, class-id=0, method-id=0)
        at com.rabbitmq.client.impl.AMQChannel.ensureIsOpen(AMQChannel.java:258)
        at com.rabbitmq.client.impl.AMQChannel.transmit(AMQChannel.java:427)
        at com.rabbitmq.client.impl.AMQChannel.transmit(AMQChannel.java:421)
        at com.rabbitmq.client.impl.recovery.RecoveryAwareChannelN.basicAck(RecoveryAwareChannelN.java:93)
        at com.rabbitmq.client.impl.recovery.AutorecoveringChannel.basicAck(AutorecoveringChannel.java:428)
        at edu.ucsb.nceas.mdqengine.Worker$1.handleDelivery(Worker.java:315)
        at com.rabbitmq.client.impl.ConsumerDispatcher$5.run(ConsumerDispatcher.java:149)
        at com.rabbitmq.client.impl.ConsumerWorkService$WorkPoolRunnable.run(ConsumerWorkService.java:104)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

TODO: look into how to re-establish connection with RabbitMQ from a client and why the 'ack' is timing out.

gothub commented 2 years ago

The heartbeat capability of RabbitMQ can be used by the client (metadig-worker) to determine when the connection to RabbitMQ has been lost, so that it can be reestablished. Note that any operation from the client to the server (RabbitMQ) can be used for this purpose, so errors should be caught when sending the 'ack' from metadig-worker to RabbitMQ and the connection re-established if needed.

RabbitMQ heartbeats are described here.

gothub commented 2 years ago

After further investigation, it's clear that the lost connection is due to a worker running for too long and exceeding the RabbitMQ heartbeat timeout (see updated worker log above).

The long running workers appear to happen when many workers are running and all workers are busy. After viewing several worker logs, in all cases the worker completes creating the assessment and indexing it, but then an error is generated when the worker sends the 'ack' to tell RabbitMQ it's ready to receive another queue entry.

gothub commented 2 years ago

metadig-worker has been updated to reestablish the connection to RabbitMQ if it has been lost, and resend the 'completed' message back to the controller if necessary, in commit 8092612dc9490d1f1e9be32d7c053353dfc59da0