Consumer crashes with exit code 0, when RabbitMQ goes down

salvatorecordiano commented 5 years ago

I'm using your consumer in a production environment with success.

Last night our RabbitMQ cluster went down, and most of our consumers died. Every consumer terminated with exit code 0, which is wrong because the process did not finish successfully.

We rely on the consumer process reporting a correct exit code, so we wrote the following snippet in a bash script:

until /var/www/project/rabbitmq-consumer/rabbitmq-cli-consumer \
-c "/var/www/project/app/config/rabbitmq/consumer-prod.conf" \
-q "q-jobs-prod" \
-e "/var/www/project/app/console --env=dev -vvv rabbitmq:endpoint" -i -o; do
    STATUS=$?
    echo "[$(date)] --> Consumer crashed with exit code $STATUS. Respawning..."
    sleep 1
done

This way, we make sure that the consumers are always up and running.

This approach worked with ricbra/rabbitmq-cli-consumer.

To reproduce this bug you can follow this procedure:

# first of all, run the consumer
/var/www/project/bin/rabbitmq-cli-consumer -c /var/www/project/app/config/rabbitmq/consumer-dev.conf -q q-jobs-dev -e "/var/www/project/app/console --env=dev -vvv rabbitmq:endpoint"

In a new terminal window:

# turn off RabbitMQ
docker stop rabbit-mq

In the first terminal you will see that the consumer has stopped, and you can check its exit code (echo $?).

Can you help me? Thanks

corvus-ch commented 5 years ago

This is the result of the changes made in #19. Prior to that, the consumer did not handle a connection shutdown initiated by the server, but now it does.

When a RabbitMQ server dies, there are two cases I am aware of.

1. The server crashes unexpectedly. If this happens, all connections are dropped and the consumer stops with exit code 1.
2. The server initiates a shutdown, which includes sending notifications to all consumers. If a consumer receives such a notification, it finishes processing the current job (if any) and then shuts down with exit code 0. The reasoning is that a shutdown signal from the server can be treated the same as a TERM signal from the shell.

From what I can tell from your (@salvatorecordiano) description, the second case is what happened.

Your code worked so far because the rabbitmq-cli-consumer binary did not have any execution path that resulted in an exit with code 0.
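
To illustrate why the respawn loop stops: `until` repeats only while the command fails, so the first exit with code 0 ends it. A minimal sketch, where `fake_consumer` is a hypothetical stand-in for the real binary (it fails twice, then exits cleanly):

```shell
# Hypothetical stand-in for the consumer: fails on the first two runs,
# then exits with code 0 on the third.
runs=0
fake_consumer() {
    runs=$((runs + 1))
    [ "$runs" -ge 3 ]
}

# Same shape as the respawn loop from the report: the body only runs
# while the command returns a non-zero exit code.
until fake_consumer; do
    echo "Consumer exited with code $?. Respawning..."
done

echo "Loop ended after $runs runs because the last exit code was 0."
```

With a consumer that exits 0 on a server-initiated shutdown, a loop of this shape treats the shutdown as success and does not respawn.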

It did not occur to me that somebody would use the exit code to determine whether a consumer needs to be restarted. In my model of operating the rabbitmq-cli-consumer, a supervision process like systemd or Supervisor is used.

I am inclined to consider this "works as designed". If anybody has good arguments why this behaviour should be considered wrong, I am open to a discussion.

corvus-ch commented 5 years ago

As discovered through #42, there is currently a race condition in the way asynchronous events get passed along. This resulted in the consumer exiting before event handlers had the chance to do their work. As a matter of fact, the implementation was intended to exit with code 10 when the server closes the connection.

I am really sorry for getting this wrong in my previous comment.

I will provide a fix.
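
For anyone scripting around the exit code, a wrapper could tell these cases apart roughly as follows. This is only a sketch: the codes 0, 1 and 10 are the ones mentioned in this thread, while the `describe_exit` helper and its wording are hypothetical and not part of the consumer.

```shell
# Hypothetical helper: map a consumer exit code to a short description.
# Only 0, 1 and 10 are taken from this thread; everything else is
# lumped together as an unknown failure.
describe_exit() {
    case "$1" in
        0)  echo "clean shutdown" ;;
        1)  echo "connection dropped" ;;
        10) echo "server closed the connection" ;;
        *)  echo "unknown failure" ;;
    esac
}

for code in 0 1 10 137; do
    echo "exit code $code: $(describe_exit "$code")"
done
```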

corvus-ch commented 5 years ago

@salvatorecordiano Would you mind trying out #47? Does this fix your issue?

salvatorecordiano commented 5 years ago

Hi @corvus-ch, thanks. At the moment I'm not able to build the binary for Linux. Can you publish it, please? I will test your consumer immediately.

corvus-ch commented 5 years ago

@salvatorecordiano I have built and published a pre-release version. See https://github.com/corvus-ch/rabbitmq-cli-consumer/releases/tag/2.3.1-alpha1.

salvatorecordiano commented 5 years ago

@corvus-ch Now it works properly. I'll wait for your new release to deploy it. Thank you.

corvus-ch commented 5 years ago

@salvatorecordiano I just released a new version: https://github.com/corvus-ch/rabbitmq-cli-consumer/releases/tag/2.3.1.

salvatorecordiano commented 5 years ago

After your last release, I encountered a new issue. When I send SIGTERM to the consumer, the output is:

2018/11/07 09:31:51 Cancel consumption of messages.
2018/11/07 09:32:58 Processed!

It receives my signal but doesn't act on it, so the process keeps running forever.

In the previous release, when we sent SIGTERM to the consumer, the process exited with code 0. When we sent SIGKILL, the exit code was 137.
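
For reference, 137 is how a shell reports a process killed by SIGKILL: 128 plus the signal number 9. A small sketch, using `sleep` as a stand-in for the consumer (an assumption for illustration only; it is not the consumer itself):

```shell
# Start a long-running stand-in process in the background.
sleep 30 &
pid=$!

# SIGKILL cannot be caught, so the process dies immediately.
kill -KILL "$pid"

# wait returns 128 + signal number for a process killed by a signal.
wait "$pid"
status=$?

echo "exit status after SIGKILL: $status"
echo "signal number: $((status - 128))"
```

An exit code of 0 after SIGTERM, by contrast, is only possible when the process catches the signal and shuts down on its own.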

corvus-ch commented 5 years ago

Hi @salvatorecordiano,

I tried to reproduce your issue and was not able to do so. Can you please open a new issue so we can investigate and track this new topic?

salvatorecordiano commented 5 years ago

Hi @corvus-ch, I opened #51. In the issue description, I show that your last release introduces this bug.

corvus-ch / rabbitmq-cli-consumer

Consumer crashes with exit code 0, when RabbitMQ goes down #40