Webador / SlmQueueDoctrine

Doctrine adapter for SlmQueue module

Deadlocks when fetching messages from queue #160

Open tadasauciunas opened 4 years ago

tadasauciunas commented 4 years ago

Once per week a cronjob puts approximately 2000 messages into the queue in a single transaction in our system. We have 4 workers processing messages from this Doctrine queue. Unfortunately, several workers usually end up in deadlocks when fetching these messages from the queue. I tried looking for solutions but did not manage to come up with anything, and also did not find anything similar in the existing issues, so I'll leave the exception information below.

Caught exception while processing queue; An exception occurred while executing 'SELECT * FROM queue_default WHERE (status = ?) AND (queue = ?) AND (scheduled <= ?) ORDER BY priority ASC, scheduled ASC LIMIT 1 FOR UPDATE' with params [1, "my_queue", "2020-07-13 04:00:25.690387"]:

SQLSTATE[40001]: Serialization failure: 1213 Deadlock found when trying to get lock try restarting transaction; An exception occurred while executing 'SELECT * FROM queue_default WHERE (status = ?) AND (queue = ?) AND (scheduled <= ?) ORDER BY priority ASC, scheduled ASC LIMIT 1 FOR UPDATE' with params [1, "my_queue", "2020-07-13 04:00:25.690387"]:
roelvanduijnhoven commented 4 years ago

@tadasauciunas A deadlock exception can always occur in a relational database system, I'm afraid.

Having said that: we could think about solving this in the library. This query polls for a new queue entry, and if that fails, we could simply retry.
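The catch-and-retry idea could look something like the sketch below. This is an illustrative Python sketch, not SlmQueue's actual (PHP) code; `DeadlockError` and `with_deadlock_retry` are hypothetical names standing in for the driver's deadlock exception (SQLSTATE 40001 / MySQL error 1213) and for a generic retry wrapper.

```python
import time

class DeadlockError(Exception):
    """Stand-in for the driver's deadlock exception (SQLSTATE 40001)."""

def with_deadlock_retry(operation, attempts=3, backoff=0.1):
    """Run `operation`, retrying a few times if a deadlock is reported."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except DeadlockError:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(backoff * attempt)  # brief, growing pause before retrying
```

Since MySQL rolls back the deadlocked transaction automatically, restarting the poll from scratch like this is usually safe for a read-and-claim query.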

However I should say, we never have this issue in our production stack. @tadasauciunas What indexes did you place on this table?

MatthiasKuehneEllerhold commented 4 years ago

The problem is far worse than this. We've expanded our background queue system and suddenly a lot more of these DeadlockExceptions are cropping up. Btw we're not using SlmQueue anymore but a heavily modified variant.

A worker popping a job can handle this exception fairly easily: just reset the entity manager (-> new entity manager reference! You have to update all injected dependencies) or simply restart the worker. But we've also seen the problem when pushing changes of a job back to the queue after it was done executing.

Example scenario A: Worker A with Job A is in execution (STATUS_RUNNING) and is now done. Worker B tries to get a new job and locks the table. Worker A now wants to write back the changes of Job A (now STATUS_BURIED?) and gets the DeadlockException. Job A is now stuck in STATUS_RUNNING forever.

Scenario B: The web frontend wants to add a new job, but Worker C tries to get a new job at the same time. C has the table locked and the frontend can't insert the new job! This job is now missing from the queue forever, and the entity manager of the web frontend is now closed -> nothing else can be done in this request.

The root problem is that MariaDB / MySQL locks ALL rows that are scanned for a "SELECT ... FOR UPDATE" statement, not just the single row we care about. See https://stackoverflow.com/questions/6066205/when-using-mysqls-for-update-locking-what-is-exactly-locked for more info. I don't know if PostgreSQL has the same problem...

Proposed change (in pseudo code):

The critical path here is the first UPDATE. Since no locking and no transaction is used, this UPDATE shouldn't block any other queries. AFAIK it is still atomic, so we only get one worker per job even if the UPDATE queries are fired simultaneously.
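The original pseudo code did not survive in the thread, but the claim-by-UPDATE idea described above can be sketched as follows. This is a hedged illustration using SQLite from Python (the thread targets MySQL/MariaDB, and MySQL would need a slightly different UPDATE form): a worker first claims at most one pending job with a plain UPDATE that tags the row with its worker id, and only then SELECTs the row it claimed. Table layout, status values, and the `worker_id` column are assumptions.

```python
import sqlite3

# Minimal queue table for illustration: status 1 = pending, 2 = running.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE queue_default (
    id INTEGER PRIMARY KEY, queue TEXT, status INTEGER,
    worker_id TEXT, data TEXT)""")
conn.execute("INSERT INTO queue_default (queue, status, data)"
             " VALUES ('my_queue', 1, 'payload')")

def pop(conn, worker_id):
    # Step 1: atomically claim at most one pending job. No SELECT ... FOR
    # UPDATE, no long-running transaction -- just a single UPDATE.
    cur = conn.execute(
        """UPDATE queue_default
           SET status = 2, worker_id = ?
           WHERE id = (SELECT id FROM queue_default
                       WHERE status = 1 AND queue = 'my_queue'
                       ORDER BY id LIMIT 1)
             AND status = 1""",
        (worker_id,))
    if cur.rowcount == 0:
        return None  # nothing was pending
    # Step 2: read back the row this worker just tagged with its id.
    return conn.execute(
        "SELECT id, data FROM queue_default WHERE worker_id = ? AND status = 2",
        (worker_id,)).fetchone()
```

Because the claim is a single statement, two workers firing it simultaneously cannot both end up owning the same row: the second UPDATE matches zero rows and that worker simply polls again.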

roelvanduijnhoven commented 4 years ago

@MatthiasKuehneEllerhold At first I told you we did not suffer from this problem. But we do in fact encounter the cases you now mention. So I too would gladly see this fixed.

Good find. I did not know it would lock the whole table. Sounds silly, but that indeed seems to be the case from what I read.

The solution you propose sounds legitimate, although you would think / hope there would be some other, more efficient solution using SQL...

PostgreSQL: Documentation: 9.0: SELECT

If FOR UPDATE or FOR SHARE is specified, the SELECT statement locks the selected rows against concurrent updates. (See FOR UPDATE/FOR SHARE Clause below.)

I also do not use PostgreSQL, but this seems to indicate Postgres does lock correctly.

But I'm in favor of a PR that simply replaces the mechanism we now have with the one you propose. Would you like to take a stab at this @MatthiasKuehneEllerhold?


The problem is far worse than this. We've expanded our background queue system and suddenly a lot more of these DeadlockExceptions are cropping up. Btw we're not using SlmQueue anymore but a heavily modified variant.

@MatthiasKuehneEllerhold Care to tell what you added to the system :partying_face:? Maybe some of that juice could flow back to the main repo? I would be willing to help in that effort!

basz commented 4 years ago

I do not fully understand the cause of this issue in depth. As a non-expert I would expect a DB to handle this for me. But this is what I do know.

We are currently using Postgres with a modest amount of runners and jobs and have never seen these exceptions. We switched from MySQL a while (2 yrs) back and I cannot recall having these kinds of errors back then either... maybe our load was too low...

As for a solution: what I have been reading seems to suggest these errors are to be expected (again, as a non-DB-expert that surprises me), and the remedy would be to simply catch and retry them a few times... @roelvanduijnhoven also suggested something like that above.

roelvanduijnhoven commented 4 years ago

As for a solution; what I have been reading seems to suggest these errors are to be expected

@basz Yes. You can never 100% prevent deadlocks. But MySQL locking a whole table is kind of troublesome and makes this far more likely to occur! We have been seeing these errors using MySQL as well, more than we used to. So I guess it indeed depends on the size of the queueing system.

and the remedy would be to simply catch and retry then for a few times..

Catching and retrying works in the case where the query that atomically fetches a new row fails. But not in the other cases that @MatthiasKuehneEllerhold described. Those would all need to be handled in user-land code, and that sucks.

The mechanism @MatthiasKuehneEllerhold describes will resolve the full table lock, but will decrease performance. The question is: do we want to keep a separate mechanism for Postgres? Maybe we should try to keep it?

MatthiasKuehneEllerhold commented 4 years ago

If Postgres handles this correctly, the adapter could switch between both implementations: use the old "FOR UPDATE" statement for Postgres and another implementation for a more generic MySQL flavor like MariaDB.
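The per-platform dispatch could be as small as the sketch below. This is an illustrative Python sketch of the design choice, not the adapter's real (PHP/Doctrine DBAL) code; the function name and strategy labels are hypothetical.

```python
def select_pop_strategy(platform_name: str) -> str:
    """Pick a job-fetch mechanism based on the database platform.

    Postgres locks only the selected rows under SELECT ... FOR UPDATE,
    so the existing mechanism can stay; MySQL/MariaDB (and unknown
    platforms, as a safe default) get the claim-by-UPDATE strategy.
    """
    if platform_name.lower() in ("postgresql", "postgres"):
        return "select_for_update"
    return "claim_by_update"
```

Keeping both paths preserves Postgres performance while fixing the broad row locking on MySQL-family databases.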

I can provide a PR if you want, but I'm unable to test it because, as I said, we don't use SlmQueue anymore.

Our system added a lot of features that (we thought) were missing from SlmQueue:

- locks (e.g. two jobs from different queues may not run together if they're accessing the same entity),
- more granular canExecute() checks and doExecute() returns,
- direct FKs to business-logic entities for easier access,
- a messaging system to log more information per job (e.g. "did x, failed at y, restarted at 09:00, succeeded with y"),
- per-status timeouts (e.g. reschedule a job if it's in "executing" for more than 10 minutes),
- ...

So it's hard to provide upstream PRs without completely revamping everything. :)

MatthiasKuehneEllerhold commented 4 years ago

Added a WIP pull request. Still missing is a way to migrate the table, because we need a new field for the workerId...