At-most-once jobs are lost in cluster exchange

gfloyd commented 9 years ago

Here is a scenario:

I have a two node cluster, and I add one job to each node with RETRY 0. If I issue GETJOB on node1 twice, the second time blocks since that node's queue is empty. I can see node2 receive NEEDJOBS and send YOURJOBS to node1, and I can see that node1 receives the YOURJOBS message, but it never queues the job because of (from what I can tell) this line in queue.c:

    if (job->state == JOB_STATE_QUEUED || job->qtime == 0)
        return DISQUE_ERR;

It seems like this might have something to do with never re-queueing a job to guarantee at most once deliverability when a job is set to RETRY 0, but in this case the job will never be delivered unless it is retrieved from the original node it was queued in before the original node delivers it to another node as part of YOURJOBS.

I hope that makes sense, it is highly likely that I overlooked something or am not understanding the code correctly!

sunheehnus commented 9 years ago

Looks more correct to me. :+1: As we are just moving an at-most-once job from the queue of node2 to node1, the job should be queued again in node2.

antirez commented 9 years ago

Hello, it's a bug indeed, since YOURJOBS messages are atomic, the jobs are removed from the current queue before sending them to the other peer. Fixing.

antirez commented 9 years ago

Fixed, tests added. Thanks for reporting.

antirez / disque

At-most-once jobs are lost in cluster exchange #100