Jobs lost on worker restart

chrisnatali commented 9 years ago

It appears that restarting a worker without restarting the primary can cause jobs to be lost (or stuck in the "CREATED" state). This may be because the redis blocking pop (blpop) call remains in place even after the process that invoked it is killed (leaving a zombie?). At least that's what the behavior suggests: a new job pushed onto the queue vanishes immediately.

For now, it's probably safest to restart the primary and then workers to ensure that redis queues are cleared of blocking calls.
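
If the lingering blpop hypothesis holds, one possible mitigation is to give each blocking pop a finite timeout and stop the loop on SIGTERM, so a worker never sits in an open-ended blocking call while it is being shut down. The following is only a sketch under that assumption, not modelrunner's actual worker code: `run_worker`, `install_stop_handlers`, and `handle_job` are hypothetical names, and `client` is assumed to be any object with a redis-py style `blpop(keys, timeout)` method.

```python
import signal
import threading


def install_stop_handlers(stop_event):
    """Set stop_event on SIGTERM/SIGINT (call from the main thread).

    SIGKILL cannot be caught, so this only helps for graceful restarts.
    """
    def _request_stop(signum, frame):
        stop_event.set()

    signal.signal(signal.SIGTERM, _request_stop)
    signal.signal(signal.SIGINT, _request_stop)


def run_worker(client, queue_key, handle_job, stop_event, timeout=5):
    """Pop jobs from queue_key until stop_event is set.

    Each blpop blocks for at most `timeout` seconds, so the stop flag
    is re-checked regularly instead of blocking forever.
    """
    while not stop_event.is_set():
        item = client.blpop([queue_key], timeout=timeout)
        if item is None:
            # Timed out with no job available; loop and re-check the flag.
            continue
        _key, payload = item
        handle_job(payload)
```

With redis-py this might be wired up as `run_worker(redis.Redis(), "modelrunner:queues:test", handle_job, stop_event)` after calling `install_stop_handlers(stop_event)`. A job popped just before the process dies can still be lost, so this narrows the window rather than closing it.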

chrisnatali commented 8 years ago

Related to #55.

chrisnatali commented 8 years ago

Could not reproduce this in a test environment with the primary and 2 workers running in docker. The workers only had the test model started.

Tried:

- starting workers
  - each worker had a single connection to redis
- stopping workers
  - no workers had connections to redis
- queuing new test
  - test was queued to modelrunner:queues:test list
- starting workers
  - worker picks up test job as expected
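
The connection and queue checks in the steps above could be scripted roughly as follows. This is a sketch, not part of modelrunner: `blocked_poppers` and `queue_snapshot` are hypothetical helpers, and `client` is assumed to behave like a redis-py client, where `client_list()` returns one dict per connection (including its last command under `cmd`) and `llen()` returns the list length.

```python
def blocked_poppers(client):
    """Return the connections whose last command was BLPOP."""
    return [
        conn for conn in client.client_list()
        if str(conn.get("cmd", "")).lower() == "blpop"
    ]


def queue_snapshot(client, queue_key):
    """Summarize queue length and the clients currently waiting on a pop."""
    return {
        "queued": client.llen(queue_key),
        "blocked_clients": [conn.get("addr") for conn in blocked_poppers(client)],
    }
```

Calling `queue_snapshot(client, "modelrunner:queues:test")` after stopping the workers should show an empty `blocked_clients` list if no blocking call was left behind; a non-empty list at that point would support the zombie blpop theory.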

chrisnatali commented 8 years ago

Doesn't seem to be happening with release 0.6.

SEL-Columbia / modelrunner

Jobs lost on worker restart #44