SEL-Columbia / modelrunner

Framework for running models as long running jobs via the web

Jobs lost on worker restart #44

Closed by chrisnatali 8 years ago

chrisnatali commented 9 years ago

It appears that restarting a worker without restarting the primary can cause jobs to be lost (or stuck in the "CREATED" state). This may be because the redis blocking pop (blpop) call remains in place even after the process that invoked it is killed (leaving a zombie?). At least, that's what the behavior would indicate: a new job is pushed onto the queue and immediately vanishes.

For now, it's probably safest to restart the primary and then the workers to ensure that the redis queues are cleared of blocking calls.
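One way to check for a lingering blocking call is to look at the output of redis's `CLIENT LIST` command, where a connection waiting in a blocking operation carries the `b` flag. Below is a minimal sketch that parses that text output and lists the blocked connections. The helper names (`parse_client_list`, `blocked_clients`) and the sample output are made up for illustration and are not part of modelrunner.

```python
from __future__ import annotations


def parse_client_list(raw: str) -> list[dict[str, str]]:
    """Parse the text reply of redis CLIENT LIST into one dict per connection."""
    clients = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is a space-separated series of key=value fields.
        fields = dict(part.split("=", 1) for part in line.split(" ") if "=" in part)
        clients.append(fields)
    return clients


def blocked_clients(raw: str) -> list[dict[str, str]]:
    """Return connections flagged 'b' (waiting in a blocking operation such as blpop)."""
    return [c for c in parse_client_list(raw) if "b" in c.get("flags", "")]


# Hypothetical sample output: one connection blocked in blpop, one ordinary client.
SAMPLE = (
    "id=7 addr=172.17.0.3:51324 fd=8 name= age=120 idle=120 flags=b db=0 cmd=blpop\n"
    "id=9 addr=172.17.0.1:40112 fd=9 name= age=2 idle=0 flags=N db=0 cmd=client\n"
)

for client in blocked_clients(SAMPLE):
    print(client["id"], client["addr"], client["cmd"])
```

If a connection still shows up as blocked after the worker that opened it has been killed, that would support the zombie theory. If none do, the cause is probably elsewhere.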

chrisnatali commented 8 years ago

Related to #55.

chrisnatali commented 8 years ago

Could not reproduce this in a test environment with the primary and 2 workers running in docker. The workers only had the test model started.

Tried:

  1. starting workers
    • each worker had a single connection to redis
  2. stopping workers
    • no workers had connections to redis
  3. queuing a new test job
    • the test job was queued to the modelrunner:queues:test list
  4. starting the workers again
    • a worker picked up the test job as expected

chrisnatali commented 8 years ago

Doesn't seem to be happening with release 0.6