NCEAS / metadig-engine

MetaDig Engine: multi-dialect metadata assessment engine

Track jobs status in the `runs` table #350

Closed jeanetteclark closed 1 year ago

jeanetteclark commented 1 year ago

related to: https://github.com/NCEAS/metadig-engine/issues/327

If we use a preemptive ack to keep unclosed connections from piling up, we need to keep track of job status. The code should:

  1. look for an entry in the runs table with that metadata pid using Model.Run.getRun() [Worker:469]
  2. if no entry exists, insert one into the runs Postgres table with a status of "processing" and a run_count of 1 when the worker receives a RabbitMQ message from the controller [TODO Worker:480]. If an entry already exists, set run_count to n+1. If run_count > 10 (or some other threshold), update the status to "failed" and exit using Worker.Run.save()
  3. run the checks
  4. update the runs table entry so the status is "success" when the worker finishes
  5. periodically sweep the runs table for entries where the status is "processing" and the timestamp is more than 24 hours old (or some similar timeframe)
  6. requeue those jobs, sending the flow back to the worker, which starts at step 1 above
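The status/run_count logic in steps 1–4 might be sketched like this — a hypothetical in-memory stand-in, not the engine's actual code: a `Map` replaces the Postgres runs table, and `MAX_RUN_COUNT`, the method names, and the exact status strings are assumptions for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the run-tracking state machine described above.
// The real code would live in the Worker/Run classes and persist to Postgres.
public class RunTracker {
    static final int MAX_RUN_COUNT = 10;           // assumed retry limit
    // metadata pid -> {status, runCount}; stand-in for the runs table
    static Map<String, Object[]> runs = new HashMap<>();

    // Called when the worker receives a RabbitMQ message for this pid.
    // Returns the status the run has before the checks are executed.
    static String receive(String pid) {
        Object[] run = runs.get(pid);              // step 1: look up the run
        if (run == null) {
            runs.put(pid, new Object[]{"processing", 1});   // step 2: insert
            return "processing";
        }
        int count = (Integer) run[1] + 1;          // step 2: run_count = n + 1
        if (count > MAX_RUN_COUNT) {
            runs.put(pid, new Object[]{"failed", count});   // give up
            return "failed";
        }
        runs.put(pid, new Object[]{"processing", count});
        return "processing";
    }

    // Called after the checks complete (step 4).
    static void complete(String pid) {
        Object[] run = runs.get(pid);
        runs.put(pid, new Object[]{"success", run[1]});
    }
}
```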
jeanetteclark commented 1 year ago

after talking to Matt last week we decided that the audits for dangling jobs should be done in the Controller class, either using Quartz or by creating a new thread. The method to retrieve pids for dangling jobs is okay where it is but should be defined for all of the stores (local and fileSystem).

jeanetteclark commented 1 year ago

so, so far we have:

a new controller method monitor() which schedules MonitorJob, a Quartz job subclass. MonitorJob queries the DB for runs where the status is "processing" and the timestamp is more than 24 hours old; any runs returned are then submitted to processQualityRequest() from the Controller class
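The staleness check MonitorJob performs could be sketched as below. This is an illustrative stand-in only: the real job queries Postgres, while here a `Map` of pid → {status, timestamp} plays the table, and the class and constant names are assumptions.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: find runs still in "processing" whose timestamp is
// more than 24 hours old. These are the pids the monitor would resubmit
// to processQualityRequest().
public class StaleRunFinder {
    static final Duration STALE_AFTER = Duration.ofHours(24);

    static List<String> findStale(Map<String, Object[]> runs, Instant now) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Object[]> e : runs.entrySet()) {
            String status = (String) e.getValue()[0];
            Instant ts = (Instant) e.getValue()[1];
            if (status.equals("processing")
                    && Duration.between(ts, now).compareTo(STALE_AFTER) > 0) {
                stale.add(e.getKey());   // candidate for resubmission
            }
        }
        return stale;
    }
}
```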

things not yet done:

jeanetteclark commented 1 year ago

got a working version of the monitor method with MonitorJob: it successfully picks up a run stuck in "processing" for more than 24 hours in the SQL db and resubmits it to the worker, which (in this test) re-ran the job and changed the status in the db to "success". yay!

still to do:

mbjones commented 1 year ago

🏆 Nicely done.

jeanetteclark commented 1 year ago

I think I've tested everything I can test as a unit test. Without building an integration test framework, the best testing I can do is setting up some local scenarios, which I'll describe below. It's very manual and a bit of a hack, unfortunately.

To confirm that the RMQ bug is fixed, recreate it by:

To confirm that the quartz job picks up pids stuck in processing (scenario when worker dies after it acks the message from controller), while the controller and worker are running, trick the controller into finding an old run stuck in processing:
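One way to stage that scenario is to backdate a run directly in Postgres so it looks stale to the monitor. This is purely illustrative — the actual table and column names in the schema may differ:

```sql
-- Hypothetical: make an existing run look stuck so MonitorJob picks it up.
-- Column names (status, timestamp, metadata_pid) are assumptions.
UPDATE runs
   SET status = 'processing',
       timestamp = now() - interval '25 hours'
 WHERE metadata_pid = '<pid of an existing run>';
```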

All of this is working for me. @mbjones I know my tests are a hack, but after talking on Wednesday I think this is the path forward to release. Let me know if you want to see anything else before a PR to develop.

mbjones commented 1 year ago

LGTM. Let's discuss getting db testing into the framework to simplify your future testing.

jeanetteclark commented 1 year ago

This is finished and working correctly (deployed on the dev cluster in the snapshot release).