NCEAS / metadig-engine

MetaDig Engine: multi-dialect metadata assessment engine

Track jobs status in the `runs` table #350

Closed jeanetteclark closed 1 year ago

jeanetteclark commented 1 year ago

related to: https://github.com/NCEAS/metadig-engine/issues/327

If we use a preemptive ack to keep unclosed connections from piling up, we need to keep track of job status. The code should:

  1. look for an entry in the runs table with that metadata pid using Model.Run.getRun() [Worker:469]
  2. if no entry exists, insert one into the runs Postgres table with a status of "processing" and a run_count of 1 when the worker receives a RabbitMQ message from the controller [TODO Worker:480]. If an entry already exists, set run_count to n+1. If run_count > 10 (or some other threshold), update the status to "failed" and exit using Worker.Run.save()
  3. run the checks
  4. update the runs table entry so the status is "success" when the worker finishes
  5. periodically sweep the runs table for entries where the status is "processing" and the timestamp is more than 24 hours old (or some similar timeframe)
  6. requeue those jobs, sending the flow back to the worker, which starts at step 1 above
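The status/run_count logic in steps 1–4 might be sketched like this — a hypothetical in-memory stand-in, not the engine's actual code: a `Map` replaces the Postgres runs table, and `MAX_RUN_COUNT`, the method names, and the exact status strings are assumptions for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the run-tracking state machine described above.
// The real code would live in the Worker/Run classes and persist to Postgres.
public class RunTracker {
    static final int MAX_RUN_COUNT = 10;           // assumed retry limit
    // metadata pid -> {status, runCount}; stand-in for the runs table
    static Map<String, Object[]> runs = new HashMap<>();

    // Called when the worker receives a RabbitMQ message for this pid.
    // Returns the status the run has before the checks are executed.
    static String receive(String pid) {
        Object[] run = runs.get(pid);              // step 1: look up the run
        if (run == null) {
            runs.put(pid, new Object[]{"processing", 1});   // step 2: insert
            return "processing";
        }
        int count = (Integer) run[1] + 1;          // step 2: run_count = n + 1
        if (count > MAX_RUN_COUNT) {
            runs.put(pid, new Object[]{"failed", count});   // give up
            return "failed";
        }
        runs.put(pid, new Object[]{"processing", count});
        return "processing";
    }

    // Called after the checks complete (step 4).
    static void complete(String pid) {
        Object[] run = runs.get(pid);
        runs.put(pid, new Object[]{"success", run[1]});
    }
}
```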
jeanetteclark commented 1 year ago

after talking to Matt last week we decided that the audits for dangling jobs should be done in the Controller class, either using Quartz or by creating a new thread. The method to retrieve pids for dangling jobs is okay where it is but should be defined for all of the stores (local and fileSystem).

jeanetteclark commented 1 year ago

so, so far we have:

a new controller method monitor() which schedules MonitorJob, a Quartz job subclass. MonitorJob queries the DB for runs where the status is "processing" and the timestamp is more than 24 hours old; any runs returned are then submitted to processQualityRequest() from the Controller class
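The staleness check MonitorJob performs could be sketched as below. This is an illustrative stand-in only: the real job queries Postgres, while here a `Map` of pid → {status, timestamp} plays the table, and the class and constant names are assumptions.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: find runs still in "processing" whose timestamp is
// more than 24 hours old. These are the pids the monitor would resubmit
// to processQualityRequest().
public class StaleRunFinder {
    static final Duration STALE_AFTER = Duration.ofHours(24);

    static List<String> findStale(Map<String, Object[]> runs, Instant now) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Object[]> e : runs.entrySet()) {
            String status = (String) e.getValue()[0];
            Instant ts = (Instant) e.getValue()[1];
            if (status.equals("processing")
                    && Duration.between(ts, now).compareTo(STALE_AFTER) > 0) {
                stale.add(e.getKey());   // candidate for resubmission
            }
        }
        return stale;
    }
}
```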

things not yet done:

jeanetteclark commented 1 year ago

got a working version of the monitor method with MonitorJob: it successfully picks up a run stuck in "processing" for more than 24 hours in the SQL db and resubmits it to the worker, which (in this test) re-ran the job and changed the status in the db to "success". yay!

still to do:

mbjones commented 1 year ago

🏆 Nicely done.

jeanetteclark commented 1 year ago

I think I've tested everything I can test as a unit test. Without building an integration test framework, the best testing I can do is setting up some local scenarios, which I'll describe below. It's very manual and a bit of a hack, unfortunately.

To confirm that the RMQ bug is fixed, recreate it by:

To confirm that the quartz job picks up pids stuck in processing (scenario when worker dies after it acks the message from controller), while the controller and worker are running, trick the controller into finding an old run stuck in processing:
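One way to stage that scenario is to backdate a run directly in Postgres so it looks stale to the monitor. This is purely illustrative — the actual table and column names in the schema may differ:

```sql
-- Hypothetical: make an existing run look stuck so MonitorJob picks it up.
-- Column names (status, timestamp, metadata_pid) are assumptions.
UPDATE runs
   SET status = 'processing',
       timestamp = now() - interval '25 hours'
 WHERE metadata_pid = '<pid of an existing run>';
```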

All of this is working for me. @mbjones I know my tests are a hack, but after talking on Wednesday I think this is the path forward to release. Let me know if you want to see anything else before a PR to develop.

mbjones commented 1 year ago

LGTM. Let's discuss getting db testing into the framework to simplify your future testing.

jeanetteclark commented 1 year ago

This is finished and working correctly (deployed on the dev cluster in the snapshot release).