autolab / Tango

Standalone RESTful autograding service
http://www.autolabproject.com/
Apache License 2.0
48 stars 59 forks source link

Catch exception if job already removed from unassignedJob queue #207

Closed fanpu closed 3 years ago

fanpu commented 3 years ago

Addresses https://github.com/autolab/Tango/pull/196#issuecomment-789041313

There is a bug in jobQueue that causes it to raise an exception when attempting to remove a live job from the unassignedJob queue when it has already been removed previously/

This was because there was a wrong assumption that if a job is live, it must still be in the unassigned queue.

A live job can either be:

  1. running (assigned to vm)
  2. In the unassigned queue and waiting to be run on a vm.

The makeDead function is supposed to shift any live job to the dead job dictionary. So on line 312 we wrongly assumed that a live job has to be in the unassigned queue (it could be running) and hence will not be found in the unassigned queue.

This PR catches the ValueError that .remove may raise, and logs a message.

cg2v commented 3 years ago

While I'm sure this will prevent the exception, it seems like you're misunderstanding your state machine. jobQueue.makeDead can be called from 4 places:

  1. Exception handler in jobManager.__manage
  2. worker.rescheduleJob, when exhausting retries
  3. worker.run, after job succeeds
  4. jobQueue.delJob

In the first 3 cases, the job cannot be an unassigned job, because a) that's where it comes from in manage, and b) worker can only get a job that manage gave it.

delJob has no callers (I assume there's some idea of having a more complete REST interface, but it's not hooked up except in tests). IMO, The correct thing there would be to reject deleting an assigned job, and not calling makeDead (e.g. inlining the parts of makeDead that make sense for the delJob case)

fanpu commented 3 years ago

Thanks for your insights! I'll admit that I am still far from familiar with the Tango codebase, and am still learning more about it as I go.

IMO, The correct thing there would be to reject deleting an assigned job, and not calling makeDead (e.g. inlining the parts of makeDead that make sense for the delJob case)

If a job was assigned, wouldn't we still need to remove it from the liveJobs dictionary? I might be misunderstanding something here, but I thought that it will still be necessary to do clean-up in the event that an assigned job encounters an exception.