DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

jobs.id uniqueness constraint error when restarting job #4262

Open boyangzhao opened 2 years ago

boyangzhao commented 2 years ago

When I try to restart a run, a worker job fails with the message below, which indicates that the job ID is not unique in the caching database.

[2022-11-01T17:23:24+0000] [MainThread] [D] [toil.deferred] Opened with own state file /var/run/user/0/toil/3ed48ecf21f35b2890739413e3612fbf/deferred/func5yd1ta4l
[2022-11-01T17:23:24+0000] [MainThread] [D] [toil.jobStores.aws.jobStore] Loaded job 36c0d872-7166-4918-8351-bc7cf1799679
[2022-11-01T17:23:24+0000] [MainThread] [D] [toil.worker] Parsed job description
[2022-11-01T17:23:24+0000] [MainThread] [I] [toil.worker] Working on job 'CWLJob' align 36c0d872-7166-4918-8351-bc7cf1799679 v1
[2022-11-01T17:23:24+0000] [MainThread] [D] [toil.worker] Got a command to run: _toil ea9ad1b9-c2e8-42bc-a07a-03d2ef0eeadd /usr/local/lib/python3.8/dist-packages toil.cwl.cwltoil False
[2022-11-01T17:23:24+0000] [MainThread] [D] [toil.job] Loading user module ModuleDescriptor(dirPath='/usr/local/lib/python3.8/dist-packages', name='toil.cwl.cwltoil', fromVirtualEnv=False).
[2022-11-01T17:23:25+0000] [MainThread] [D] [toil.jobStores.aws.jobStore] Reading FileInfo(fileID='ea9ad1b9-c2e8-42bc-a07a-03d2ef0eeadd', ownerID='36c0d872-7166-4918-8351-bc7cf1799679', encrypted=False, version='FYkc40eoDoH9yEO2hmeN.TOMHMeNkn7m', previousVersion='FYkc40eoDoH9yEO2hmeN.TOMHMeNkn7m', content=None, checksum='sha1$ff48b877dd1...2b01375b97069327f', _numContentChunks=0) bound to <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7f9332ed0ca0> into stream.
[2022-11-01T17:23:25+0000] [Thread-6  ] [D] [toil.jobStores.aws.jobStore] Starting sha1 checksum to match ff48b877dd145a84adfd30d2b01375b97069327f
[2022-11-01T17:23:25+0000] [Thread-6  ] [D] [toil.jobStores.aws.jobStore] Completed checksum with hash ff48b877dd145a84adfd30d2b01375b97069327f vs. expected ff48b877dd145a84adfd30d2b01375b97069327f
[2022-11-01T17:23:25+0000] [MainThread] [I] [toil.worker] Loaded body Job('CWLJob' align 36c0d872-7166-4918-8351-bc7cf1799679 v1) from description 'CWLJob' align 36c0d872-7166-4918-8351-bc7cf1799679 v1
[2022-11-01T17:23:25+0000] [MainThread] [D] [toil.fileStores.cachingFileStore] Starting job ('CWLJob' align 36c0d872-7166-4918-8351-bc7cf1799679 v1) with ID (36c0d872-7166-4918-8351-bc7cf1799679).
[2022-11-01T17:23:25+0000] [MainThread] [D] [toil.deferred] Running orphaned deferred functions
[2022-11-01T17:23:25+0000] [MainThread] [D] [toil.deferred] Ran orphaned deferred functions from 0 abandoned state files
[2022-11-01T17:23:25+0000] [MainThread] [D] [toil.deferred] Running job
[2022-11-01T17:23:25+0000] [MainThread] [D] [toil.fileStores.cachingFileStore] Actually running job ('CWLJob' align 36c0d872-7166-4918-8351-bc7cf1799679 v1) with ID (36c0d872-7166-4918-8351-bc7cf1799679) which wants 2147483648 of our 992536899584 bytes.
[2022-11-01T17:23:25+0000] [MainThread] [E] [root] Error talking to caching database: UNIQUE constraint failed: jobs.id
[2022-11-01T17:23:25+0000] [MainThread] [D] [toil.deferred] Running own deferred functions
[2022-11-01T17:23:25+0000] [MainThread] [D] [toil.deferred] Out of deferred functions!
[2022-11-01T17:23:25+0000] [MainThread] [D] [toil.deferred] Running orphaned deferred functions
[2022-11-01T17:23:25+0000] [MainThread] [D] [toil.deferred] Ran orphaned deferred functions from 0 abandoned state files
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/toil/worker.py", line 394, in workerScript
    with fileStore.open(job):
  File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.8/dist-packages/toil/fileStores/cachingFileStore.py", line 990, in open
    self._allocateSpaceForJob(self.jobDiskBytes)
  File "/usr/local/lib/python3.8/dist-packages/toil/fileStores/cachingFileStore.py", line 795, in _allocateSpaceForJob
    self._write([('INSERT INTO jobs VALUES (?, ?, ?, ?)', (self.jobID, self.localTempDir, newJobReqs, me))])
  File "/usr/local/lib/python3.8/dist-packages/toil/fileStores/cachingFileStore.py", line 331, in _write
    return self._staticWrite(self.con, self.cur, operations)
  File "/usr/local/lib/python3.8/dist-packages/toil/lib/retry.py", line 293, in call
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/toil/fileStores/cachingFileStore.py", line 305, in _staticWrite
    raise e
  File "/usr/local/lib/python3.8/dist-packages/toil/fileStores/cachingFileStore.py", line 292, in _staticWrite
    cur.execute(command, args)
sqlite3.IntegrityError: UNIQUE constraint failed: jobs.id
[2022-11-01T17:23:25+0000] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host ip-172-31-39-15
[2022-11-01T17:23:25+0000] [MainThread] [D] [toil.jobStores.aws.jobStore] Loaded job 36c0d872-7166-4918-8351-bc7cf1799679


adamnovak commented 2 years ago

This happens when a job is still logged in the caching database as running and consuming disk space, and the same job tries to start again on the same machine.

Are you sure you didn't end up with two attempts to run the same job at the same time? The leader can be forcibly killed without having time to delete all the jobs it submitted to the backing scheduler, and then be started again, so a job from the previous attempt can still be running while the same job tries to run again. Or the same workflow could be restarted twice in parallel.

Before we try to add the current job to the database, we already call _removeDeadJobs(), so if the other instance of the job isn't actually running in a live local process, we should have removed it from the database already.

To fix this, we probably just want to separate the whole cache out by workflow attempt number. We shouldn't be talking to zombie jobs from previous attempts of the same workflow even if they are still running; we're no longer responsible for their disk space.
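For illustration only (this is not Toil's actual cache schema; the column names and the attempt column are assumptions), here is a minimal sketch of how a jobs table keyed only on the job ID hits this constraint when a stale row for the same job survives, and how also keying on the workflow attempt number would avoid the collision:

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical single-key schema, loosely modeled on the INSERT in the traceback:
# a second INSERT for the same job ID (e.g. a zombie row left over from a
# previous workflow attempt) violates the primary-key constraint.
cur.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, tempdir TEXT, disk INTEGER, worker INTEGER)")
cur.execute("INSERT INTO jobs VALUES (?, ?, ?, ?)",
            ("36c0d872-7166-4918-8351-bc7cf1799679", "/tmp/attempt0", 2147483648, 101))
try:
    cur.execute("INSERT INTO jobs VALUES (?, ?, ?, ?)",
                ("36c0d872-7166-4918-8351-bc7cf1799679", "/tmp/attempt1", 2147483648, 202))
except sqlite3.IntegrityError as e:
    print(e)  # UNIQUE constraint failed: jobs.id

# One possible shape of the fix sketched above (an assumption, not what Toil does):
# include the workflow attempt number in the key, so rows left behind by a
# previous attempt cannot collide with the current attempt's row for the same job.
cur.execute("CREATE TABLE jobs_by_attempt (id TEXT, attempt INTEGER, tempdir TEXT, "
            "disk INTEGER, worker INTEGER, PRIMARY KEY (id, attempt))")
cur.execute("INSERT INTO jobs_by_attempt VALUES (?, ?, ?, ?, ?)",
            ("36c0d872-7166-4918-8351-bc7cf1799679", 0, "/tmp/attempt0", 2147483648, 101))
cur.execute("INSERT INTO jobs_by_attempt VALUES (?, ?, ?, ?, ?)",
            ("36c0d872-7166-4918-8351-bc7cf1799679", 1, "/tmp/attempt1", 2147483648, 202))
print(cur.execute("SELECT COUNT(*) FROM jobs_by_attempt").fetchone()[0])  # 2
con.close()

Keeping an entirely separate cache database per workflow attempt, as suggested above, would have the same effect: the current attempt would never see, or collide with, zombie rows from earlier attempts.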

boyangzhao commented 2 years ago

I used toil kill to kill the running job, waited a fairly long time, and then reran the job. What I noticed in the Mesos monitoring is that after toil killed the job, the framework still existed and the worker node was still kept around. So when I started the new run on the leader node, it created a new framework in addition to the existing one. When I removed all the nodes by destroying the cluster with toil destroy-cluster (which killed off all running EC2 instances) and then started a new cluster with toil launch-cluster, the error disappeared.

Does the restart feature run only with the previous CWL workflow and YAML inputs (presumably already cached in the job store)? I tested it a bit: if a job failed and I fixed part of the workflow (or changed some of the inputs defined in the YAML), the restart ignored the changes. In fact, toil-cwl-runner with the restart argument seems to ignore the CWL and YAML arguments, even though it still requires a CWL argument (it just doesn't look at it). So if a large part of the workflow has already completed and won't change, and I update the part where it failed, ideally a restart would pick up where it left off but with the updated workflow; instead it uses the previously submitted workflow?

Also, everything in the job store is hashed. Is there a way to pull out the outputs generated so far during a run (the filenames and which part of the workflow each came from), for example when a run failed halfway, from the job store?