DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

Unable to restart failed job due to NoSuchFileException #4504

Closed: glennhickey closed this issue 1 year ago

glennhickey commented 1 year ago

I have a job that's running out of memory on Slurm and getting killed. When I go to restart it with singleMachine (which normally works fine), I get an error like this:

[2023-06-19T06:47:30-0700] [MainThread] [W] [toil.leader] Job failed with exit value 1: 'vg_to_og' kind-clip_vg/instance-fh8zs_29 v36
Exit reason: None
[2023-06-19T06:47:30-0700] [MainThread] [W] [toil.leader] The job seems to have left a log file, indicating failure: 'vg_to_og' kind-clip_vg/instance-fh8zs_29 v38
[2023-06-19T06:47:30-0700] [MainThread] [W] [toil.leader] Log from job "kind-clip_vg/instance-fh8zs_29" follows:
=========>
        [2023-06-19T06:47:29-0700] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
        [2023-06-19T06:47:29-0700] [MainThread] [I] [toil] Running Toil version 5.11.0-9a04dabb36d6ab13ed1ac7c711dbdc8c71724dc9 on host mustard.prism.
        [2023-06-19T06:47:29-0700] [MainThread] [I] [toil.worker] Working on job 'vg_to_og' kind-clip_vg/instance-fh8zs_29 v37
        Traceback (most recent call last):
          File "/private/home/hickey/dev/cactus.pangenome/venv-cactus-pangenome/lib/python3.10/site-packages/toil/worker.py", line 377, in workerScript
            job = Job.loadJob(jobStore, jobDesc)
          File "/private/home/hickey/dev/cactus.pangenome/venv-cactus-pangenome/lib/python3.10/site-packages/toil/job.py", line 2657, in loadJob  
            with manager as fileHandle:
          File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
            return next(self.gen)
          File "/private/home/hickey/dev/cactus.pangenome/venv-cactus-pangenome/lib/python3.10/site-packages/toil/jobStores/fileJobStore.py", line 657, in read_file_stream
            self._check_job_store_file_id(file_id)
          File "/private/home/hickey/dev/cactus.pangenome/venv-cactus-pangenome/lib/python3.10/site-packages/toil/jobStores/fileJobStore.py", line 839, in _check_job_store_file_id
            raise NoSuchFileException(jobStoreFileID)
        toil.jobStores.abstractJobStore.NoSuchFileException: File 'files/for-job/kind-vg_to_og/instance-yybmsm3i/cleanup/file-c87a04ee86fc4e938468eed073ab3462/stream' does not exist.
        [2023-06-19T06:47:30-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host mustard.prism

This happens every time: I run the workflow from the beginning, it dies, and it won't restart. I rerun from the beginning with a bit more memory, it dies again, again it won't restart, and so on.

This is new code on my end, where vg_to_og is a FollowOn of clip_vg:

hprc-jun16-slurm-mc.log:    [2023-06-17T09:54:18-0700] [MainThread] [I] [toil.worker] Chaining from 'clip_vg' kind-clip_vg/instance-fh8zs_29 v3 to 'vg_to_og' kind-vg_to_og/instance-yybmsm3i v1

but there doesn't seem to be anything really different about it. I've only ever seen this issue come up now, using the latest Toil release.
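
For reference, the FollowOn wiring described above looks roughly like the minimal sketch below (this is not the actual Cactus code; the job bodies and resource requests are placeholders), together with the --restart path that fails here:

    # Minimal sketch of a clip_vg -> vg_to_og follow-on, plus the restart path.
    from toil.common import Toil
    from toil.job import Job

    def clip_vg(job):
        job.log("clip_vg running")
        # The real job would produce a clipped graph; a string keeps the sketch simple.
        return "clipped"

    def vg_to_og(job, clipped):
        job.log("vg_to_og running on %s" % clipped)

    def make_root(job):
        clip = job.addChildJobFn(clip_vg, memory="4G")          # placeholder memory
        # vg_to_og is a follow-on of clip_vg, matching the relationship in the issue.
        clip.addFollowOnJobFn(vg_to_og, clip.rv(), memory="8G")  # placeholder memory

    if __name__ == "__main__":
        parser = Job.Runner.getDefaultArgumentParser()
        options = parser.parse_args()
        with Toil(options) as toil:
            if options.restart:
                toil.restart()   # the --restart path that raises NoSuchFileException here
            else:
                toil.start(Job.wrapJobFn(make_root))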

@adamnovak writes

I don't think this is supposed to happen; we're supposed to only commit the job descriptions after the bodies are on disk, and we're supposed to leave the bodies around until the jobs that made them are cleaned up, which won't happen until after all their children/follow-ons finish.

It also looks like chaining is involved here, since the job's name isn't the same as the name that was used to make the ID it is under. Maybe there's a consistency error in the chaining logic?

There might be some kind of consistency bug then, related to cleanup-able files for a chained-to job not being cleaned up at the right time relative to deletion of the chained-from job description. I don't think we did anything in 5.11 that ought to introduce that, but it's possible.
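
The intended ordering described above can be pictured roughly like this (an illustrative sketch only, using made-up job-store method names, not Toil's real internals):

    # Sketch of the invariant: commit a JobDescription only after its body is on disk,
    # and delete the body only after every child/follow-on of that job has finished.

    def save_new_job(job_store, description, body_bytes):
        # 1. Put the body on disk first... (hypothetical write_file/commit_description API)
        description.body_file_id = job_store.write_file(body_bytes)
        # 2. ...then commit the description that points at it, so any reader that can
        # see the description can always load the body.
        job_store.commit_description(description)

    def cleanup_job(job_store, description):
        # The body must outlive the job until everything that might still need it
        # (children and follow-ons) has completed.
        if all(job_store.is_finished(j) for j in description.children + description.follow_ons):
            job_store.delete_file(description.body_file_id)
            job_store.delete_description(description)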

┆Issue is synchronized with this Jira Story ┆Issue Number: TOIL-1352

adamnovak commented 1 year ago

@glennhickey Are you 100% sure it is dying the first time due to OOM, if you can't actually give it enough memory that it succeeds?

If you can reproduce this every time, can you send a workflow commit we could use to reproduce it?

glennhickey commented 1 year ago

Well, the command dies, but it goes through if I rerun with more memory given to the job. I will try to package up a way to reproduce it and post it here.

adamnovak commented 1 year ago

OK, my current theory here is that, when we chain from one job to another, we don't delete the job we chained to because we need its body to remain in the job store. We only delete it once the job that chained to it finishes successfully.

If the job that was chained to fails, that never happens. So the chained-to job remains in the job store, as well as the job that chained to it and replaced it.

So I think both jobs end up trying to run from the same body file?

I think I need to take another look at the whole concept of jobsToDelete, and how we do the commits of job changes to the job store when chaining.
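
The state being hypothesized looks something like this (a hypothetical illustration, not Toil data structures):

    # After chaining, the worker that ran clip_vg takes over vg_to_og's body, so two
    # descriptions end up referencing the same body file; deleting either one too early
    # strands the other.
    chained_from = {   # clip_vg's description, now carrying the chained-to body
        "id": "kind-clip_vg/instance-fh8zs_29",
        "body_file": "files/for-job/kind-vg_to_og/instance-yybmsm3i/cleanup/file-c87a04ee86fc4e938468eed073ab3462/stream",
    }
    chained_to = {     # vg_to_og's original description, kept around for its body
        "id": "kind-vg_to_og/instance-yybmsm3i",
        "body_file": "files/for-job/kind-vg_to_og/instance-yybmsm3i/cleanup/file-c87a04ee86fc4e938468eed073ab3462/stream",
    }
    # If cleanup of chained_to (e.g. via jobsToDelete) removes this body file while
    # chained_from still needs it, reloading chained_from raises NoSuchFileException.
    assert chained_from["body_file"] == chained_to["body_file"]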

adamnovak commented 1 year ago

OK, maybe what is happening is, when we chain from one job to the next, we cut the successor relationship to that job. This makes the chained-to job no longer reachable from the root of the workflow, so on restart it is deleted and its body file is removed. But we need the body file to run the job that chained to it.
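
That failure mode can be sketched as a reachability sweep over a toy job graph (again, a sketch of the hypothesis, not Toil's actual restart logic):

    # If restart garbage-collects everything unreachable from the root, and chaining has
    # cut the edge to vg_to_og while clip_vg's description still points at vg_to_og's
    # body file, the body file gets deleted out from under the chained job.
    jobs = {
        "root":     {"successors": ["clip_vg"], "body_file": "root/stream"},
        # After chaining, clip_vg's description carries vg_to_og's body...
        "clip_vg":  {"successors": [], "body_file": "kind-vg_to_og/instance-yybmsm3i/stream"},
        # ...but no edge from the root to vg_to_og remains.
        "vg_to_og": {"successors": [], "body_file": "kind-vg_to_og/instance-yybmsm3i/stream"},
    }

    def reachable(jobs, root):
        seen, stack = set(), [root]
        while stack:
            j = stack.pop()
            if j not in seen:
                seen.add(j)
                stack.extend(jobs[j]["successors"])
        return seen

    live = reachable(jobs, "root")
    doomed_files = {jobs[j]["body_file"] for j in jobs if j not in live}
    needed_files = {jobs[j]["body_file"] for j in live}
    print(doomed_files & needed_files)   # the shared body file: deleted, yet still needed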

glennhickey commented 1 year ago

Thanks @adamnovak !!