ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs
Other
528 stars 111 forks source link

toil bug in preprocess stage? #1370

Closed AryehMiller closed 7 months ago

AryehMiller commented 7 months ago

Hi there! I'm currently trying to align several genomes in Cactus using the step-by-step approach (generated from cactus-prepare), and finishing up the cactus-preprocess step.

Here's my command:

cactus-preprocess /scratch/am3329/cactus/anoles/0 /projects/f_geneva_1/aryehmiller/cactus/LizardSeqsFinal.txt /projects/f_geneva_1/aryehmiller/cactus/AliOut/LizardSeqsFinal.txt --inputNames AnoFre AnoChl AnoSem AnoBar AnoCar AnoCyb AnoDis AnoSag AnoAur AnoGra AnoVal --maxCores 40 --maxMemory 300G --restart --logFile /projects/f_geneva_1/aryehmiller/cactus/AliOut/logs/preprocess-AnoFre.log

Everything ran fine for several days, but now I'm running into a persistent error in the SLURM log returning some sort of toil DAG related error? I've printed the toil information for the jobStore below-- perhaps this is a bug with how toil is marking and queueing jobs? Let me know if you have any thoughts on how to resolve this one, and many thanks!

(base) [am3329@gpuc001 out]$ toil status --printPerJobStats --printChildren --printLogs /scratch/am3329/cactus/anoles/0
[2024-04-26T10:20:31-0400] [MainThread] [W] [toil.common] Contradicting options passed: Clean flag is set to onSuccess despite the stats flag requiring the jobStore to be intact at the end of the run. Setting clean to 'never'.
[2024-04-26T10:20:31-0400] [MainThread] [I] [toil.utils.toilStatus] Traversing the job graph gathering jobs. This may take a couple of minutes.
JOB:'unzip_gzs' kind-unzip_then_pp/instance-udhyiiqp v4 LOG_FILE:None   TRYS_REMAINING:2        CHILD_NUMBER:0  READY_TO_RUN:TrueIS_ZOMBIE:False  HAS_SERVICES:False      IS_SERVICE:False
JOB:'CactusPreprocessor' kind-CactusPreprocessor/instance-g4wvrnx3 v5   LOG_FILE:None   TRYS_REMAINING:2        CHILD_NUMBER:0  READY_TO_RUN:True IS_ZOMBIE:False HAS_SERVICES:False      IS_SERVICE:False
JOB:'zip_gzs' kind-zip_gzs/instance-gfdji43_ v1 LOG_FILE:None   TRYS_REMAINING:2        CHILD_NUMBER:0  READY_TO_RUN:True       IS_ZOMBIE:False   HAS_SERVICES:False      IS_SERVICE:False
JOB:'BatchPreprocessor' kind-CactusPreprocessor2/instance-euk_5lqu v5   LOG_FILE:None   TRYS_REMAINING:2        CHILD_NUMBER:0  READY_TO_RUN:True IS_ZOMBIE:False HAS_SERVICES:False      IS_SERVICE:False
JOB:'BatchPreprocessor' kind-BatchPreprocessor/instance-gmo1ttpi v4     LOG_FILE:None   TRYS_REMAINING:2        CHILD_NUMBER:0  READY_TO_RUN:True IS_ZOMBIE:False HAS_SERVICES:False      IS_SERVICE:False
JOB:'BatchPreprocessor' kind-BatchPreprocessor/instance-xghrmwmo v4     LOG_FILE:None   TRYS_REMAINING:2        CHILD_NUMBER:0  READY_TO_RUN:True IS_ZOMBIE:False HAS_SERVICES:False      IS_SERVICE:False
JOB:'BatchPreprocessor' kind-BatchPreprocessor/instance-zaqwsmhn v18    LOG_FILE:files/for-job/kind-BatchPreprocessor/instance-zaqwsmhn/cleanup/file-4a916fbeb45945b9bb7bd7464f8d47a0/stream      TRYS_REMAINING:0        CHILD_NUMBER:0  READY_TO_RUN:True       IS_ZOMBIE:False   HAS_SERVICES:False      IS_SERVICE:False
JOB:'clean_if_different' kind-clean_if_different/instance-0vyqx418 v18  LOG_FILE:files/for-job/kind-clean_if_different/instance-0vyqx418/cleanup/file-d8a72ef068854367b7423b84399d20c7/stream     TRYS_REMAINING:0        CHILD_NUMBER:0  READY_TO_RUN:True       IS_ZOMBIE:False   HAS_SERVICES:False      IS_SERVICE:False
LOG_FILE_OF_JOB: 'unzip_gzs' kind-unzip_then_pp/instance-udhyiiqp v4 LOG: Job has no log file
LOG_FILE_OF_JOB: 'CactusPreprocessor' kind-CactusPreprocessor/instance-g4wvrnx3 v5 LOG: Job has no log file
LOG_FILE_OF_JOB: 'zip_gzs' kind-zip_gzs/instance-gfdji43_ v1 LOG: Job has no log file
LOG_FILE_OF_JOB: 'BatchPreprocessor' kind-CactusPreprocessor2/instance-euk_5lqu v5 LOG: Job has no log file
LOG_FILE_OF_JOB: 'BatchPreprocessor' kind-BatchPreprocessor/instance-gmo1ttpi v4 LOG: Job has no log file
LOG_FILE_OF_JOB: 'BatchPreprocessor' kind-BatchPreprocessor/instance-xghrmwmo v4 LOG: Job has no log file
Log from job "LOG_FILE_OF_JOB:'BatchPreprocessor' kind-BatchPreprocessor/instance-zaqwsmhn v18 LOG:" follows:
=========>
        [2024-04-26T10:08:20-0400] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
        [2024-04-26T10:08:20-0400] [MainThread] [I] [toil] Running Toil version 6.0.0-0e2a07a20818e593bfdfde3cc51ca4ad809fde96 on host memc001.amarel.rutgers.edu.
        [2024-04-26T10:08:20-0400] [MainThread] [I] [toil.worker] Working on job 'BatchPreprocessor' kind-BatchPreprocessor/instance-zaqwsmhn v16
        Traceback (most recent call last):
          File "/projectsc/f_geneva_1/aryehmiller/cactus/cactus-bin-v2.8.1/venv-cactus-v2.8.1/lib/python3.9/site-packages/toil/worker.py", line 381, in workerScript
            job = Job.loadJob(jobStore, jobDesc)
          File "/projectsc/f_geneva_1/aryehmiller/cactus/cactus-bin-v2.8.1/venv-cactus-v2.8.1/lib/python3.9/site-packages/toil/job.py", line 2714, in loadJob
            job = cls._unpickle(userModule, fileHandle, requireInstanceOf=Job)
          File "/projectsc/f_geneva_1/aryehmiller/cactus/cactus-bin-v2.8.1/venv-cactus-v2.8.1/lib/python3.9/site-packages/toil/job.py", line 2333, in _unpickle
            runnable = unpickler.load()
          File "/projectsc/f_geneva_1/aryehmiller/cactus/cactus-bin-v2.8.1/venv-cactus-v2.8.1/lib/python3.9/site-packages/toil/job.py", line 3345, in __new__
            return cls._resolve(*args)
          File "/projectsc/f_geneva_1/aryehmiller/cactus/cactus-bin-v2.8.1/venv-cactus-v2.8.1/lib/python3.9/site-packages/toil/job.py", line 3357, in _resolve
            value = safeUnpickleFromStream(fileHandle)
          File "/projectsc/f_geneva_1/aryehmiller/cactus/cactus-bin-v2.8.1/venv-cactus-v2.8.1/lib/python3.9/site-packages/toil/common.py", line 1674, in safeUnpickleFromStream
            return pickle.loads(string)
          File "/projectsc/f_geneva_1/aryehmiller/cactus/cactus-bin-v2.8.1/venv-cactus-v2.8.1/lib/python3.9/site-packages/toil/job.py", line 3345, in __new__
            return cls._resolve(*args)
          File "/projectsc/f_geneva_1/aryehmiller/cactus/cactus-bin-v2.8.1/venv-cactus-v2.8.1/lib/python3.9/site-packages/toil/job.py", line 3357, in _resolve
            value = safeUnpickleFromStream(fileHandle)
          File "/projectsc/f_geneva_1/aryehmiller/cactus/cactus-bin-v2.8.1/venv-cactus-v2.8.1/lib/python3.9/site-packages/toil/common.py", line 1674, in safeUnpickleFromStream
            return pickle.loads(string)
          File "/projectsc/f_geneva_1/aryehmiller/cactus/cactus-bin-v2.8.1/venv-cactus-v2.8.1/lib/python3.9/site-packages/toil/job.py", line 3479, in __setstate__
            raise RuntimeError(
        RuntimeError: This job was passed promise files/no-job/file-c9e264e7c95346b288ccec1fe62d80d0/stream that wasn't yet resolved when it ran. The job 'RedMaskJob' kind-RedMaskJob/instance-tzgklhde v0 that fulfills this promise hasn't yet finished. This means that there aren't enough constraints to ensure the current job always runs after 'RedMaskJob' kind-RedMaskJob/instance-tzgklhde v0. Consider adding a follow-on indirection between this job and its parent, or adding this job as a child/follow-on of 'RedMaskJob' kind-RedMaskJob/instance-tzgklhde v0.
        [2024-04-26T10:08:20-0400] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host memc001.amarel.rutgers.edu
<=========
Log from job "LOG_FILE_OF_JOB:'clean_if_different' kind-clean_if_different/instance-0vyqx418 v18 LOG:" follows:
=========>
        [2024-04-26T10:08:20-0400] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
        [2024-04-26T10:08:20-0400] [MainThread] [I] [toil] Running Toil version 6.0.0-0e2a07a20818e593bfdfde3cc51ca4ad809fde96 on host memc001.amarel.rutgers.edu.
        [2024-04-26T10:08:20-0400] [MainThread] [I] [toil.worker] Working on job 'clean_if_different' kind-clean_if_different/instance-0vyqx418 v16
        Traceback (most recent call last):
          File "/projectsc/f_geneva_1/aryehmiller/cactus/cactus-bin-v2.8.1/venv-cactus-v2.8.1/lib/python3.9/site-packages/toil/worker.py", line 381, in workerScript
            job = Job.loadJob(jobStore, jobDesc)
          File "/projectsc/f_geneva_1/aryehmiller/cactus/cactus-bin-v2.8.1/venv-cactus-v2.8.1/lib/python3.9/site-packages/toil/job.py", line 2714, in loadJob
            job = cls._unpickle(userModule, fileHandle, requireInstanceOf=Job)
          File "/projectsc/f_geneva_1/aryehmiller/cactus/cactus-bin-v2.8.1/venv-cactus-v2.8.1/lib/python3.9/site-packages/toil/job.py", line 2333, in _unpickle
            runnable = unpickler.load()
          File "/projectsc/f_geneva_1/aryehmiller/cactus/cactus-bin-v2.8.1/venv-cactus-v2.8.1/lib/python3.9/site-packages/toil/job.py", line 3345, in __new__
            return cls._resolve(*args)
          File "/projectsc/f_geneva_1/aryehmiller/cactus/cactus-bin-v2.8.1/venv-cactus-v2.8.1/lib/python3.9/site-packages/toil/job.py", line 3357, in _resolve
            value = safeUnpickleFromStream(fileHandle)
          File "/projectsc/f_geneva_1/aryehmiller/cactus/cactus-bin-v2.8.1/venv-cactus-v2.8.1/lib/python3.9/site-packages/toil/common.py", line 1674, in safeUnpickleFromStream
            return pickle.loads(string)
          File "/projectsc/f_geneva_1/aryehmiller/cactus/cactus-bin-v2.8.1/venv-cactus-v2.8.1/lib/python3.9/site-packages/toil/job.py", line 3345, in __new__
            return cls._resolve(*args)
          File "/projectsc/f_geneva_1/aryehmiller/cactus/cactus-bin-v2.8.1/venv-cactus-v2.8.1/lib/python3.9/site-packages/toil/job.py", line 3357, in _resolve
            value = safeUnpickleFromStream(fileHandle)
          File "/projectsc/f_geneva_1/aryehmiller/cactus/cactus-bin-v2.8.1/venv-cactus-v2.8.1/lib/python3.9/site-packages/toil/common.py", line 1674, in safeUnpickleFromStream
            return pickle.loads(string)
          File "/projectsc/f_geneva_1/aryehmiller/cactus/cactus-bin-v2.8.1/venv-cactus-v2.8.1/lib/python3.9/site-packages/toil/job.py", line 3479, in __setstate__
            raise RuntimeError(
        RuntimeError: This job was passed promise files/no-job/file-d684829225844459a4b1c63b15fb0832/stream that wasn't yet resolved when it ran. The job 'RedMaskJob' kind-RedMaskJob/instance-tzgklhde v0 that fulfills this promise hasn't yet finished. This means that there aren't enough constraints to ensure the current job always runs after 'RedMaskJob' kind-RedMaskJob/instance-tzgklhde v0. Consider adding a follow-on indirection between this job and its parent, or adding this job as a child/follow-on of 'RedMaskJob' kind-RedMaskJob/instance-tzgklhde v0.
        [2024-04-26T10:08:20-0400] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host memc001.amarel.rutgers.edu
<=========
CHILDREN_OF_JOB:'unzip_gzs' kind-unzip_then_pp/instance-udhyiiqp v4     (CHILD_JOB:kind-CactusPreprocessor/instance-g4wvrnx3,PRECEDENCE:0)
CHILDREN_OF_JOB:'CactusPreprocessor' kind-CactusPreprocessor/instance-g4wvrnx3 v5       (CHILD_JOB:kind-zip_gzs/instance-gfdji43_,PRECEDENCE:0)   (CHILD_JOB:kind-CactusPreprocessor2/instance-euk_5lqu,PRECEDENCE:1)
CHILDREN_OF_JOB:'zip_gzs' kind-zip_gzs/instance-gfdji43_ v1 
CHILDREN_OF_JOB:'BatchPreprocessor' kind-CactusPreprocessor2/instance-euk_5lqu v5       (CHILD_JOB:kind-BatchPreprocessor/instance-gmo1ttpi,PRECEDENCE:0)
CHILDREN_OF_JOB:'BatchPreprocessor' kind-BatchPreprocessor/instance-gmo1ttpi v4         (CHILD_JOB:kind-BatchPreprocessor/instance-xghrmwmo,PRECEDENCE:0)
CHILDREN_OF_JOB:'BatchPreprocessor' kind-BatchPreprocessor/instance-xghrmwmo v4         (CHILD_JOB:kind-BatchPreprocessor/instance-zaqwsmhn,PRECEDENCE:0) (CHILD_JOB:kind-clean_if_different/instance-0vyqx418,PRECEDENCE:0)
CHILDREN_OF_JOB:'BatchPreprocessor' kind-BatchPreprocessor/instance-zaqwsmhn v18 
CHILDREN_OF_JOB:'clean_if_different' kind-clean_if_different/instance-0vyqx418 v18 
Of the 8 jobs considered, there are 5 jobs with children, 3 jobs ready to run, 0 zombie jobs, 0 jobs with services, 0 services, and 2 jobs with log files currently in FileJobStore(/scratch/am3329/cactus/anoles/0).
glennhickey commented 7 months ago

This type of error is generally due to a bug in Cactus, not Toil. I'm having a lot of trouble reproducing though. Did it happen before you ran with --restart? Do you have the original logs from all your runs?

I did notice a separate bug which I fixed here: https://github.com/ComparativeGenomicsToolkit/cactus/pull/1371

The upshot of this one is that by only running the RED masking (not not lastz), the command should run about 100X faster. You can achieve this with the current release by adding --maskMode red.

AryehMiller commented 7 months ago

Hi Glenn, thanks for your quick response! I just e-mailed you the full log file from this failed run. I believe the run did indeed work before I ran with --restart, but can't seem to find the log file for that run (it was with a different dataset).

Given the resolution in #1371, do you recommend beginning a new run, without --restart, and adding in --maskMode red flag? Thanks!

glennhickey commented 7 months ago

The log you sent is with --restart -- it's the log from the first run before running again with --restart that I'm after. The reason is I can't reproduce it locally, so I'm wondering (and this has happened before) if the error is particular to running with --restart -- I've tried aborting it myself and restarting without much luck.

Anyway, yes my suggestion is definitely to rerun, from scratch, using --maskMode red -- this is what the default behaviour should be, if it weren't for bug #1371 (and will be in the next release). This shouldn't take too long : a 3GB genome typically runs in under 2 hours on 1 core. Even if your input is already masked with other tools, this will often save you tons of time and memory in subsequent stages of the pipeline.

AryehMiller commented 7 months ago

Can't seem to find the log from the first run without --restart, but have added in --maskMode red in a run from scratch, and the preprocess run step ran cleanly and finished in 35 mins! Thanks for helping to resolve this!