ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Failed cactus-blast job #320

Open brettChapman opened 3 years ago

brettChapman commented 3 years ago

Hi

I've just run a cactus-blast job, based on the cactus-prepare output, using the following command:

cactus-blast /cactus/jobStore /data/pangenome_cactus/steps-output/pangenome_guidetree.txt /data/pangenome_cactus/steps-output/Anc01.cigar --root Anc01 --workDir=/cactus/workDir --defaultDisk 50G --binariesMode local

However, after a few days it seems to have failed, and I'm not sure why. I'm using the latest Cactus build from Docker.

Could the problem be inadequate RAM on the node I'm running on? I noticed there was a recent memory leak fix with cactus-blast.

The error messages are below:

[2020-09-30T04:44:47+0000] [MainThread] [I] [toil.leader] Issued job 'CollateBlasts2' kind-CollateBlasts2/instance-bm2dt8m6 with job batch system ID: 68663 and cores: 1, disk: 58.3 G, and memory: 2.0 G
[2020-09-30T04:44:47+0000] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/node-29a916e8-19d1-496d-aa1e-eac0e7008ae7-39dc6c30-6ae0-46c7-810f-70275b6f1fdd/tmpcoti4mp5/worker_log.txt
[2020-09-30T04:44:48+0000] [MainThread] [I] [toil.leader] Job ended: 'CollateBlasts2' kind-CollateBlasts2/instance-bm2dt8m6
[2020-09-30T04:44:48+0000] [MainThread] [I] [toil.leader] Issued job 'CollateBlasts' kind-CollateBlasts/instance-hv1qhd2h with job batch system ID: 68664 and cores: 1, disk: 50.0 G, and memory: 2.0 G
[2020-09-30T04:44:49+0000] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/node-29a916e8-19d1-496d-aa1e-eac0e7008ae7-39dc6c30-6ae0-46c7-810f-70275b6f1fdd/tmpbw34mcrc/worker_log.txt
[2020-09-30T04:44:49+0000] [MainThread] [I] [toil.leader] Job ended: 'CollateBlasts' kind-CollateBlasts/instance-hv1qhd2h
[2020-09-30T04:44:49+0000] [MainThread] [I] [toil.leader] Issued job 'MakeOffDiagonalBlasts' kind-MakeOffDiagonalBlasts/instance-b2n1tb_z with job batch system ID: 68665 and cores: 1, disk: 50.0 G, and memory: 2.0 G
[2020-09-30T04:44:50+0000] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/node-29a916e8-19d1-496d-aa1e-eac0e7008ae7-39dc6c30-6ae0-46c7-810f-70275b6f1fdd/tmps694sw1l/worker_log.txt
[2020-09-30T04:47:46+0000] [MainThread] [I] [toil.leader] Job ended: 'MakeOffDiagonalBlasts' kind-MakeOffDiagonalBlasts/instance-b2n1tb_z
[2020-09-30T04:47:46+0000] [MainThread] [I] [toil.leader] Issued job 'CollateBlasts' kind-CollateBlasts/instance-cpgrt72g with job batch system ID: 68666 and cores: 1, disk: 50.0 G, and memory: 2.0 G
[2020-09-30T04:47:46+0000] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/node-29a916e8-19d1-496d-aa1e-eac0e7008ae7-39dc6c30-6ae0-46c7-810f-70275b6f1fdd/tmpweghp1r0/worker_log.txt
[2020-09-30T04:47:47+0000] [MainThread] [I] [toil.leader] Job ended: 'CollateBlasts' kind-CollateBlasts/instance-cpgrt72g
[2020-09-30T04:47:47+0000] [MainThread] [I] [toil.leader] Issued job 'CollateBlasts2' kind-CollateBlasts2/instance-6_5skl83 with job batch system ID: 68667 and cores: 1, disk: 60.1 G, and memory: 2.0 G
[2020-09-30T04:47:48+0000] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/node-29a916e8-19d1-496d-aa1e-eac0e7008ae7-39dc6c30-6ae0-46c7-810f-70275b6f1fdd/tmp_hb6127o/worker_log.txt
[2020-09-30T04:57:29+0000] [MainThread] [I] [toil.leader] Job ended: 'CollateBlasts2' kind-CollateBlasts2/instance-6_5skl83
[2020-09-30T04:57:29+0000] [MainThread] [I] [toil.leader] Issued job 'DeleteFileIDs' kind-DeleteFileIDs/instance-gdsfnxdc with job batch system ID: 68668 and cores: 1, disk: 50.0 G, and memory: 2.0 G
[2020-09-30T04:57:30+0000] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/node-29a916e8-19d1-496d-aa1e-eac0e7008ae7-39dc6c30-6ae0-46c7-810f-70275b6f1fdd/tmpyobxwq08/worker_log.txt
[2020-09-30T04:57:31+0000] [MainThread] [I] [toil.leader] Job ended: 'DeleteFileIDs' kind-DeleteFileIDs/instance-gdsfnxdc
[2020-09-30T04:57:31+0000] [MainThread] [I] [toil.leader] Issued job 'CollateBlasts2' kind-CollateBlasts2/instance-6_5skl83 with job batch system ID: 68669 and cores: 1, disk: 60.1 G, and memory: 2.0 G
[2020-09-30T04:57:31+0000] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/node-29a916e8-19d1-496d-aa1e-eac0e7008ae7-39dc6c30-6ae0-46c7-810f-70275b6f1fdd/tmp3ccua_62/worker_log.txt
[2020-09-30T04:57:32+0000] [MainThread] [I] [toil.leader] Job ended: 'CollateBlasts2' kind-CollateBlasts2/instance-6_5skl83
[2020-09-30T04:57:32+0000] [MainThread] [I] [toil.leader] Issued job 'CollateBlasts' kind-CollateBlasts/instance-cpgrt72g with job batch system ID: 68670 and cores: 1, disk: 50.0 G, and memory: 2.0 G
[2020-09-30T04:57:33+0000] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/node-29a916e8-19d1-496d-aa1e-eac0e7008ae7-39dc6c30-6ae0-46c7-810f-70275b6f1fdd/tmpnrg4mut2/worker_log.txt
[2020-09-30T04:57:33+0000] [MainThread] [I] [toil.leader] Job ended: 'CollateBlasts' kind-CollateBlasts/instance-cpgrt72g
[2020-09-30T04:57:33+0000] [MainThread] [I] [toil.leader] Issued job 'BlastSequencesAllAgainstAll' kind-BlastSequencesAllAgainstAll/instance-si7moji9 with job batch system ID: 68671 and cores: 1, disk: 34.1 G, and memory: 2.0 G
[2020-09-30T04:57:33+0000] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/node-29a916e8-19d1-496d-aa1e-eac0e7008ae7-39dc6c30-6ae0-46c7-810f-70275b6f1fdd/tmpavoudaey/worker_log.txt
[2020-09-30T04:57:37+0000] [MainThread] [I] [toil.leader] Job ended: 'BlastSequencesAllAgainstAll' kind-BlastSequencesAllAgainstAll/instance-si7moji9
[2020-09-30T04:57:46+0000] [MainThread] [I] [toil.leader] Finished toil run with 5 failed jobs.
[2020-09-30T04:57:46+0000] [MainThread] [I] [toil.leader] Failed jobs at end of the run: 'CactusTrimmingBlastPhase' kind-CactusTrimmingBlastPhase/instance-kqoa1o4h 'EncapsulatedJob' kind-EncapsulatedJob/instance-1ywo8m5b 'BlastSequencesAgainstEachOther' kind-BlastSequencesAgainstEachOther/instance-_a699a_w 'BlastIngroupsAndOutgroups' kind-BlastIngroupsAndOutgroups/instance-zms0homy 'BlastFirstOutgroup' kind-BlastFirstOutgroup/instance-lzubjx_m
{'HOR21599_pseudomolecule_v1': '/data/pangenome_cactus/steps-output/HOR21599_pseudomolecule_v1.fasta.masked', 'B1K-04-12_pseudomolecules_v1': '/data/pangenome_cactus/steps-output/B1K-04-12_pseudomolecules_v1.fasta.masked', 'RGT_Planet_pseudomolecule_v1': '/data/pangenome_cactus/steps-output/RGT_Planet_pseudomolecule_v1.fasta.masked', 'Barke_pseudomolecules_v1': '/data/pangenome_cactus/steps-output/Barke_pseudomolecules_v1.fasta.masked', 'Anc17': '/data/pangenome_cactus/steps-output/Anc17.fa', 'Golden_Promise_pseudomolecule_v1': '/data/pangenome_cactus/steps-output/Golden_Promise_pseudomolecule_v1.fasta.masked', 'Anc15': '/data/pangenome_cactus/steps-output/Anc15.fa', 'Hockett_pseudomolecules_v1': '/data/pangenome_cactus/steps-output/Hockett_pseudomolecules_v1.fasta.masked', 'Anc12': '/data/pangenome_cactus/steps-output/Anc12.fa', 'HOR8148_pseudomolecules_v1': '/data/pangenome_cactus/steps-output/HOR8148_pseudomolecules_v1.fasta.masked', 'HOR13821_pseudomolecule_v1': '/data/pangenome_cactus/steps-output/HOR13821_pseudomolecule_v1.fasta.masked', 'Anc13': '/data/pangenome_cactus/steps-output/Anc13.fa', 'Anc10': '/data/pangenome_cactus/steps-output/Anc10.fa', 'Morex_pseudomolecules_v2': '/data/pangenome_cactus/steps-output/Morex_pseudomolecules_v2.fasta.masked', 'Igri_pseudomolecule_v1': '/data/pangenome_cactus/steps-output/Igri_pseudomolecule_v1.fasta.masked', 'HOR3081_pseudomolecule_v1': '/data/pangenome_cactus/steps-output/HOR3081_pseudomolecule_v1.fasta.masked', 'Anc16': '/data/pangenome_cactus/steps-output/Anc16.fa', 'HOR13942_pseudomolecule_v1': '/data/pangenome_cactus/steps-output/HOR13942_pseudomolecule_v1.fasta.masked', 'Anc14': '/data/pangenome_cactus/steps-output/Anc14.fa', 'Anc11': '/data/pangenome_cactus/steps-output/Anc11.fa', 'Anc07': '/data/pangenome_cactus/steps-output/Anc07.fa', 'HOR3365_pseudomolecule_v1': '/data/pangenome_cactus/steps-output/HOR3365_pseudomolecule_v1.fasta.masked', 'Anc04': '/data/pangenome_cactus/steps-output/Anc04.fa', 'HOR9043_pseudomolecule_v1.1': '/data/pangenome_cactus/steps-output/HOR9043_pseudomolecule_v1.1.fasta.masked', 'HOR10350_pseudomolecules_v1': '/data/pangenome_cactus/steps-output/HOR10350_pseudomolecules_v1.fasta.masked', 'Anc05': '/data/pangenome_cactus/steps-output/Anc05.fa', 'Anc02': '/data/pangenome_cactus/steps-output/Anc02.fa', 'ZDM02064_pseudomolecule_v1': '/data/pangenome_cactus/steps-output/ZDM02064_pseudomolecule_v1.fasta.masked', 'Akashinriki_pseudomolecule_v1': '/data/pangenome_cactus/steps-output/Akashinriki_pseudomolecule_v1.fasta.masked', 'Anc08': '/data/pangenome_cactus/steps-output/Anc08.fa', 'ZDM01467_pseudomolecule_v1': '/data/pangenome_cactus/steps-output/ZDM01467_pseudomolecule_v1.fasta.masked', 'HOR7552_pseudomolecule_v1': '/data/pangenome_cactus/steps-output/HOR7552_pseudomolecule_v1.fasta.masked', 'Anc09': '/data/pangenome_cactus/steps-output/Anc09.fa', 'Anc06': '/data/pangenome_cactus/steps-output/Anc06.fa', 'OUN333_pseudomolecule_v1': '/data/pangenome_cactus/steps-output/OUN333_pseudomolecule_v1.fasta.masked', 'Anc03': '/data/pangenome_cactus/steps-output/Anc03.fa', 'Anc00': '/data/pangenome_cactus/steps-output/Anc00.fa'}
('setting this', 'HOR21599_pseudomolecule_v1', 'files/no-job/file-299h2ksw/HOR21599_pseudomolecule_v1.fasta.masked')
('setting this', 'B1K-04-12_pseudomolecules_v1', 'files/no-job/file-gh48cw95/B1K-04-12_pseudomolecules_v1.fasta.masked')
('setting this', 'HOR13821_pseudomolecule_v1', 'files/no-job/file-1zab776t/HOR13821_pseudomolecule_v1.fasta.masked')
('setting this', 'HOR9043_pseudomolecule_v1.1', 'files/no-job/file-z3zm7vom/HOR9043_pseudomolecule_v1.1.fasta.masked')
('setting this', 'HOR10350_pseudomolecules_v1', 'files/no-job/file-pwg3fkjg/HOR10350_pseudomolecules_v1.fasta.masked')
Traceback (most recent call last):
  File "/usr/local/bin/cactus-blast", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/cactus/blast/cactus_blast.py", line 88, in main
    runCactusBlastOnly(options)
  File "/usr/local/lib/python3.6/dist-packages/cactus/blast/cactus_blast.py", line 185, in runCactusBlastOnly
    outWorkFlowArgs = toil.start(CactusTrimmingBlastPhase(standAlone=True, cactusWorkflowArguments=workFlowArgs, phaseName="trimBlast"))
  File "/usr/local/lib/python3.6/dist-packages/toil/common.py", line 846, in start
    return self._runMainLoop(rootJobGraph)
  File "/usr/local/lib/python3.6/dist-packages/toil/common.py", line 1138, in _runMainLoop
    jobCache=self._jobCache).run()
  File "/usr/local/lib/python3.6/dist-packages/toil/leader.py", line 269, in run
    raise FailedJobsException(self.config.jobStore, self.toilState.totalFailedJobs, self.jobStore)
toil.leader.FailedJobsException: The job store 'file:/cactus/jobStore' contains 5 failed jobs: 'CactusTrimmingBlastPhase' kind-CactusTrimmingBlastPhase/instance-kqoa1o4h, 'EncapsulatedJob' kind-EncapsulatedJob/instance-1ywo8m5b, 'BlastSequencesAgainstEachOther' kind-BlastSequencesAgainstEachOther/instance-_a699a_w, 'BlastIngroupsAndOutgroups' kind-BlastIngroupsAndOutgroups/instance-zms0homy, 'BlastFirstOutgroup' kind-BlastFirstOutgroup/instance-lzubjx_m
srun: error: node-18: task 0: Exited with exit code 1

Thank you for any help you can provide.

glennhickey commented 3 years ago

I don't see the errors in the snippet you posted; they must be further up the log. If it's memory-related (which isn't unusual when running a million blast jobs), you can try resuming with the --restart option, which will attempt the failed jobs again.
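
For reference, a minimal sketch of a resume, reusing the command from the top of this issue with --restart appended (all other paths and flags as in the original run):

# resume the failed run: same job store and arguments as before, plus --restart
cactus-blast /cactus/jobStore \
    /data/pangenome_cactus/steps-output/pangenome_guidetree.txt \
    /data/pangenome_cactus/steps-output/Anc01.cigar \
    --root Anc01 --workDir=/cactus/workDir \
    --defaultDisk 50G --binariesMode local \
    --restart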

brettChapman commented 3 years ago

Thanks. I'll try using the --restart option. Considering the number of blast jobs running, is it possible to split them up instead? The reason I ran cactus-prepare first was to split the whole problem into smaller jobs and get around my resource overhead issues, as I'm only running on a single node with 64GB RAM. Could I run cactus-blast with a trimmed-down version of the guide tree, or with only a few genome sequences, or does this step need to include the entire guide tree and all sequences? I have multiple cactus-blast and cactus-align steps after this initial cactus-blast, each outputting a different Anc0*.cigar, Anc0*.fa and Anc0*.hal file, before appending them at the end.

I found the snippets in the log file that indicate the failed jobs. It looks like the default memory is not large enough. Would increasing the default memory resolve the problem? I could try increasing the default disk and memory and run the job again to see if it completes.
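
Something along these lines, with the Toil --defaultMemory/--defaultDisk flags raised (the 50G/100G values are only illustrative and would need tuning to the 64GB node):

# same command as before, but with explicit, larger per-job defaults
cactus-blast /cactus/jobStore \
    /data/pangenome_cactus/steps-output/pangenome_guidetree.txt \
    /data/pangenome_cactus/steps-output/Anc01.cigar \
    --root Anc01 --workDir=/cactus/workDir \
    --defaultMemory 50G --defaultDisk 100G \
    --binariesMode local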

Error block 1:

Traceback (most recent call last):
  File "/usr/local/bin/_toil_worker", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/toil/worker.py", line 590, in main
    sys.exit(workerScript(jobStore, config, jobName, jobStoreID))
  File "/usr/local/lib/python3.6/dist-packages/toil/worker.py", line 516, in workerScript
    jobGraph.logJobStoreFileID = jobStore.getEmptyFileStoreID(jobGraph.jobStoreID, cleanup=True)
  File "/usr/local/lib/python3.6/dist-packages/toil/jobStores/fileJobStore.py", line 401, in getEmptyFileStoreID
    with self.writeFileStream(jobStoreID, cleanup) as (fileHandle, jobStoreFileID):
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/toil/jobStores/fileJobStore.py", line 392, in writeFileStream
    absPath = self._getUniqueFilePath('stream', jobStoreID, cleanup)
  File "/usr/local/lib/python3.6/dist-packages/toil/jobStores/fileJobStore.py", line 871, in _getUniqueFilePath
    directory = self._getFileDirectory(jobStoreID, cleanup)
  File "/usr/local/lib/python3.6/dist-packages/toil/jobStores/fileJobStore.py", line 905, in _getFileDirectory
    return tempfile.mkdtemp(prefix='file-', dir=jobFilesDir)
  File "/usr/lib/python3.6/tempfile.py", line 509, in mkdtemp
    _os.mkdir(file, 0o700)
OSError: [Errno 31] Too many links: '/cactus/jobStore/files/for-job/kind-BlastSequencesAgainstEachOther/instance-_a699a_w/cleanup/file-357synwe'
[2020-09-29T07:08:08+0000] [MainThread] [I] [toil.leader] Job ended: 'RunBlast' kind-RunBlast/e/instance-5mek1t68
[2020-09-29T07:08:08+0000] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/node-29a916e8-19d1-496d-aa1e-eac0e7008ae7-39dc6c30-6ae0-46c7-810f-70275b6f1fdd/tmpkwm6wbo4/worker_log.txt
[2020-09-29T07:08:09+0000] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/node-29a916e8-19d1-496d-aa1e-eac0e7008ae7-39dc6c30-6ae0-46c7-810f-70275b6f1fdd/tmpne072ldx/worker_log.txt
[2020-09-29T07:08:09+0000] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/node-29a916e8-19d1-496d-aa1e-eac0e7008ae7-39dc6c30-6ae0-46c7-810f-70275b6f1fdd/tmpamrbkb8m/worker_log.txt
[2020-09-29T07:08:09+0000] [Thread-1  ] [E] [toil.batchSystems.singleMachine] Got exit code 1 (indicating failure) from job _toil_worker BlastSequencesAgainstEachOther file:/cactus/jobStore kind-BlastSequencesAgainstEachOther/instance-_a699a_w.
[2020-09-29T07:08:09+0000] [MainThread] [W] [toil.leader] Job failed with exit value 1: 'BlastSequencesAgainstEachOther' kind-BlastSequencesAgainstEachOther/instance-_a699a_w
[2020-09-29T07:08:09+0000] [MainThread] [W] [toil.leader] No log file is present, despite job failing: 'BlastSequencesAgainstEachOther' kind-BlastSequencesAgainstEachOther/instance-_a699a_w
[2020-09-29T07:08:13+0000] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/node-29a916e8-19d1-496d-aa1e-eac0e7008ae7-39dc6c30-6ae0-46c7-810f-70275b6f1fdd/tmpcywkgfqc/worker_log.txt
[2020-09-29T07:08:13+0000] [MainThread] [W] [toil.jobGraph] Due to failure we are reducing the remaining retry count of job 'BlastSequencesAgainstEachOther' kind-BlastSequencesAgainstEachOther/instance-_a699a_w with ID kind-BlastSequencesAgainstEachOther/instance-_a699a_w to 1
[2020-09-29T07:08:13+0000] [MainThread] [W] [toil.jobGraph] We have increased the default memory of the failed job 'BlastSequencesAgainstEachOther' kind-BlastSequencesAgainstEachOther/instance-_a699a_w to 2147483648 bytes
[2020-09-29T07:08:13+0000] [MainThread] [W] [toil.jobGraph] We have increased the disk of the failed job 'BlastSequencesAgainstEachOther' kind-BlastSequencesAgainstEachOther/instance-_a699a_w to the default of 53687091200 bytes
[2020-09-29T07:08:13+0000] [MainThread] [I] [toil.leader] Issued job 'BlastSequencesAgainstEachOther' kind-BlastSequencesAgainstEachOther/instance-_a699a_w with job batch system ID: 2120 and cores: 1, disk: 50.0 G, and memory: 2.0 G

Error block 2:

[2020-09-29T07:59:27+0000] [Thread-1  ] [E] [toil.batchSystems.singleMachine] Got exit code 1 (indicating failure) from job _toil_worker BlastSequencesAgainstEachOther file:/cactus/jobStore kind-BlastSequencesAgainstEachOther/instance-_a699a_w.
[2020-09-29T07:59:27+0000] [MainThread] [W] [toil.leader] Job failed with exit value 1: 'BlastSequencesAgainstEachOther' kind-BlastSequencesAgainstEachOther/instance-_a699a_w
[2020-09-29T07:59:27+0000] [MainThread] [W] [toil.leader] No log file is present, despite job failing: 'BlastSequencesAgainstEachOther' kind-BlastSequencesAgainstEachOther/instance-_a699a_w
[2020-09-29T07:59:27+0000] [MainThread] [W] [toil.jobGraph] Due to failure we are reducing the remaining retry count of job 'BlastSequencesAgainstEachOther' kind-BlastSequencesAgainstEachOther/instance-_a699a_w with ID kind-BlastSequencesAgainstEachOther/instance-_a699a_w to 0
[2020-09-29T07:59:27+0000] [MainThread] [W] [toil.leader] Job 'BlastSequencesAgainstEachOther' kind-BlastSequencesAgainstEachOther/instance-_a699a_w with ID kind-BlastSequencesAgainstEachOther/instance-_a699a_w is completely failed
glennhickey commented 3 years ago

I think the important part of that log is

File "/usr/lib/python3.6/tempfile.py", line 509, in mkdtemp
    _os.mkdir(file, 0o700)
OSError: [Errno 31] Too many links: '/cactus/jobStore/files/for-job/kind-BlastSequencesAgainstEachOther/instance-_a699a_w/cleanup/file-357synwe'

which I interpret to mean that cactus and toil are creating too many files for your OS's liking. I don't know if this is something related to the file system you're using, or if it's a limit associated with your account -- I've personally never run into it even on large cactus-blast runs.
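
For what it's worth, Errno 31 (EMLINK) from mkdir usually means the parent directory has hit the filesystem's limit on the number of subdirectories (on ext3/ext4 that limit is in the tens of thousands). A quick way to check, using the path from your traceback (run it wherever /cactus is actually mounted, e.g. inside the container):

# count how many entries already exist in the directory mkdir failed in
ls -1 /cactus/jobStore/files/for-job/kind-BlastSequencesAgainstEachOther/instance-_a699a_w/cleanup | wc -l
# see what filesystem the job store lives on and its parameters
df -T /cactus/jobStore
stat -f /cactus/jobStore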

One idea for a workaround is to copy the config file (cactus_progressive_config.xml), increase the chunk size in it, and pass it in with --configFile. This will reduce the number of blast jobs, and with it the number of files Toil uses to keep track of them. It will come at the cost of more memory needed per job, so increasing --defaultMemory should help. Reducing max_num_outgroups in that same config file will also reduce the number of jobs.
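
As a rough sketch of that workaround (the config path inside the image and the exact attribute names are from memory and should be verified against your copy of the file; the new values are only illustrative, e.g. doubling the chunk size):

# make an editable copy of the stock config (locate it first if this path is wrong,
# e.g. with: find / -name cactus_progressive_config.xml 2>/dev/null)
cp /usr/local/lib/python3.6/dist-packages/cactus/cactus_progressive_config.xml \
   /data/pangenome_cactus/cactus_progressive_config.xml
# double the blast chunk size and drop max_num_outgroups from 3 to 2
# (check the attribute names and current defaults in the file before editing)
sed -i -e 's/chunkSize="25000000"/chunkSize="50000000"/' \
       -e 's/max_num_outgroups="3"/max_num_outgroups="2"/' \
       /data/pangenome_cactus/cactus_progressive_config.xml
# then pass the edited copy to cactus-blast with --configFile and bump --defaultMemory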

Another option (though probably less useful in your case) is to switch to an AWS S3 job store.

brettChapman commented 3 years ago

Thanks. I'll try using the config file with a different chunk size and max_num_outgroups. I'll let you know in a few days if it's resolved the problem.

I'm running on a cloud instance of Ubuntu 20.04 LTS, hosted by Pawsey (https://nimbus.pawsey.org.au/). I have Slurm configured across 20 nodes. I'm running Cactus on only one of the nodes within a Singularity container. Unfortunately we don't have the money or resources to run our jobs on AWS instances.

brettChapman commented 3 years ago

I've managed to run cactus-blast with the config file (chunk size set to 50000000 instead of 25000000) and max_num_outgroups set to 2 instead of 3. The job seemed to be running ok for a few days, much longer than it initially did.

For some reason Slurm restarted the job (my sbatch script reruns it with --restart if it gets restarted; I find this only works when Slurm itself restarts the job, and it just fails with read-only file system errors if I manually restart using the same image file).

After Slurm restarted the job, it failed with the error below about a local variable 'outWorkFlowArgs' being referenced before assignment. It's probably related to the restart, either in cactus-blast itself or in Toil. I'll try rerunning from the beginning and hope it doesn't get restarted again. Are my choices of chunk size and max_num_outgroups appropriate, or should I change them?

Thanks.

 srun -n 1 singularity exec --cleanenv --no-home --overlay jobStore-2113.img --bind /data/scratch/cactus-2113/tmp:/tmp /data/cactus_builds/cactus_latest.sif cactus-blast /cactus/jobStore /data/pangenome_cactus/steps-output/pangenome_guidetree.txt /data/pangenome_cactus/steps-output/Anc01.cigar --restart --configFile /data/pangenome_cactus/cactus_progressive_config.xml --root Anc01 --workDir=/cactus/workDir --retryCount 20 --defaultMemory 50G --defaultDisk 100G --binariesMode local
[2020-10-07T04:05:54+0000] [MainThread] [C] [toil.jobStores.abstractJobStore] Repairing job: kind-CactusTrimmingBlastPhase/instance-urqnk6gv
[2020-10-07T04:05:54+0000] [MainThread] [C] [toil.jobStores.abstractJobStore] Repairing job: kind-EncapsulatedJob/instance-30qba8ap
[2020-10-07T04:06:17+0000] [MainThread] [I] [toil] Running Toil version 4.2.0-3aa1da130141039cb357efe36d7df9b9f6ae9b5b on host node-8.
[2020-10-07T04:06:17+0000] [MainThread] [I] [toil.leader] Issued job 'mappingQualityRescoring' kind-mappingQualityRescoring/instance-4ofktay0 with job batch system ID: 0 and cores: 1, disk: 100.0 G, and memory: 50.0 G
[2020-10-07T04:06:17+0000] [MainThread] [I] [toil.leader] Issued job 'updateExpWrapperForOutgroups' kind-updateExpWrapperForOutgroups/instance-1iox8_bn with job batch system ID: 1 and cores: 1, disk: 100.0 G, and memory: 50.0 G
[2020-10-07T04:06:18+0000] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/node-f0d68d4a-1c6b-4dc6-b35a-64ce702b8c10-0cfb37ca-24e9-48b5-b720-c14c5c841361/tmpi3ofzqjw/worker_log.txt
[2020-10-07T04:06:19+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 1 jobs are issued and waiting to run
[2020-10-07T05:06:20+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 1 jobs are issued and waiting to run
[2020-10-07T06:06:21+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 1 jobs are issued and waiting to run
[2020-10-07T07:06:22+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 1 jobs are issued and waiting to run
[2020-10-07T08:06:22+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 1 jobs are issued and waiting to run
[2020-10-07T09:06:22+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 1 jobs are issued and waiting to run
[2020-10-07T10:06:23+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 1 jobs are issued and waiting to run
[2020-10-07T11:06:23+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 1 jobs are issued and waiting to run
[2020-10-07T12:06:24+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 1 jobs are issued and waiting to run
[2020-10-07T13:06:24+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 1 jobs are issued and waiting to run
[2020-10-07T14:06:25+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 1 jobs are issued and waiting to run
[2020-10-07T15:06:25+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 1 jobs are issued and waiting to run
[2020-10-07T16:06:25+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 1 jobs are issued and waiting to run
[2020-10-07T17:06:26+0000] [MainThread] [I] [toil.leader] 1 jobs are running, 1 jobs are issued and waiting to run
[2020-10-07T18:00:29+0000] [Thread-3  ] [I] [toil.statsAndLogging] Got message from job at time 10-07-2020 18:00:29: Input cigar file has 98721860 lines
[2020-10-07T18:00:29+0000] [Thread-3  ] [I] [toil.statsAndLogging] Got message from job at time 10-07-2020 18:00:29: Filtered, non-overlapping primary cigar file has 137708381 lines
[2020-10-07T18:00:29+0000] [Thread-3  ] [I] [toil.statsAndLogging] Got message from job at time 10-07-2020 18:00:29: Filtered, non-overlapping secondary cigar file has 530385538 lines
[2020-10-07T18:00:29+0000] [Thread-3  ] [W] [toil.statsAndLogging] Got message from job at time 10-07-2020 18:00:29: Job used more disk than requested. Consider modifying the user script to avoid the chance of failure due to incorrectly requested resources. Job files/for-job/kind-CactusTrimmingBlastPhase/instance-urqnk6gv/cleanup/file-_y4h096w/stream used 118.98% (119.0 GB [127756722176B] used, 100.0 GB [107374182400B] requested) at the end of its run.
[2020-10-07T18:00:44+0000] [MainThread] [I] [toil.leader] Job ended: 'mappingQualityRescoring' kind-mappingQualityRescoring/instance-4ofktay0
[2020-10-07T18:00:50+0000] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/node-f0d68d4a-1c6b-4dc6-b35a-64ce702b8c10-0cfb37ca-24e9-48b5-b720-c14c5c841361/tmp3aydvfxt/worker_log.txt
[2020-10-07T18:00:57+0000] [MainThread] [I] [toil.leader] Job ended: 'updateExpWrapperForOutgroups' kind-updateExpWrapperForOutgroups/instance-1iox8_bn
[2020-10-07T18:00:57+0000] [MainThread] [I] [toil.leader] Issued job 'Job' kind-Job/instance-t7d52rr4 with job batch system ID: 2 and cores: 0, disk: 100.0 M, and memory: 512.0 M
[2020-10-07T18:00:58+0000] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/node-f0d68d4a-1c6b-4dc6-b35a-64ce702b8c10-0cfb37ca-24e9-48b5-b720-c14c5c841361/tmp17z_loex/worker_log.txt
[2020-10-07T18:01:10+0000] [MainThread] [I] [toil.leader] Job ended: 'Job' kind-Job/instance-t7d52rr4
[2020-10-07T18:01:10+0000] [MainThread] [I] [toil.leader] Issued job 'EncapsulatedJob' kind-EncapsulatedJob/instance-30qba8ap with job batch system ID: 3 and cores: 1, disk: 100.0 G, and memory: 2.0 G
[2020-10-07T18:01:11+0000] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/node-f0d68d4a-1c6b-4dc6-b35a-64ce702b8c10-0cfb37ca-24e9-48b5-b720-c14c5c841361/tmpy9zmku__/worker_log.txt
[2020-10-07T18:01:12+0000] [MainThread] [I] [toil.leader] Job ended: 'EncapsulatedJob' kind-EncapsulatedJob/instance-30qba8ap
[2020-10-07T18:01:12+0000] [MainThread] [I] [toil.leader] Issued job 'CactusTrimmingBlastPhase' kind-CactusTrimmingBlastPhase/instance-urqnk6gv with job batch system ID: 4 and cores: 1, disk: 100.0 G, and memory: 3.3 G
[2020-10-07T18:01:14+0000] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/node-f0d68d4a-1c6b-4dc6-b35a-64ce702b8c10-0cfb37ca-24e9-48b5-b720-c14c5c841361/tmp_zrvrx1c/worker_log.txt
[2020-10-07T18:02:12+0000] [MainThread] [I] [toil.leader] Job ended: 'CactusTrimmingBlastPhase' kind-CactusTrimmingBlastPhase/instance-urqnk6gv
[2020-10-07T18:02:17+0000] [MainThread] [I] [toil.leader] Finished toil run successfully.
Traceback (most recent call last):
  File "/usr/local/bin/cactus-blast", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/cactus/blast/cactus_blast.py", line 88, in main
    runCactusBlastOnly(options)
  File "/usr/local/lib/python3.6/dist-packages/cactus/blast/cactus_blast.py", line 188, in runCactusBlastOnly
    toil.exportFile(outWorkFlowArgs.alignmentsID, makeURL(options.outputFile))
UnboundLocalError: local variable 'outWorkFlowArgs' referenced before assignment
srun: error: node-8: task 0: Exited with exit code 1
glennhickey commented 3 years ago

Yes, this definitely looks like a bug in cactus-blast --restart. Thanks for reporting it! I'll take a look when I can.
