Open brettChapman opened 8 months ago
I'm guessing this is vg index -j running out of memory?
Possibly. It appeared to happen around the merging step, where the VG and HAL files are merged.
I'm looking into whether I can get the sys admins to add swap space or increase the RAM.
If you have a log, it should be possible to pinpoint. Some suggestions:

export CACTUS_LOG_MEMORY=1

will add memory usage to your logs, which can be very handy (this really should be on by default). Set --indexMemory to whatever your system limit is; this will make sure that no big jobs are run in parallel. You can also drop the --giraffe or --haplo options (if it's indeed crashing during the indexing). This will give you all the other outputs, and then you can try indexing by hand to maybe keep better track of your processes.
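For illustration, those suggestions might look something like this on the command line (a sketch only: the job store, input .vg paths, output names, and the 2000G cap are placeholders, not the actual command from this thread):

```bash
# Log the peak memory of every command Cactus runs.
export CACTUS_LOG_MEMORY=1

# --indexMemory keeps the big index jobs from being scheduled in parallel;
# leaving out --giraffe/--haplo lets the join finish, so those indexes can
# be built by hand afterwards.
cactus-graphmap-join ./jobStore \
    --vg chr*.vg \
    --outDir ./barley-pg --outName barley-pg --reference Morex_V3 \
    --indexMemory 2000G
```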
Hi @glennhickey, I gave it another shot with those settings (except I left in --giraffe and --haplo), and I still get an error:
[2024-03-13T12:00:24+0800] [MainThread] [I] [toil.leader] 0 jobs are running, 0 jobs are issued and waiting to run
[2024-03-13T12:00:24+0800] [MainThread] [I] [toil.leader] Issued job 'Job' kind-Job/instance-xfebyj3y v1 with job batch system ID: 2 and disk: 2.7 Ti, memory: 1.8 Ti, cores: 1, accelerators: [], preemptible: False
[2024-03-13T12:00:24+0800] [MainThread] [I] [toil.leader] Issued job 'merge_hal' kind-merge_hal/instance-6l_6gqbb v1 with job batch system ID: 3 and disk: 235.3 Gi, memory: 37.1 Gi, cores: 1, accelerators: [], preemptibl
[2024-03-13T12:00:24+0800] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/5d97721bbaf95e819e135c3c157f2c89/6e73/worker_log.txt
[2024-03-13T12:00:25+0800] [MainThread] [I] [toil.leader] Issued job 'clip_vg' kind-clip_vg/instance-127if45e v1 with job batch system ID: 4 and disk: 1.1 Ti, memory: 1.1 Ti, cores: 1, accelerators: [], preemptible: Fals
[2024-03-13T12:00:25+0800] [MainThread] [I] [toil.leader] Issued job 'clip_vg' kind-clip_vg/instance-_64s88pm v1 with job batch system ID: 5 and disk: 1.0 Ti, memory: 1.0 Ti, cores: 1, accelerators: [], preemptible: Fals
[2024-03-13T12:00:25+0800] [MainThread] [I] [toil.leader] Issued job 'clip_vg' kind-clip_vg/instance-0e0fvqpz v1 with job batch system ID: 6 and disk: 1.0 Ti, memory: 1.0 Ti, cores: 1, accelerators: [], preemptible: Fals
[2024-03-13T12:00:25+0800] [MainThread] [I] [toil.leader] Issued job 'clip_vg' kind-clip_vg/instance-vtw9ntdn v1 with job batch system ID: 7 and disk: 1.2 Ti, memory: 1.2 Ti, cores: 1, accelerators: [], preemptible: Fals
[2024-03-13T12:00:25+0800] [MainThread] [I] [toil.leader] Issued job 'clip_vg' kind-clip_vg/instance-9qp8df68 v1 with job batch system ID: 8 and disk: 1.2 Ti, memory: 1.2 Ti, cores: 1, accelerators: [], preemptible: Fals
[2024-03-13T12:00:25+0800] [MainThread] [I] [toil.leader] Issued job 'clip_vg' kind-clip_vg/instance-jqktu2ez v1 with job batch system ID: 9 and disk: 893.3 Gi, memory: 893.3 Gi, cores: 1, accelerators: [], preemptible:
[2024-03-13T12:00:25+0800] [MainThread] [I] [toil.leader] Issued job 'clip_vg' kind-clip_vg/instance-utpl1pif v1 with job batch system ID: 10 and disk: 1.0 Ti, memory: 1.0 Ti, cores: 1, accelerators: [], preemptible: Fal
[2024-03-13T12:00:25+0800] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/5d97721bbaf95e819e135c3c157f2c89/0def/worker_log.txt
[2024-03-13T12:05:36+0800] [MainThread] [I] [toil-rt] 2024-03-13 12:05:36.995244: Running the command: "halMergeChroms Morex_V3_chr1H.hal,Morex_V3_chr2H.hal,Morex_V3_chr3H.hal,Morex_V3_chr4H.hal,Morex_V3_chr5H.hal,Morex_
[2024-03-13T12:05:51+0800] [Thread-1 (daddy)] [E] [toil.batchSystems.singleMachine] Got exit code 1 (indicating failure) from job _toil_worker merge_hal file:/cactus/jobStore kind-merge_hal/instance-6l_6gqbb.
[2024-03-13T12:05:51+0800] [MainThread] [W] [toil.leader] Job failed with exit value 1: 'merge_hal' kind-merge_hal/instance-6l_6gqbb v1
Exit reason: None
[2024-03-13T12:05:51+0800] [MainThread] [W] [toil.leader] The job seems to have left a log file, indicating failure: 'merge_hal' kind-merge_hal/instance-6l_6gqbb v2
[2024-03-13T12:05:51+0800] [MainThread] [W] [toil.leader] Log from job "kind-merge_hal/instance-6l_6gqbb" follows:
=========>
[2024-03-13T12:00:25+0800] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
[2024-03-13T12:00:25+0800] [MainThread] [I] [toil] Running Toil version 5.12.0-6d5a5b83b649cd8adf34a5cfe89e7690c95189d3 on host pnod1-21-1.
[2024-03-13T12:00:25+0800] [MainThread] [I] [toil.worker] Working on job 'merge_hal' kind-merge_hal/instance-6l_6gqbb v1
[2024-03-13T12:00:25+0800] [MainThread] [I] [toil.worker] Loaded body Job('merge_hal' kind-merge_hal/instance-6l_6gqbb v1) from description 'merge_hal' kind-merge_hal/instance-6l_6gqbb v1
[2024-03-13T12:05:36+0800] [MainThread] [I] [cactus.shared.common] Running the command ['halMergeChroms', 'Morex_V3_chr1H.hal,Morex_V3_chr2H.hal,Morex_V3_chr3H.hal,Morex_V3_chr4H.hal,Morex_V3_chr5H.hal,Morex_V3_c
[2024-03-13T12:05:36+0800] [MainThread] [I] [toil-rt] 2024-03-13 12:05:36.995244: Running the command: "halMergeChroms Morex_V3_chr1H.hal,Morex_V3_chr2H.hal,Morex_V3_chr3H.hal,Morex_V3_chr4H.hal,Morex_V3_chr5H.ha
[2024-03-13T12:05:39+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
[2024-03-13T12:05:39+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-a6b054c53ae14451943e0e629e446890/Morex_V3_chr1H.hal' to path '/cactus/workDir/5d97721bbaf95e819e1
[2024-03-13T12:05:39+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-48f6c9e3161a4316830f488551dddd66/Morex_V3_chr2H.hal' to path '/cactus/workDir/5d97721bbaf95e819e1
[2024-03-13T12:05:39+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-ca24c5ef50e8460fabf4a8a46bcfe67d/Morex_V3_chr3H.hal' to path '/cactus/workDir/5d97721bbaf95e819e1
[2024-03-13T12:05:39+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-18b0ee57688146e393c6ca0e3aee8583/Morex_V3_chr4H.hal' to path '/cactus/workDir/5d97721bbaf95e819e1
[2024-03-13T12:05:39+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-2728aaf4597746809b54c49fcf8011a0/Morex_V3_chr5H.hal' to path '/cactus/workDir/5d97721bbaf95e819e1
[2024-03-13T12:05:39+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-bf09596bb320440291325c87d16eef16/Morex_V3_chr6H.hal' to path '/cactus/workDir/5d97721bbaf95e819e1
[2024-03-13T12:05:39+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-1029f599b85b4f9d8c0107d858279570/Morex_V3_chr7H.hal' to path '/cactus/workDir/5d97721bbaf95e819e1
Traceback (most recent call last):
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/worker.py", line 403, in workerScript
job._runner(jobGraph=None, jobStore=jobStore, fileStore=fileStore, defer=defer)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 2774, in _runner
returnValues = self._run(jobGraph=None, fileStore=fileStore)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 2691, in _run
return self.run(fileStore)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 2919, in run
rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/cactus/refmap/cactus_graphmap_join.py", line 967, in merge_hal
cactus_call(parameters=cmd, work_dir = work_dir, job_memory=job.memory)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/cactus/shared/common.py", line 888, in cactus_call
raise RuntimeError("{}Command {} exited {}: {}".format(sigill_msg, call, process.returncode, out))
RuntimeError: Command /usr/bin/time -f "CACTUS-LOGGED-MEMORY-IN-KB: %M" halMergeChroms Morex_V3_chr1H.hal,Morex_V3_chr2H.hal,Morex_V3_chr3H.hal,Morex_V3_chr4H.hal,Morex_V3_chr5H.hal,Morex_V3_chr6H.hal,Morex_V3_ch
terminate called after throwing an instance of 'hal_exception'
what(): Duplicate sequence name found: _MINIGRAPH_.s39791
Command terminated by signal 6
CACTUS-LOGGED-MEMORY-IN-KB: 290620
The job still continues on after the error:
[2024-03-13T12:05:51+0800] [MainThread] [W] [toil.job] Due to failure we are reducing the remaining try count of job 'merge_hal' kind-merge_hal/instance-6l_6gqbb v2 with ID kind-merge_hal/instance-6l_6gqbb to 1
[2024-03-13T12:05:51+0800] [MainThread] [W] [toil.job] We have increased the default memory of the failed job 'merge_hal' kind-merge_hal/instance-6l_6gqbb v2 to 2005000000000 bytes
[2024-03-13T12:05:51+0800] [MainThread] [W] [toil.job] We have increased the disk of the failed job 'merge_hal' kind-merge_hal/instance-6l_6gqbb v2 to the default of 3000000000000 bytes
[2024-03-13T12:05:51+0800] [MainThread] [I] [toil.leader] Issued job 'merge_hal' kind-merge_hal/instance-6l_6gqbb v3 with job batch system ID: 11 and disk: 2.7 Ti, memory: 1.8 Ti, cores: 1, accelerators: [], preemptible:
[2024-03-13T12:05:51+0800] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/5d97721bbaf95e819e135c3c157f2c89/3163/worker_log.txt
[2024-03-13T12:07:47+0800] [MainThread] [I] [toil-rt] 2024-03-13 12:07:47.550961: Running the command: "vg convert -W -f /cactus/workDir/5d97721bbaf95e819e135c3c157f2c89/3163/d05b/tmp3b3s18os/Morex_V3_chr1H.vg"
[2024-03-13T12:19:22+0800] [MainThread] [I] [toil-rt] 2024-03-13 12:19:22.979812: Successfully ran: "vg convert -W -f /cactus/workDir/5d97721bbaf95e819e135c3c157f2c89/3163/d05b/tmp3b3s18os/Morex_V3_chr1H.vg" in 695.428 s
[2024-03-13T12:19:22+0800] [MainThread] [I] [toil-rt] 2024-03-13 12:19:22.980060: Running the command: "gfaffix /cactus/workDir/5d97721bbaf95e819e135c3c157f2c89/3163/d05b/tmp3b3s18os/Morex_V3_chr1H.vg.gfa --output_refine
[2024-03-13T13:00:25+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 7 jobs are issued and waiting to run
It looks like it hit the signal 6 error after trying to run the clip_vg job. I assume that is the step where clipping is applied for giraffe and haplo. If the job fails again, I'll remove those parameters. How can I go about generating the giraffe and haplo indexes by hand after the run?
This error doesn't have anything to do with memory:
what(): Duplicate sequence name found: _MINIGRAPH_.s39791
Command terminated by signal 6
It's because it's getting the same chromosome twice in the input. I'm pretty sure you've had this exact same problem before (though I'm too lazy to dig up the exact issue now).
It's surely because of the double wildcard in your command line, ex:
*H/barley-pg/*.hal
that is pulling multiple sets of chromosome graphs into graphmap join. This should be pretty clear at the beginning of the log where it prints out all the input chromosome graphs that it's loading.
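One quick way to check whether a glob like that is pulling in the same chromosome more than once is to expand it before handing it to cactus-graphmap-join; a small shell sketch, using the pattern quoted above:

```bash
# Expand the glob and flag any chromosome HAL whose basename appears more
# than once (duplicates are what trip up halMergeChroms).
ls *H/barley-pg/*.hal | xargs -n1 basename | sort | uniq -d
```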
Hi @glennhickey, thanks. I do remember that error. I moved the .vg and .hal files to a central location and have rerun them. Hopefully that resolves the issue, as well as the memory issue I run into later.
I still got the same error. I think it might be because I'm supplying both the VG and HAL files of the same graph. I'll try with the VG graph only and see if that resolves it.
Supplying only the VG files resolved the problem.
However I still run out of memory later:
[2024-03-15T01:02:35+0800] [MainThread] [I] [toil-rt] 2024-03-15 01:02:35.052138: Successfully ran: "vg validate /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/1354/92c7/tmpx8x6cptu/Morex_V3_chr4H.vg.cl
[2024-03-15T01:04:53+0800] [MainThread] [I] [toil.leader] Issued job 'vg_to_og' kind-vg_to_og/instance-xaoedjw8 v1 with job batch system ID: 9 and disk: 847.6 Gi, memory: 1.7 Ti, cores: 1, accelerators: [
[2024-03-15T01:04:54+0800] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/worker_log.txt
[2024-03-15T01:06:52+0800] [MainThread] [I] [toil-rt] 2024-03-15 01:06:52.916068: Running the command: "vg convert -W -f /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e8/tmpjffk3_x1/Morex_V3_chr
[2024-03-15T01:22:07+0800] [MainThread] [I] [toil-rt] 2024-03-15 01:22:07.600170: Successfully ran: "vg convert -W -f /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e8/tmpjffk3_x1/Morex_V3_chr7H.
[2024-03-15T01:22:07+0800] [MainThread] [I] [toil-rt] 2024-03-15 01:22:07.600381: Running the command: "gfaffix /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e8/tmpjffk3_x1/Morex_V3_chr7H.vg.gfa
[2024-03-15T01:30:30+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 6 jobs are issued and waiting to run
[2024-03-15T02:30:30+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 6 jobs are issued and waiting to run
[2024-03-15T03:30:30+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 6 jobs are issued and waiting to run
[2024-03-15T04:30:31+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 6 jobs are issued and waiting to run
[2024-03-15T05:30:31+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 6 jobs are issued and waiting to run
[2024-03-15T05:38:59+0800] [MainThread] [I] [toil-rt] 2024-03-15 05:38:59.113011: Successfully ran: "gfaffix /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e8/tmpjffk3_x1/Morex_V3_chr7H.vg.gfa --
[2024-03-15T05:38:59+0800] [MainThread] [I] [toil-rt] 2024-03-15 05:38:59.113144: Running the command: "head -1 /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e8/tmpjffk3_x1/Morex_V3_chr7H.vg.gfa
[2024-03-15T05:38:59+0800] [MainThread] [I] [toil-rt] 2024-03-15 05:38:59.117224: Successfully ran: "head -1 /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e8/tmpjffk3_x1/Morex_V3_chr7H.vg.gfa" i
[2024-03-15T05:38:59+0800] [MainThread] [I] [toil-rt] 2024-03-15 05:38:59.117316: Running the command: "sed -i /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e8/tmpjffk3_x1/Morex_V3_chr7H.vg.gfaf
[2024-03-15T05:41:40+0800] [MainThread] [I] [toil-rt] 2024-03-15 05:41:40.154552: Successfully ran: "sed -i /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e8/tmpjffk3_x1/Morex_V3_chr7H.vg.gfaffix
[2024-03-15T05:41:40+0800] [MainThread] [I] [toil-rt] 2024-03-15 05:41:40.154838: Running the command: "bash -c set -eo pipefail && vg convert -g -p /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e
[2024-03-15T06:30:31+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 6 jobs are issued and waiting to run
[2024-03-15T07:30:32+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 6 jobs are issued and waiting to run
[2024-03-15T08:30:32+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 6 jobs are issued and waiting to run
[2024-03-15T09:30:33+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 6 jobs are issued and waiting to run
[2024-03-15T10:04:30+0800] [MainThread] [I] [toil-rt] 2024-03-15 10:04:30.810634: Successfully ran: "bash -c 'set -eo pipefail && vg convert -g -p /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e
[2024-03-15T10:04:30+0800] [MainThread] [I] [toil-rt] 2024-03-15 10:04:30.810945: Running the command: "bash -c set -eo pipefail && clip-vg /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e8/tmpjf
[2024-03-15T10:30:33+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 6 jobs are issued and waiting to run
[2024-03-15T11:30:33+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 6 jobs are issued and waiting to run
[2024-03-15T11:58:39+0800] [MainThread] [I] [toil-rt] 2024-03-15 11:58:39.785792: Successfully ran: "bash -c 'set -eo pipefail && clip-vg /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e8/tmpjffk
[2024-03-15T11:58:39+0800] [MainThread] [I] [toil-rt] 2024-03-15 11:58:39.785961: Running the command: "vg validate /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e8/tmpjffk3_x1/Morex_V3_chr7H.vg
[2024-03-15T12:30:34+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 6 jobs are issued and waiting to run
[2024-03-15T13:10:51+0800] [MainThread] [I] [toil-rt] 2024-03-15 13:10:51.833904: Successfully ran: "vg validate /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e8/tmpjffk3_x1/Morex_V3_chr7H.vg.cl
[2024-03-15T13:14:39+0800] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/7711/worker_log.txt
[2024-03-15T13:14:40+0800] [MainThread] [I] [toil.realtimeLogger] Stopping real-time logging server.
[2024-03-15T13:14:40+0800] [MainThread] [I] [toil.realtimeLogger] Joining real-time logging server thread.
[2024-03-15T13:14:51+0800] [MainThread] [I] [toil.common] Successfully deleted the job store: FileJobStore(/cactus/jobStore)
Traceback (most recent call last):
File "/home/cactus/cactus_env/bin/cactus-graphmap-join", line 8, in <module>
sys.exit(main())
File "/home/cactus/cactus_env/lib/python3.10/site-packages/cactus/refmap/cactus_graphmap_join.py", line 102, in main
graphmap_join(options)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/cactus/refmap/cactus_graphmap_join.py", line 323, in graphmap_join
wf_output = toil.start(Job.wrapJobFn(graphmap_join_workflow, options, config, vg_ids, hal_ids))
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 1064, in start
return self._runMainLoop(rootJobDescription)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 1544, in _runMainLoop
jobCache=self._jobCache).run()
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 251, in run
self.innerLoop()
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 741, in innerLoop
self._processReadyJobs()
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 636, in _processReadyJobs
self._processReadyJob(message.job_id, message.result_status)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 552, in _processReadyJob
self._runJobSuccessors(job_id)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 442, in _runJobSuccessors
self.issueJobs(successors)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 919, in issueJobs
self.issueJob(job)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 896, in issueJob
jobBatchSystemID = self.batchSystem.issueBatchJob(jobNode, job_environment=job_environment)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/batchSystems/singleMachine.py", line 755, in issueBatchJob
self.check_resource_request(scaled_desc)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/batchSystems/singleMachine.py", line 506, in check_resource_request
raise e
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/batchSystems/singleMachine.py", line 502, in check_resource_request
super().check_resource_request(requirer)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/batchSystems/abstractBatchSystem.py", line 344, in check_resource_request
raise e
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/batchSystems/abstractBatchSystem.py", line 337, in check_resource_request
raise InsufficientSystemResources(requirer, resource, available)
toil.batchSystems.abstractBatchSystem.InsufficientSystemResources: The job 'vg_to_og' kind-vg_to_og/instance-j0i80ayb v1 is requesting 2079546619040 bytes of memory, more than the maximum of 2005000000000 bytes of memory that SingleMachineBatchSystem was configured with, or enforced by --maxMemory. Scale is set to 1.0.
Supplying only the VG files resolved the problem.
Right, the problem was a duplicate HAL file in the input.
toil.batchSystems.abstractBatchSystem.InsufficientSystemResources: The job 'vg_to_og' kind-vg_to_og/instance-j0i80ayb v1 is requesting 2079546619040 bytes of memory, more than the maximum of 2005000000000 bytes of memory that SingleMachineBatchSystem was configured with, or enforced by --maxMemory. Scale is set to 1.0.
Which Cactus version are you using? This type of error, where Cactus asks for more memory than you have, shouldn't happen in the latest version. In any case you should be able to resolve it by specifying --indexMemory to cap the amount of memory Cactus ever asks for. In very recent Cactus versions you may be able to get away with --restart --maxMemory 2000000000000 to fix this without restarting from scratch.
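As a rough sketch of such a restart (the job store, input paths, output names, and the 1800G index cap below are placeholders; the --maxMemory value is the 2 TB figure quoted above):

```bash
# Reuse the existing job store rather than recomputing from scratch, and cap
# memory requests below the node's 2 TB limit.
cactus-graphmap-join ./jobStore \
    --vg chr*.vg \
    --outDir ./barley-pg --outName barley-pg --reference Morex_V3 \
    --restart \
    --maxMemory 2000000000000 \
    --indexMemory 1800G
```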
@glennhickey I'm using a version from November. I'll try a newer version with those parameters and see how I go.
Trying with the latest Cactus I get this error:
File "/home/cactus/cactus_env/bin/cactus-graphmap-join", line 8, in <module>
sys.exit(main())
File "/home/cactus/cactus_env/lib/python3.10/site-packages/cactus/refmap/cactus_graphmap_join.py", line 62, in main
Job.Runner.addToilOptions(parser)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 2194, in addToilOptions
addOptions(parser, jobstore_as_flag=jobstore_as_flag)
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 645, in addOptions
check_and_create_default_config_file()
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 460, in check_and_create_default_config_file
check_and_create_toil_home_dir()
File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 445, in check_and_create_toil_home_dir
raise RuntimeError(f"Cannot create or access Toil configuration directory {TOIL_HOME_DIR}")
RuntimeError: Cannot create or access Toil configuration directory /home/murdoch_brettc/.toil
I think one of the default paths must have changed with Toil. Never got this error before. Can I specify a path for this?
That's a new one for me. Looking at the exception, perhaps setting TOIL_HOME_DIR can change it. @adamnovak any idea what's going on here?
TOIL_HOME_DIR is just a Toil constant, not an environment variable. It looks like Toil is getting a path for ~ that it can't actually use: I think HOME is set to /home/murdoch_brettc in an environment where it would really need to be /home/cactus.

Maybe the problem is the --no-home on that Singularity command? Or else not clearing out/properly setting HOME when creating that Singularity container?
I managed to get around the problem by removing --no-home and setting -H cactus/tmp/ as the home directory. The problem before was that it was trying to write to a subfolder which didn't have write permissions.
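For anyone hitting the same ~/.toil error, a sketch of the kind of Singularity invocation change involved (the image name and bind paths are hypothetical, not taken from this thread):

```bash
# With --no-home, HOME points at a directory that is not writable inside the
# container, so Toil cannot create ~/.toil. Mounting a writable directory as
# HOME instead avoids that.
singularity exec \
    -H /path/to/cactus/tmp \
    --bind /path/to/data:/data \
    cactus.sif \
    cactus-graphmap-join ...
```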
Hi,
I've run the final join step on a large 76-genome pangenome graph on a compute node with a maximum of 2 TB of RAM, and the job failed because memory use went beyond 2 TB, to around 2.1 TB. Is there a way to tweak the settings, perhaps use Toil to distribute the compute resources, or do I simply need to add more RAM or swap space?
My join command:
Thanks.