ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Failed join job due to memory request beyond maximum allowed #1306

Open brettChapman opened 8 months ago

brettChapman commented 8 months ago

Hi

I've run the final join step on a large 76-genome pangenome graph on a compute node with a maximum of 2TB of RAM, and the job failed because the memory request went beyond 2TB, to around 2.1TB. Is there a way to tweak the settings, perhaps use Toil to distribute the compute, or do I simply need to add more RAM or swap space?

My join command:

singularity exec --cleanenv \
                       --no-home \
                       --overlay ${JOBSTORE_IMAGE} \
                       --bind ${CACTUS_SCRATCH}/tmp:/tmp \
                       ${CACTUS_IMAGE} cactus-graphmap-join /cactus/jobStore \
                           --indexCores 1 \
                           --vg ${original_folder}/*H/barley-pg/*.vg \
                           --configFile ${CONFIG_FILE} \
                           --hal ${original_folder}/*H/barley-pg/*.hal \
                           --outDir ${original_folder}/barley-pg \
                           --outName barley-pg \
                           --reference ${REFERENCE} \
                           --haplo clip --giraffe clip \
                           --chrom-vg --chrom-og --gbz --gfa --viz --draw \
                           --disableCaching \
                           --workDir=/cactus/workDir \
                           --clean always --cleanWorkDir always \
                           --defaultDisk 3000G --maxDisk 3000G \
                           --maxCores 1 --maxMemory 2010G --defaultMemory 2010G

Thanks.

glennhickey commented 8 months ago

I'm guessing this is vg index -j running out of memory?

brettChapman commented 8 months ago

Possibly. It appeared to happen around the merging step, where the VG and HAL files are merged.

I'm looking into whether I can get the sys admins to add swap space or increase the RAM.

glennhickey commented 8 months ago

If you have a log, it should be possible to pinpoint it. Some suggestions:
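
For example, since cactus_call records a peak-memory line for every wrapped command, grepping the Toil log for those lines can narrow down which step is requesting the most memory (a minimal sketch; the log file name here is an assumption):

# peak memory (in KB) recorded for each wrapped command, smallest to largest
grep 'CACTUS-LOGGED-MEMORY-IN-KB' cactus-graphmap-join.log | sort -t: -k2 -n
# then look just above the largest values in the log to see which commands they came from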

brettChapman commented 8 months ago

Hi @glennhickey

I gave it another shot with those settings (except I left in giraffe and haplo), and I still get an error:

[2024-03-13T12:00:24+0800] [MainThread] [I] [toil.leader] 0 jobs are running, 0 jobs are issued and waiting to run
[2024-03-13T12:00:24+0800] [MainThread] [I] [toil.leader] Issued job 'Job' kind-Job/instance-xfebyj3y v1 with job batch system ID: 2 and disk: 2.7 Ti, memory: 1.8 Ti, cores: 1, accelerators: [], preemptible: False
[2024-03-13T12:00:24+0800] [MainThread] [I] [toil.leader] Issued job 'merge_hal' kind-merge_hal/instance-6l_6gqbb v1 with job batch system ID: 3 and disk: 235.3 Gi, memory: 37.1 Gi, cores: 1, accelerators: [], preemptibl
[2024-03-13T12:00:24+0800] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/5d97721bbaf95e819e135c3c157f2c89/6e73/worker_log.txt
[2024-03-13T12:00:25+0800] [MainThread] [I] [toil.leader] Issued job 'clip_vg' kind-clip_vg/instance-127if45e v1 with job batch system ID: 4 and disk: 1.1 Ti, memory: 1.1 Ti, cores: 1, accelerators: [], preemptible: Fals
[2024-03-13T12:00:25+0800] [MainThread] [I] [toil.leader] Issued job 'clip_vg' kind-clip_vg/instance-_64s88pm v1 with job batch system ID: 5 and disk: 1.0 Ti, memory: 1.0 Ti, cores: 1, accelerators: [], preemptible: Fals
[2024-03-13T12:00:25+0800] [MainThread] [I] [toil.leader] Issued job 'clip_vg' kind-clip_vg/instance-0e0fvqpz v1 with job batch system ID: 6 and disk: 1.0 Ti, memory: 1.0 Ti, cores: 1, accelerators: [], preemptible: Fals
[2024-03-13T12:00:25+0800] [MainThread] [I] [toil.leader] Issued job 'clip_vg' kind-clip_vg/instance-vtw9ntdn v1 with job batch system ID: 7 and disk: 1.2 Ti, memory: 1.2 Ti, cores: 1, accelerators: [], preemptible: Fals
[2024-03-13T12:00:25+0800] [MainThread] [I] [toil.leader] Issued job 'clip_vg' kind-clip_vg/instance-9qp8df68 v1 with job batch system ID: 8 and disk: 1.2 Ti, memory: 1.2 Ti, cores: 1, accelerators: [], preemptible: Fals
[2024-03-13T12:00:25+0800] [MainThread] [I] [toil.leader] Issued job 'clip_vg' kind-clip_vg/instance-jqktu2ez v1 with job batch system ID: 9 and disk: 893.3 Gi, memory: 893.3 Gi, cores: 1, accelerators: [], preemptible: 
[2024-03-13T12:00:25+0800] [MainThread] [I] [toil.leader] Issued job 'clip_vg' kind-clip_vg/instance-utpl1pif v1 with job batch system ID: 10 and disk: 1.0 Ti, memory: 1.0 Ti, cores: 1, accelerators: [], preemptible: Fal
[2024-03-13T12:00:25+0800] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/5d97721bbaf95e819e135c3c157f2c89/0def/worker_log.txt
[2024-03-13T12:05:36+0800] [MainThread] [I] [toil-rt] 2024-03-13 12:05:36.995244: Running the command: "halMergeChroms Morex_V3_chr1H.hal,Morex_V3_chr2H.hal,Morex_V3_chr3H.hal,Morex_V3_chr4H.hal,Morex_V3_chr5H.hal,Morex_
[2024-03-13T12:05:51+0800] [Thread-1 (daddy)] [E] [toil.batchSystems.singleMachine] Got exit code 1 (indicating failure) from job _toil_worker merge_hal file:/cactus/jobStore kind-merge_hal/instance-6l_6gqbb.
[2024-03-13T12:05:51+0800] [MainThread] [W] [toil.leader] Job failed with exit value 1: 'merge_hal' kind-merge_hal/instance-6l_6gqbb v1
Exit reason: None
[2024-03-13T12:05:51+0800] [MainThread] [W] [toil.leader] The job seems to have left a log file, indicating failure: 'merge_hal' kind-merge_hal/instance-6l_6gqbb v2
[2024-03-13T12:05:51+0800] [MainThread] [W] [toil.leader] Log from job "kind-merge_hal/instance-6l_6gqbb" follows:
=========>
        [2024-03-13T12:00:25+0800] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
        [2024-03-13T12:00:25+0800] [MainThread] [I] [toil] Running Toil version 5.12.0-6d5a5b83b649cd8adf34a5cfe89e7690c95189d3 on host pnod1-21-1.
        [2024-03-13T12:00:25+0800] [MainThread] [I] [toil.worker] Working on job 'merge_hal' kind-merge_hal/instance-6l_6gqbb v1
        [2024-03-13T12:00:25+0800] [MainThread] [I] [toil.worker] Loaded body Job('merge_hal' kind-merge_hal/instance-6l_6gqbb v1) from description 'merge_hal' kind-merge_hal/instance-6l_6gqbb v1
        [2024-03-13T12:05:36+0800] [MainThread] [I] [cactus.shared.common] Running the command ['halMergeChroms', 'Morex_V3_chr1H.hal,Morex_V3_chr2H.hal,Morex_V3_chr3H.hal,Morex_V3_chr4H.hal,Morex_V3_chr5H.hal,Morex_V3_c
        [2024-03-13T12:05:36+0800] [MainThread] [I] [toil-rt] 2024-03-13 12:05:36.995244: Running the command: "halMergeChroms Morex_V3_chr1H.hal,Morex_V3_chr2H.hal,Morex_V3_chr3H.hal,Morex_V3_chr4H.hal,Morex_V3_chr5H.ha
        [2024-03-13T12:05:39+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
        [2024-03-13T12:05:39+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-a6b054c53ae14451943e0e629e446890/Morex_V3_chr1H.hal' to path '/cactus/workDir/5d97721bbaf95e819e1
        [2024-03-13T12:05:39+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-48f6c9e3161a4316830f488551dddd66/Morex_V3_chr2H.hal' to path '/cactus/workDir/5d97721bbaf95e819e1
        [2024-03-13T12:05:39+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-ca24c5ef50e8460fabf4a8a46bcfe67d/Morex_V3_chr3H.hal' to path '/cactus/workDir/5d97721bbaf95e819e1
        [2024-03-13T12:05:39+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-18b0ee57688146e393c6ca0e3aee8583/Morex_V3_chr4H.hal' to path '/cactus/workDir/5d97721bbaf95e819e1
        [2024-03-13T12:05:39+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-2728aaf4597746809b54c49fcf8011a0/Morex_V3_chr5H.hal' to path '/cactus/workDir/5d97721bbaf95e819e1
        [2024-03-13T12:05:39+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-bf09596bb320440291325c87d16eef16/Morex_V3_chr6H.hal' to path '/cactus/workDir/5d97721bbaf95e819e1
        [2024-03-13T12:05:39+0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-1029f599b85b4f9d8c0107d858279570/Morex_V3_chr7H.hal' to path '/cactus/workDir/5d97721bbaf95e819e1
        Traceback (most recent call last):
          File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/worker.py", line 403, in workerScript
            job._runner(jobGraph=None, jobStore=jobStore, fileStore=fileStore, defer=defer)
          File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 2774, in _runner
            returnValues = self._run(jobGraph=None, fileStore=fileStore)
          File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 2691, in _run
            return self.run(fileStore)
          File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 2919, in run
            rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
          File "/home/cactus/cactus_env/lib/python3.10/site-packages/cactus/refmap/cactus_graphmap_join.py", line 967, in merge_hal
            cactus_call(parameters=cmd, work_dir = work_dir, job_memory=job.memory)
          File "/home/cactus/cactus_env/lib/python3.10/site-packages/cactus/shared/common.py", line 888, in cactus_call
            raise RuntimeError("{}Command {} exited {}: {}".format(sigill_msg, call, process.returncode, out))
        RuntimeError: Command /usr/bin/time -f "CACTUS-LOGGED-MEMORY-IN-KB: %M" halMergeChroms Morex_V3_chr1H.hal,Morex_V3_chr2H.hal,Morex_V3_chr3H.hal,Morex_V3_chr4H.hal,Morex_V3_chr5H.hal,Morex_V3_chr6H.hal,Morex_V3_ch
        terminate called after throwing an instance of 'hal_exception'
          what():  Duplicate sequence name found: _MINIGRAPH_.s39791
        Command terminated by signal 6
        CACTUS-LOGGED-MEMORY-IN-KB: 290620

The job still continues on after the error:

[2024-03-13T12:05:51+0800] [MainThread] [W] [toil.job] Due to failure we are reducing the remaining try count of job 'merge_hal' kind-merge_hal/instance-6l_6gqbb v2 with ID kind-merge_hal/instance-6l_6gqbb to 1
[2024-03-13T12:05:51+0800] [MainThread] [W] [toil.job] We have increased the default memory of the failed job 'merge_hal' kind-merge_hal/instance-6l_6gqbb v2 to 2005000000000 bytes
[2024-03-13T12:05:51+0800] [MainThread] [W] [toil.job] We have increased the disk of the failed job 'merge_hal' kind-merge_hal/instance-6l_6gqbb v2 to the default of 3000000000000 bytes
[2024-03-13T12:05:51+0800] [MainThread] [I] [toil.leader] Issued job 'merge_hal' kind-merge_hal/instance-6l_6gqbb v3 with job batch system ID: 11 and disk: 2.7 Ti, memory: 1.8 Ti, cores: 1, accelerators: [], preemptible:
[2024-03-13T12:05:51+0800] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/5d97721bbaf95e819e135c3c157f2c89/3163/worker_log.txt
[2024-03-13T12:07:47+0800] [MainThread] [I] [toil-rt] 2024-03-13 12:07:47.550961: Running the command: "vg convert -W -f /cactus/workDir/5d97721bbaf95e819e135c3c157f2c89/3163/d05b/tmp3b3s18os/Morex_V3_chr1H.vg"
[2024-03-13T12:19:22+0800] [MainThread] [I] [toil-rt] 2024-03-13 12:19:22.979812: Successfully ran: "vg convert -W -f /cactus/workDir/5d97721bbaf95e819e135c3c157f2c89/3163/d05b/tmp3b3s18os/Morex_V3_chr1H.vg" in 695.428 s
[2024-03-13T12:19:22+0800] [MainThread] [I] [toil-rt] 2024-03-13 12:19:22.980060: Running the command: "gfaffix /cactus/workDir/5d97721bbaf95e819e135c3c157f2c89/3163/d05b/tmp3b3s18os/Morex_V3_chr1H.vg.gfa --output_refine
[2024-03-13T13:00:25+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 7 jobs are issued and waiting to run

It looks like it hit the signal 6 error after trying to run the vg_clip job. I assume that would be the step where the clip is applied for giraffe and haplo. If the job fails again, I'll remove those parameters. How would I go about generating the giraffe and haplo indexes by hand after the run?

glennhickey commented 8 months ago

This error doesn't have anything to do with memory:

        what():  Duplicate sequence name found: _MINIGRAPH_.s39791
        Command terminated by signal 6

It's because it's getting the same chromosome twice in the input. I'm pretty sure you've had this exact same problem before (though I'm too lazy to dig up the exact issue now).

It's surely because of the double wildcard in your command line, e.g.:

*H/barley-pg/*.hal

that is pulling multiple sets of chromosome graphs into graphmap join. This should be pretty clear at the beginning of the log where it prints out all the input chromosome graphs that it's loading.
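
One quick way to confirm is to expand the same wildcards outside the container and look for chromosome graphs that show up more than once (a minimal sketch, assuming the same ${original_folder} as in the join command):

# list everything the wildcards will pass to cactus-graphmap-join
ls ${original_folder}/*H/barley-pg/*.vg ${original_folder}/*H/barley-pg/*.hal
# any basename printed here is being passed in more than once
ls ${original_folder}/*H/barley-pg/*.hal | xargs -n1 basename | sort | uniq -d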

brettChapman commented 8 months ago

Hi @glennhickey, thanks. I do remember that error. I moved the .vg and .hal files to a central location and have rerun the job. Hopefully that resolves this issue, as well as the memory issue I hit later.

brettChapman commented 8 months ago

I still got the same error. I think it might be because I'm supplying both the VG and HAL files of the same graph. I'll try with the VG graphs only and see if that resolves it.

brettChapman commented 8 months ago

Supplying only the VG files resolved the problem.

However, I still run out of memory later:

[2024-03-15T01:02:35+0800] [MainThread] [I] [toil-rt] 2024-03-15 01:02:35.052138: Successfully ran: "vg validate /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/1354/92c7/tmpx8x6cptu/Morex_V3_chr4H.vg.cl
[2024-03-15T01:04:53+0800] [MainThread] [I] [toil.leader] Issued job 'vg_to_og' kind-vg_to_og/instance-xaoedjw8 v1 with job batch system ID: 9 and disk: 847.6 Gi, memory: 1.7 Ti, cores: 1, accelerators: [
[2024-03-15T01:04:54+0800] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/worker_log.txt
[2024-03-15T01:06:52+0800] [MainThread] [I] [toil-rt] 2024-03-15 01:06:52.916068: Running the command: "vg convert -W -f /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e8/tmpjffk3_x1/Morex_V3_chr
[2024-03-15T01:22:07+0800] [MainThread] [I] [toil-rt] 2024-03-15 01:22:07.600170: Successfully ran: "vg convert -W -f /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e8/tmpjffk3_x1/Morex_V3_chr7H.
[2024-03-15T01:22:07+0800] [MainThread] [I] [toil-rt] 2024-03-15 01:22:07.600381: Running the command: "gfaffix /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e8/tmpjffk3_x1/Morex_V3_chr7H.vg.gfa
[2024-03-15T01:30:30+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 6 jobs are issued and waiting to run
[2024-03-15T02:30:30+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 6 jobs are issued and waiting to run
[2024-03-15T03:30:30+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 6 jobs are issued and waiting to run
[2024-03-15T04:30:31+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 6 jobs are issued and waiting to run
[2024-03-15T05:30:31+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 6 jobs are issued and waiting to run
[2024-03-15T05:38:59+0800] [MainThread] [I] [toil-rt] 2024-03-15 05:38:59.113011: Successfully ran: "gfaffix /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e8/tmpjffk3_x1/Morex_V3_chr7H.vg.gfa --
[2024-03-15T05:38:59+0800] [MainThread] [I] [toil-rt] 2024-03-15 05:38:59.113144: Running the command: "head -1 /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e8/tmpjffk3_x1/Morex_V3_chr7H.vg.gfa
[2024-03-15T05:38:59+0800] [MainThread] [I] [toil-rt] 2024-03-15 05:38:59.117224: Successfully ran: "head -1 /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e8/tmpjffk3_x1/Morex_V3_chr7H.vg.gfa" i
[2024-03-15T05:38:59+0800] [MainThread] [I] [toil-rt] 2024-03-15 05:38:59.117316: Running the command: "sed -i /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e8/tmpjffk3_x1/Morex_V3_chr7H.vg.gfaf
[2024-03-15T05:41:40+0800] [MainThread] [I] [toil-rt] 2024-03-15 05:41:40.154552: Successfully ran: "sed -i /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e8/tmpjffk3_x1/Morex_V3_chr7H.vg.gfaffix
[2024-03-15T05:41:40+0800] [MainThread] [I] [toil-rt] 2024-03-15 05:41:40.154838: Running the command: "bash -c set -eo pipefail && vg convert -g -p /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e
[2024-03-15T06:30:31+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 6 jobs are issued and waiting to run
[2024-03-15T07:30:32+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 6 jobs are issued and waiting to run
[2024-03-15T08:30:32+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 6 jobs are issued and waiting to run
[2024-03-15T09:30:33+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 6 jobs are issued and waiting to run
[2024-03-15T10:04:30+0800] [MainThread] [I] [toil-rt] 2024-03-15 10:04:30.810634: Successfully ran: "bash -c 'set -eo pipefail && vg convert -g -p /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e
[2024-03-15T10:04:30+0800] [MainThread] [I] [toil-rt] 2024-03-15 10:04:30.810945: Running the command: "bash -c set -eo pipefail && clip-vg /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e8/tmpjf
[2024-03-15T10:30:33+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 6 jobs are issued and waiting to run
[2024-03-15T11:30:33+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 6 jobs are issued and waiting to run
[2024-03-15T11:58:39+0800] [MainThread] [I] [toil-rt] 2024-03-15 11:58:39.785792: Successfully ran: "bash -c 'set -eo pipefail && clip-vg /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e8/tmpjffk
[2024-03-15T11:58:39+0800] [MainThread] [I] [toil-rt] 2024-03-15 11:58:39.785961: Running the command: "vg validate /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e8/tmpjffk3_x1/Morex_V3_chr7H.vg
[2024-03-15T12:30:34+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 6 jobs are issued and waiting to run
[2024-03-15T13:10:51+0800] [MainThread] [I] [toil-rt] 2024-03-15 13:10:51.833904: Successfully ran: "vg validate /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/6cc3/e6e8/tmpjffk3_x1/Morex_V3_chr7H.vg.cl
[2024-03-15T13:14:39+0800] [MainThread] [I] [toil.worker] Redirecting logging to /cactus/workDir/10203ebb8e385f76afaa07b3031c3b10/7711/worker_log.txt
[2024-03-15T13:14:40+0800] [MainThread] [I] [toil.realtimeLogger] Stopping real-time logging server.
[2024-03-15T13:14:40+0800] [MainThread] [I] [toil.realtimeLogger] Joining real-time logging server thread.
[2024-03-15T13:14:51+0800] [MainThread] [I] [toil.common] Successfully deleted the job store: FileJobStore(/cactus/jobStore)
Traceback (most recent call last):
  File "/home/cactus/cactus_env/bin/cactus-graphmap-join", line 8, in <module>
    sys.exit(main())
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/cactus/refmap/cactus_graphmap_join.py", line 102, in main
    graphmap_join(options)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/cactus/refmap/cactus_graphmap_join.py", line 323, in graphmap_join
    wf_output = toil.start(Job.wrapJobFn(graphmap_join_workflow, options, config, vg_ids, hal_ids))
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 1064, in start
    return self._runMainLoop(rootJobDescription)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 1544, in _runMainLoop
    jobCache=self._jobCache).run()
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 251, in run
    self.innerLoop()
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 741, in innerLoop
    self._processReadyJobs()
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 636, in _processReadyJobs
    self._processReadyJob(message.job_id, message.result_status)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 552, in _processReadyJob
    self._runJobSuccessors(job_id)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 442, in _runJobSuccessors
    self.issueJobs(successors)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 919, in issueJobs
    self.issueJob(job)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/leader.py", line 896, in issueJob
    jobBatchSystemID = self.batchSystem.issueBatchJob(jobNode, job_environment=job_environment)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/batchSystems/singleMachine.py", line 755, in issueBatchJob
    self.check_resource_request(scaled_desc)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/batchSystems/singleMachine.py", line 506, in check_resource_request
    raise e
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/batchSystems/singleMachine.py", line 502, in check_resource_request
    super().check_resource_request(requirer)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/batchSystems/abstractBatchSystem.py", line 344, in check_resource_request
    raise e
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/batchSystems/abstractBatchSystem.py", line 337, in check_resource_request
    raise InsufficientSystemResources(requirer, resource, available)
toil.batchSystems.abstractBatchSystem.InsufficientSystemResources: The job 'vg_to_og' kind-vg_to_og/instance-j0i80ayb v1 is requesting 2079546619040 bytes of memory, more than the maximum of 2005000000000 bytes of memory that SingleMachineBatchSystem was configured with, or enforced by --maxMemory. Scale is set to 1.0.

glennhickey commented 8 months ago

Supplying only the VG files resolved the problem.

Right, the problem was a duplicate HAL file in the input.

toil.batchSystems.abstractBatchSystem.InsufficientSystemResources: The job 'vg_to_og' kind-vg_to_og/instance-j0i80ayb v1 is requesting 2079546619040 bytes of memory, more than the maximum of 2005000000000 bytes of memory that SingleMachineBatchSystem was configured with, or enforced by --maxMemory. Scale is set to 1.0.

Which Cactus version are you using? This type of error, where Cactus asks for more memory than you have, shouldn't happen in the latest version. In any case, you should be able to resolve it by specifying --indexMemory to cap the amount of memory Cactus ever requests at the given value. In very recent Cactus versions you may be able to get away with --restart --maxMemory 2000000000000 to fix this without restarting from scratch.
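
As a rough sketch of the two options (2000G is just an example cap below the node's 2TB; the singularity wrapper and the remaining options stay as in the original command):

# Option 1: fresh run, capping what the indexing jobs can request
cactus-graphmap-join /cactus/jobStore --indexMemory 2000G --maxMemory 2000G [other options as before]

# Option 2 (very recent Cactus only): resume the failed run with the lower ceiling
cactus-graphmap-join /cactus/jobStore --restart --maxMemory 2000000000000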

brettChapman commented 8 months ago

@glennhickey I'm using a version from November. I'll try a newer version with those parameters and see how I go.

brettChapman commented 8 months ago

Trying with the latest Cactus I get this error:

File "/home/cactus/cactus_env/bin/cactus-graphmap-join", line 8, in <module>
    sys.exit(main())
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/cactus/refmap/cactus_graphmap_join.py", line 62, in main
    Job.Runner.addToilOptions(parser)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/job.py", line 2194, in addToilOptions
    addOptions(parser, jobstore_as_flag=jobstore_as_flag)
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 645, in addOptions
    check_and_create_default_config_file()
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 460, in check_and_create_default_config_file
    check_and_create_toil_home_dir()
  File "/home/cactus/cactus_env/lib/python3.10/site-packages/toil/common.py", line 445, in check_and_create_toil_home_dir
    raise RuntimeError(f"Cannot create or access Toil configuration directory {TOIL_HOME_DIR}")
RuntimeError: Cannot create or access Toil configuration directory /home/murdoch_brettc/.toil

I think one of the default paths must have changed with Toil. Never got this error before. Can I specify a path for this?

glennhickey commented 8 months ago

That's a new one for me. Looking at the exception, perhaps setting TOIL_HOME_DIR can change it. @adamnovak any idea what's going on here?

adamnovak commented 8 months ago

TOIL_HOME_DIR is just a Toil constant, not an environment variable.

It looks like Toil is getting a path for ~ that it can't actually use:

https://github.com/DataBiosphere/toil/blob/615dacc83cd77812d87e5fce79742a2c6b038a5e/src/toil/common.py#L106

I think HOME is set to /home/murdoch_brettc in an environment where really it would need to be /home/cactus.

Maybe the problem is the --no-home on that Singularity command? Or else not clearing out/properly setting HOME when creating that Singularity container?
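
A quick way to check what HOME resolves to inside the container, and whether it is writable, is something like this (a sketch, reusing the image variable from the command earlier in the thread):

singularity exec --cleanenv --no-home ${CACTUS_IMAGE} \
    bash -c 'echo "HOME=$HOME"; test -w "$HOME" && echo writable || echo "not writable"'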

brettChapman commented 8 months ago

I managed to get around the problem by removing --no-home and setting -H to the cactus/tmp/ directory. The problem before was that it was trying to write to a subfolder which didn't have write permissions.
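
For anyone hitting the same Toil configuration-directory error, the adjusted invocation looks roughly like this (a sketch; the exact path passed to -H is an assumption based on the description above):

singularity exec --cleanenv \
                 -H ${CACTUS_SCRATCH}/tmp \
                 --overlay ${JOBSTORE_IMAGE} \
                 --bind ${CACTUS_SCRATCH}/tmp:/tmp \
                 ${CACTUS_IMAGE} cactus-graphmap-join /cactus/jobStore [options as before]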