Han-Cao opened this issue 6 months ago
An update on this issue:
I tried submitting the job directly to SLURM instead of using --batchSystem slurm in Toil. That way it has enough memory to finish filter_paf_deletions, but it then raises another error. I will re-run the whole step to see whether this can be fixed.
[2024-05-13T10:38:40+0800] [MainThread] [I] [toil.statsAndLogging] Enabling realtime logging in Toil
[2024-05-13T10:38:40+0800] [MainThread] [W] [toil.lib.humanize] Deprecated toil method. Please use "toil.lib.conversions.human2bytes()" instead."
[2024-05-13T10:38:40+0800] [MainThread] [W] [toil.lib.humanize] Deprecated toil method. Please use "toil.lib.conversions.human2bytes()" instead."
[2024-05-13T10:38:40+0800] [MainThread] [I] [toil.statsAndLogging] Cactus Command: /beegfs/userhome/hcaoad/Software/cactus-bin-v2.8.2/venv-cactus-v2.8.2/bin/cactus-graphmap cactus-jobstore/ input/seqfile.txt cactus-pangenome//pangenome.sv.gfa cactus-pangenome//pangenome.paf --restart --outputFasta cactus-pangenome//pangenome.sv.gfa.fa.gz --outputGAFDir cactus-pangenome//pangenome.gaf --reference CHM13 GRCh38 --batchLogsDir log/cactus-pangenome/cactus-graphmap/ --workDir tmp/ --coordinationDir /tmp/ --mapCores 32 --maxJobs 32
[2024-05-13T10:38:40+0800] [MainThread] [I] [toil.statsAndLogging] Cactus Commit: 2126683265e419d94b3efaa2f724c9f80024c3ad
[2024-05-13T10:38:40+0800] [MainThread] [I] [toil] Running Toil version 6.1.0-3f9cba3766e52866ea80d0934498f8c8f3129c3f on host cpu05.hpc.cluster.
[2024-05-13T10:38:40+0800] [MainThread] [I] [toil.realtimeLogger] Starting real-time logging.
[2024-05-13T10:38:40+0800] [MainThread] [I] [toil.leader] Issued job 'filter_paf_deletions' kind-filter_paf_deletions/instance-dnvkm07m v11 with job batch system ID: 1 and disk: 27.3 Gi, memory: 54.6 Gi, cores: 32, accelerators: [], preemptible: False
[2024-05-13T10:38:42+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 0 jobs are issued and waiting to run
[2024-05-13T10:42:17+0800] [MainThread] [I] [toil-rt] 2024-05-13 10:42:17.336709: Running the command: "vg convert -r 0 -g tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.gfa -p -T tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.gfa.trans"
[2024-05-13T10:49:11+0800] [MainThread] [I] [toil-rt] 2024-05-13 10:49:11.454337: Successfully ran: "vg convert -r 0 -g tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.gfa -p -T tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.gfa.trans" in 414.0942 seconds with job-memory 54.6 Gi
[2024-05-13T10:49:11+0800] [MainThread] [I] [toil-rt] 2024-05-13 10:49:11.466298: Running the command: "filter-paf-deletions tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.gfa.vg tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.gfa.trans tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.paf -d 10000000 -v -p -t 32 -m 0.01 -s 2.0"
[2024-05-13T11:12:27+0800] [MainThread] [I] [toil-rt] 2024-05-13 11:12:27.006658: Successfully ran: "filter-paf-deletions tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.gfa.vg tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.gfa.trans tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.paf -d 10000000 -v -p -t 32 -m 0.01 -s 2.0" in 1395.4663 seconds with job-memory 54.6 Gi
[2024-05-13T11:38:44+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 0 jobs are issued and waiting to run
[2024-05-13T11:49:27+0800] [MainThread] [I] [toil-rt] 2024-05-13 11:49:27.837946: Running the command: "wc -l tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.paf"
[2024-05-13T11:49:59+0800] [MainThread] [I] [toil-rt] 2024-05-13 11:49:59.001481: Successfully ran: "wc -l tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.paf" in 31.1111 seconds
[2024-05-13T11:49:59+0800] [MainThread] [I] [toil-rt] 2024-05-13 11:49:59.002769: Running the command: "wc -l tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.paf.filter"
[2024-05-13T11:50:25+0800] [MainThread] [I] [toil-rt] 2024-05-13 11:50:25.411969: Successfully ran: "wc -l tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.paf.filter" in 26.3988 seconds
[2024-05-13T11:53:40+0800] [Thread-4 (statsAndLoggingAggregator)] [W] [toil.statsAndLogging] Got message from job at time 05-13-2024 11:53:40: Job used more disk than requested. For CWL, consider increasing the outdirMin requirement, otherwise, consider increasing the disk requirement. Job files/for-job/kind-filter_paf_deletions/instance-dnvkm07m/cleanup/file-6430b5c2ab4a43eca872868ce6bf923c/stream used 180.30% disk (49.2 GiB [52811338752B] used, 27.3 GiB [29290853784B] requested).
[2024-05-13T11:54:07+0800] [MainThread] [I] [toil.leader] Finished toil run successfully.
[2024-05-13T11:54:07+0800] [MainThread] [I] [toil.realtimeLogger] Stopping real-time logging server.
[2024-05-13T11:54:08+0800] [MainThread] [I] [toil.realtimeLogger] Joining real-time logging server thread.
Traceback (most recent call last):
File "/beegfs/userhome/hcaoad/Software/cactus-bin-v2.8.2/venv-cactus-v2.8.2/bin/cactus-graphmap", line 8, in <module>
sys.exit(main())
File "/beegfs/userhome/hcaoad/Software/cactus-bin-v2.8.2/venv-cactus-v2.8.2/lib/python3.10/site-packages/cactus/refmap/cactus_graphmap.py", line 99, in main
graph_map(options)
File "/beegfs/userhome/hcaoad/Software/cactus-bin-v2.8.2/venv-cactus-v2.8.2/lib/python3.10/site-packages/cactus/refmap/cactus_graphmap.py", line 112, in graph_map
paf_id, gfa_fa_id, gaf_id, unfiltered_paf_id, paf_filter_log = toil.restart()
ValueError: too many values to unpack (expected 5)
OK, I think you've identified three issues:
1) Cactus is not asking Toil for enough memory for filter-paf-deletions for this dataset
2) It is difficult to change options when running with --restart
3) --restart totally fails with cactus-graphmap
For 1. and 2., I think your best bet is to run with --doubleMem true. This should in theory restart jobs that run out of memory with twice the memory. I don't see any code changes since v2.7.2, so the different memory values must be due to differences in your data. If you do end up running again, can you please set export CACTUS_LOG_MEMORY=1 first? Then the log will report how much memory it actually used, and if you tell me I can try to revise the estimate.
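To make that concrete, here is roughly what the re-run could look like. This is just the cactus-graphmap command from your log above with the two suggestions added; adjust --batchSystem / --restart to however you end up running it:

# CACTUS_LOG_MEMORY=1 makes the log report how much memory each job actually used.
export CACTUS_LOG_MEMORY=1
# --doubleMem true tells Toil to retry jobs that hit their memory limit with twice the memory.
cactus-graphmap cactus-jobstore/ input/seqfile.txt \
    cactus-pangenome/pangenome.sv.gfa cactus-pangenome/pangenome.paf \
    --outputFasta cactus-pangenome/pangenome.sv.gfa.fa.gz \
    --outputGAFDir cactus-pangenome/pangenome.gaf \
    --reference CHM13 GRCh38 \
    --batchLogsDir log/cactus-pangenome/cactus-graphmap/ \
    --workDir tmp/ --coordinationDir /tmp/ \
    --mapCores 32 --maxJobs 32 \
    --doubleMem true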
For 3., this is definitely a bug: the --restart flag was broken for cactus-graphmap. I've fixed it here: https://github.com/ComparativeGenomicsToolkit/cactus/pull/1384. Thanks for reporting it!
Thanks for your suggestions. I have re-run the whole workflow with the latest cactus_graphmap.py, and it works now.
I tried to add export CACTUS_LOG_MEMORY=1, but it fails because the compute nodes of our HPC cannot access /usr/bin/time. Instead, I wrote a simple script to monitor the memory usage. Since I occupy the whole node for filter-paf-deletions, the total memory usage should be very close to the usage of filter-paf-deletions itself:
while true; do usage=$(free -g | awk 'FNR==2{print $3}') ; [[ $usage -gt 50 ]] && echo $usage; sleep 1s; done
The maximum memory usage is 85G according to the above script.
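In case it's useful to anyone else, here is a variant of that loop that just tracks the peak value instead of printing every sample over 50G (a sketch along the same lines, not exactly what I ran):

peak=0
while true; do
    # "used" memory in GiB, taken from the Mem: row of free
    usage=$(free -g | awk 'FNR==2{print $3}')
    if (( usage > peak )); then
        peak=$usage
        echo "peak so far: ${peak}G"
    fi
    sleep 1s
done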
Great, thanks for reporting the actual memory usage!! I've adjusted the memory requirement in #1387 accordingly.
Hi,
I am running the step-by-step pangenome pipeline and it failed at filter_paf_deletions in cactus-graphmap.
Based on this log line: Exit reason: MEMLIMIT, it should be caused by insufficient memory when submitting to SLURM:
Issued job 'filter_paf_deletions' kind-filter_paf_deletions/instance-dnvkm07m v7 with job batch system ID: 1 and disk: 27.3 Gi, memory: 54.6 Gi, cores: 32, accelerators: [], preemptible: False
Is this a bug in the latest version of Cactus? I have successfully run the pipeline on a smaller sample size using cactus 2.7.2, and it allocated even more memory:
Issued job 'filter_paf_deletions' kind-filter_paf_deletions/instance-00kev09_ v1 with job batch system ID: 7 and disk: 61.2 Gi, memory: 122.4 Gi, cores: 8, accelerators: [], preemptible: False
By the way, is there any method to force the memory allocation when using --restart? I tried some Toil parameters, but it seems they only affect the default/max memory settings. I also tried to specify the memory in TOIL_SLURM_ARGS, but I get: ValueError: Some resource arguments are incompatible: -p cpu --mem 383000M
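(For reference, judging from that error message what I set was along the lines of export TOIL_SLURM_ARGS="-p cpu --mem 383000M"; presumably Toil rejects it because --mem is a resource argument that Toil wants to control itself.)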
Full log:
Thank you