ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

filter_paf_deletions requests insufficient memory #1383

Open Han-Cao opened 6 months ago

Han-Cao commented 6 months ago

Hi,

I am running the step-by-step pangenome pipeline and it failed at the filter_paf_deletions step in cactus-graphmap.

Based on the log line Exit reason: MEMLIMIT, the failure should be caused by insufficient memory being requested when the job was submitted to SLURM: Issued job 'filter_paf_deletions' kind-filter_paf_deletions/instance-dnvkm07m v7 with job batch system ID: 1 and disk: 27.3 Gi, memory: 54.6 Gi, cores: 32, accelerators: [], preemptible: False.

Is this a bug in the latest version of cactus? I previously ran the pipeline successfully on a smaller sample size with cactus 2.7.2, and there it allocated even more memory: Issued job 'filter_paf_deletions' kind-filter_paf_deletions/instance-00kev09_ v1 with job batch system ID: 7 and disk: 61.2 Gi, memory: 122.4 Gi, cores: 8, accelerators: [], preemptible: False

By the way, is there any way to force a specific memory allocation when using --restart? I tried some Toil parameters, but they seem to only affect the default/max memory settings. I also tried specifying the memory in TOIL_SLURM_ARGS, but I get ValueError: Some resource arguments are incompatible: -p cpu --mem 383000M
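
To illustrate the two approaches (this is only a sketch; the values are examples for our cluster, and the Toil flags shown are simply the ones I understand to control the default/max memory):

# Approach 1: Toil's generic memory flags added to the cactus-graphmap --restart
# command; these appear to change only the default/max memory settings, not what
# is requested for the already-created filter_paf_deletions job.
#   --defaultMemory 128G --maxMemory 383000M

# Approach 2: passing SLURM arguments directly; Toil rejects resource arguments here.
export TOIL_SLURM_ARGS="-p cpu --mem 383000M"
# -> ValueError: Some resource arguments are incompatible: -p cpu --mem 383000M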

Full log:

[2024-05-13T10:08:10+0800] [MainThread] [I] [toil.statsAndLogging] Enabling realtime logging in Toil
[2024-05-13T10:08:10+0800] [MainThread] [W] [toil.lib.humanize] Deprecated toil method.  Please use "toil.lib.conversions.human2bytes()" instead."
[2024-05-13T10:08:10+0800] [MainThread] [W] [toil.lib.humanize] Deprecated toil method.  Please use "toil.lib.conversions.human2bytes()" instead."
[2024-05-13T10:08:10+0800] [MainThread] [I] [toil.statsAndLogging] Cactus Command: /beegfs/userhome/hcaoad/Software/cactus-bin-v2.8.2/venv-cactus-v2.8.2/bin/cactus-graphmap cactus-jobstore/ input/seqfile.txt cactus-pangenome//pangenome.sv.gfa cactus-pangenome//pangenome.paf --restart --outputFasta cactus-pangenome//pangenome.sv.gfa.fa.gz --outputGAFDir cactus-pangenome//pangenome.gaf --reference CHM13 GRCh38 --batchLogsDir log/cactus-pangenome/cactus-graphmap/ --batchSystem slurm --caching=false --workDir tmp/ --coordinationDir /tmp/ --mapCores 1 --maxJobs 32
[2024-05-13T10:08:10+0800] [MainThread] [I] [toil.statsAndLogging] Cactus Commit: 2126683265e419d94b3efaa2f724c9f80024c3ad
[2024-05-13T10:08:10+0800] [MainThread] [C] [toil.jobStores.abstractJobStore] Repairing job: kind-minigraph_workflow/instance-1vrxigr9
[2024-05-13T10:08:10+0800] [MainThread] [C] [toil.jobStores.abstractJobStore] Repairing job: kind-merge_pafs/instance-65q6vdsc
[2024-05-13T10:08:11+0800] [MainThread] [C] [toil.jobStores.abstractJobStore] Repairing job: kind-minigraph_map_all/instance-xcx9tivl
[2024-05-13T10:08:11+0800] [MainThread] [C] [toil.jobStores.abstractJobStore] Repairing job: kind-filter_paf_deletions/instance-dnvkm07m
[2024-05-13T10:08:11+0800] [MainThread] [I] [toil] Running Toil version 6.1.0-3f9cba3766e52866ea80d0934498f8c8f3129c3f on host mas01.hpc.cluster.
[2024-05-13T10:08:11+0800] [MainThread] [I] [toil.realtimeLogger] Starting real-time logging.
[2024-05-13T10:08:12+0800] [MainThread] [I] [toil.leader] Issued job 'filter_paf_deletions' kind-filter_paf_deletions/instance-dnvkm07m v7 with job batch system ID: 1 and disk: 27.3 Gi, memory: 54.6 Gi, cores: 32, accelerators: [], preemptible: False
[2024-05-13T10:08:14+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 0 jobs are issued and waiting to run
[2024-05-13T10:12:21+0800] [MainThread] [I] [toil-rt] 2024-05-13 10:12:21.871181: Running the command: "vg convert -r 0 -g tmp/toilwf-8488c232b791512ea35a47e0f5723491/91ad/08ba/tmp921_nl1d/mg.gfa -p -T tmp/toilwf-8488c232b791512ea35a47e0f5723491/91ad/08ba/tmp921_nl1d/mg.gfa.trans"
[2024-05-13T10:19:17+0800] [MainThread] [I] [toil-rt] 2024-05-13 10:19:17.752652: Successfully ran: "vg convert -r 0 -g tmp/toilwf-8488c232b791512ea35a47e0f5723491/91ad/08ba/tmp921_nl1d/mg.gfa -p -T tmp/toilwf-8488c232b791512ea35a47e0f5723491/91ad/08ba/tmp921_nl1d/mg.gfa.trans" in 415.8537 seconds with job-memory 54.6 Gi
[2024-05-13T10:19:17+0800] [MainThread] [I] [toil-rt] 2024-05-13 10:19:17.764336: Running the command: "filter-paf-deletions tmp/toilwf-8488c232b791512ea35a47e0f5723491/91ad/08ba/tmp921_nl1d/mg.gfa.vg tmp/toilwf-8488c232b791512ea35a47e0f5723491/91ad/08ba/tmp921_nl1d/mg.gfa.trans tmp/toilwf-8488c232b791512ea35a47e0f5723491/91ad/08ba/tmp921_nl1d/mg.paf -d 10000000 -v -p -t 32 -m 0.01 -s 2.0"
[2024-05-13T10:25:59+0800] [MainThread] [W] [toil.leader] Job failed with exit value 253: 'filter_paf_deletions' kind-filter_paf_deletions/instance-dnvkm07m v7
Exit reason: MEMLIMIT

Thank you

Han-Cao commented 6 months ago

An update on this issue:

I tried submitting the job directly to SLURM instead of using --batchSystem slurm in Toil. With the whole node's memory available, filter_paf_deletions finished, but the run then raised another error (the traceback at the end of the log below). I will re-run the whole step to see whether this can be fixed.
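
For reference, the direct submission was roughly of this shape (only a sketch; the partition, memory, and core count are what our node provides, and the cactus arguments are the same as in the log below):

sbatch -p cpu --mem 383000M -c 32 --wrap \
  "cactus-graphmap cactus-jobstore/ input/seqfile.txt \
   cactus-pangenome//pangenome.sv.gfa cactus-pangenome//pangenome.paf --restart \
   --outputFasta cactus-pangenome//pangenome.sv.gfa.fa.gz \
   --outputGAFDir cactus-pangenome//pangenome.gaf \
   --reference CHM13 GRCh38 --batchLogsDir log/cactus-pangenome/cactus-graphmap/ \
   --workDir tmp/ --coordinationDir /tmp/ --mapCores 32 --maxJobs 32"

The log of that run: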

[2024-05-13T10:38:40+0800] [MainThread] [I] [toil.statsAndLogging] Enabling realtime logging in Toil
[2024-05-13T10:38:40+0800] [MainThread] [W] [toil.lib.humanize] Deprecated toil method.  Please use "toil.lib.conversions.human2bytes()" instead."
[2024-05-13T10:38:40+0800] [MainThread] [W] [toil.lib.humanize] Deprecated toil method.  Please use "toil.lib.conversions.human2bytes()" instead."
[2024-05-13T10:38:40+0800] [MainThread] [I] [toil.statsAndLogging] Cactus Command: /beegfs/userhome/hcaoad/Software/cactus-bin-v2.8.2/venv-cactus-v2.8.2/bin/cactus-graphmap cactus-jobstore/ input/seqfile.txt cactus-pangenome//pangenome.sv.gfa cactus-pangenome//pangenome.paf --restart --outputFasta cactus-pangenome//pangenome.sv.gfa.fa.gz --outputGAFDir cactus-pangenome//pangenome.gaf --reference CHM13 GRCh38 --batchLogsDir log/cactus-pangenome/cactus-graphmap/ --workDir tmp/ --coordinationDir /tmp/ --mapCores 32 --maxJobs 32
[2024-05-13T10:38:40+0800] [MainThread] [I] [toil.statsAndLogging] Cactus Commit: 2126683265e419d94b3efaa2f724c9f80024c3ad
[2024-05-13T10:38:40+0800] [MainThread] [I] [toil] Running Toil version 6.1.0-3f9cba3766e52866ea80d0934498f8c8f3129c3f on host cpu05.hpc.cluster.
[2024-05-13T10:38:40+0800] [MainThread] [I] [toil.realtimeLogger] Starting real-time logging.
[2024-05-13T10:38:40+0800] [MainThread] [I] [toil.leader] Issued job 'filter_paf_deletions' kind-filter_paf_deletions/instance-dnvkm07m v11 with job batch system ID: 1 and disk: 27.3 Gi, memory: 54.6 Gi, cores: 32, accelerators: [], preemptible: False
[2024-05-13T10:38:42+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 0 jobs are issued and waiting to run
[2024-05-13T10:42:17+0800] [MainThread] [I] [toil-rt] 2024-05-13 10:42:17.336709: Running the command: "vg convert -r 0 -g tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.gfa -p -T tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.gfa.trans"
[2024-05-13T10:49:11+0800] [MainThread] [I] [toil-rt] 2024-05-13 10:49:11.454337: Successfully ran: "vg convert -r 0 -g tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.gfa -p -T tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.gfa.trans" in 414.0942 seconds with job-memory 54.6 Gi
[2024-05-13T10:49:11+0800] [MainThread] [I] [toil-rt] 2024-05-13 10:49:11.466298: Running the command: "filter-paf-deletions tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.gfa.vg tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.gfa.trans tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.paf -d 10000000 -v -p -t 32 -m 0.01 -s 2.0"
[2024-05-13T11:12:27+0800] [MainThread] [I] [toil-rt] 2024-05-13 11:12:27.006658: Successfully ran: "filter-paf-deletions tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.gfa.vg tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.gfa.trans tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.paf -d 10000000 -v -p -t 32 -m 0.01 -s 2.0" in 1395.4663 seconds with job-memory 54.6 Gi
[2024-05-13T11:38:44+0800] [MainThread] [I] [toil.leader] 1 jobs are running, 0 jobs are issued and waiting to run
[2024-05-13T11:49:27+0800] [MainThread] [I] [toil-rt] 2024-05-13 11:49:27.837946: Running the command: "wc -l tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.paf"
[2024-05-13T11:49:59+0800] [MainThread] [I] [toil-rt] 2024-05-13 11:49:59.001481: Successfully ran: "wc -l tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.paf" in 31.1111 seconds
[2024-05-13T11:49:59+0800] [MainThread] [I] [toil-rt] 2024-05-13 11:49:59.002769: Running the command: "wc -l tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.paf.filter"
[2024-05-13T11:50:25+0800] [MainThread] [I] [toil-rt] 2024-05-13 11:50:25.411969: Successfully ran: "wc -l tmp/toilwf-8488c232b791512ea35a47e0f5723491/3d7e/833d/tmps71tz21f/mg.paf.filter" in 26.3988 seconds
[2024-05-13T11:53:40+0800] [Thread-4 (statsAndLoggingAggregator)] [W] [toil.statsAndLogging] Got message from job at time 05-13-2024 11:53:40: Job used more disk than requested. For CWL, consider increasing the outdirMin requirement, otherwise, consider increasing the disk requirement. Job files/for-job/kind-filter_paf_deletions/instance-dnvkm07m/cleanup/file-6430b5c2ab4a43eca872868ce6bf923c/stream used 180.30% disk (49.2 GiB [52811338752B] used, 27.3 GiB [29290853784B] requested).
[2024-05-13T11:54:07+0800] [MainThread] [I] [toil.leader] Finished toil run successfully.
[2024-05-13T11:54:07+0800] [MainThread] [I] [toil.realtimeLogger] Stopping real-time logging server.
[2024-05-13T11:54:08+0800] [MainThread] [I] [toil.realtimeLogger] Joining real-time logging server thread.
Traceback (most recent call last):
  File "/beegfs/userhome/hcaoad/Software/cactus-bin-v2.8.2/venv-cactus-v2.8.2/bin/cactus-graphmap", line 8, in <module>
    sys.exit(main())
  File "/beegfs/userhome/hcaoad/Software/cactus-bin-v2.8.2/venv-cactus-v2.8.2/lib/python3.10/site-packages/cactus/refmap/cactus_graphmap.py", line 99, in main
    graph_map(options)
  File "/beegfs/userhome/hcaoad/Software/cactus-bin-v2.8.2/venv-cactus-v2.8.2/lib/python3.10/site-packages/cactus/refmap/cactus_graphmap.py", line 112, in graph_map
    paf_id, gfa_fa_id, gaf_id, unfiltered_paf_id, paf_filter_log = toil.restart()
ValueError: too many values to unpack (expected 5)
glennhickey commented 6 months ago

OK, I think you've identified three issues:

1) Cactus is not asking Toil for enough memory for filter-paf-deletions for this dataset.
2) It is difficult to change options when running with --restart.
3) --restart totally fails with cactus-graphmap.

For 1. and 2., I think your best bet is to run with --doubleMem true. This should, in theory, restart jobs that run out of memory with twice as much memory. I don't see any code changes here since v2.7.2, so the different memory values must come from differences in your data. If you do end up running again, can you please set export CACTUS_LOG_MEMORY=1 first? Then the log will report how much memory was actually used, and if you tell me the number I can try to revise the estimate.
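
Roughly, that would look like the following (just a sketch that reuses the arguments from your first log, with those two changes added; adjust paths and options as needed):

export CACTUS_LOG_MEMORY=1   # makes the log report the actual memory used by each wrapped command
cactus-graphmap cactus-jobstore/ input/seqfile.txt cactus-pangenome//pangenome.sv.gfa cactus-pangenome//pangenome.paf \
    --restart --batchSystem slurm --doubleMem true --caching=false \
    --outputFasta cactus-pangenome//pangenome.sv.gfa.fa.gz --outputGAFDir cactus-pangenome//pangenome.gaf \
    --reference CHM13 GRCh38 --batchLogsDir log/cactus-pangenome/cactus-graphmap/ \
    --workDir tmp/ --coordinationDir /tmp/ --mapCores 1 --maxJobs 32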

For 3., this is definitely a bug: the --restart flag was broken for cactus-graphmap. I've fixed it here: https://github.com/ComparativeGenomicsToolkit/cactus/pull/1384. Thanks for reporting it!

Han-Cao commented 6 months ago

Thanks for your suggestions. I re-ran the whole workflow with the latest cactus_graphmap.py, and it works now.

I tried adding export CACTUS_LOG_MEMORY=1, but it fails because the compute nodes of our HPC cannot access /usr/bin/time. Instead, I wrote a simple script to monitor memory usage. Since I occupy the whole node for filter-paf-deletions, the total node memory usage should be very close to that of filter-paf-deletions:

while true; do usage=$(free -g | awk 'FNR==2{print $3}') ; [[ $usage -gt 50 ]] && echo $usage; sleep 1s; done

The maximum memory usage was 85G according to the above script.
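
In case it is useful, here is a variant that only reports the running peak (same caveat: free shows whole-node usage, so it slightly overestimates the job itself):

peak=0
while true; do
    usage=$(free -g | awk 'FNR==2{print $3}')   # used memory in GiB for the whole node
    if [[ $usage -gt $peak ]]; then peak=$usage; echo "peak: ${peak}G"; fi
    sleep 1
done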

glennhickey commented 6 months ago

Great, thanks for reporting the actual memory usage!! I've adjusted the memory requirement in #1387 accordingly.