DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

Running cactus on sge cluster where memory is product of h_vmem and cores #5063

Closed DustinSokolowski closed 1 month ago

DustinSokolowski commented 2 months ago

Hey! I hope you are doing well, and thank you for your earlier help. I am at a later stage of running Cactus on my SGE cluster. My cluster is set up so that the total allocated memory is the product of the requested memory and the number of cores. For example, with the directives below, it is asking for 24 GB × 16 = 384 GB of memory.

#$ -l h_vmem=24G,h_rt=200:00:00,h_stack=32M
#$ -pe smp 16


So, the problem is that if I try to run cactus:
`Issued job 'cactus_cons' kind-cactus_cons/instance-n5u4p7ma v1 with job batch system ID: 1 and disk: 26.6 Gi, memory: 223.5 Gi, cores: 16, accelerators: [], preemptible: False`

The job gets queued forever, since the scheduler thinks I'm asking for 3584 GB of memory.

That said, if I use a memory request that works for our cluster (e.g., below):

`cactus jobstore cactus_in.txt target_ref.hal --binariesMode local --maxCores 16 --maxMemory 24G --realTimeLogging True --batchSystem grid_engine --workDir ./ --restart `

Toil throws an error, because it treats the 24G as a hard total limit and the ~240 GB job exceeds it, so it won't let me submit:

toil.batchSystems.abstractBatchSystem.InsufficientSystemResources: The job 'cactus_cons' kind-cactus_cons/instance-n5u4p7ma v1 is requesting 240000000000 bytes of memory, more than the maximum of 24000000000 bytes of memory that GridEngineBatchSystem was configured with, or enforced by --maxMemory.


Do you know if TOIL has a workaround?
Best,
Dustin

stxue1 commented 2 months ago

It sounds like there is a disagreement between Toil and the batch system: Toil treats a job's memory request as an absolute total, while your batch system interprets it as per-core and multiplies it by the slot count. The InsufficientSystemResources error happens because one of the cactus jobs requires about 240 GB, so if --maxMemory is set below that value, that job can never be scheduled.

I'm not aware of an existing way to get around this in Toil, unfortunately. The fix is probably to teach Toil that the batch system treats the allocated memory as the product of cores and the per-slot memory value, and to account for that when submitting jobs.
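To make the mismatch concrete: for a job like cactus_cons (~240 GB across 16 cores), a submission that respects this cluster's memory-is-product rule would have to look roughly like the lines below. This is a hand-written illustration of the desired behaviour, not what Toil currently emits, and the worker script name is just a placeholder.

# cactus_cons wants ~240 GB in total on 16 slots; because the cluster multiplies
# h_vmem by the slot count, the per-slot request has to be 240 / 16 = 15 GB.
qsub -pe smp 16 -l h_vmem=15G,h_rt=200:00:00 run_toil_worker.sh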

stxue1 commented 2 months ago

I made a quick fix at https://github.com/DataBiosphere/toil/tree/issues/5063-memory-product

I haven't really tested it, but from poking around, I think this should solve this issue.

pip3 install 'toil @ git+https://github.com/DataBiosphere/toil.git@issues/5063-memory-product'

Passing in --memoryIsProduct should enable the fix.

cactus jobstore cactus_in.txt target_ref.hal --binariesMode local --maxCores 16 --realTimeLogging True --batchSystem grid_engine --workDir ./ --restart --memoryIsProduct

I'd remove the --maxMemory 24G argument or set it to something higher (cactus probably wants at least 240GB) before running.
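For reference, the two steps together might look something like this; the 300G ceiling is only an illustrative value comfortably above the ~240 GB that cactus_cons requests, not a specific recommendation:

pip3 install 'toil @ git+https://github.com/DataBiosphere/toil.git@issues/5063-memory-product'
cactus jobstore cactus_in.txt target_ref.hal --binariesMode local --maxCores 16 --maxMemory 300G --realTimeLogging True --batchSystem grid_engine --workDir ./ --restart --memoryIsProduct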

DustinSokolowski commented 2 months ago

Hey!

Thank you so much, I will test it and update you ASAP.

Best, Dustin

DustinSokolowski commented 2 months ago

Hey!

So far so good, I'll close once the job runs successfully.

Best, Dustin

DustinSokolowski commented 2 months ago

Hey! I gave it a try a couple of times, and it seems like I've run into another Toil issue that I can't quite reconcile. I've attached the full log. I think Toil is unable to submit a later job with the --memoryIsProduct flag. I thought I saw that this step requires a lot of memory, and that maybe Toil/cactus was asking for more memory than I had, so I ran it once without --consMemory and once with --consMemory 250G.

Do you happen to have any insight into this? Let me know if this is now better suited for the Cactus team.

[2024-08-26T05:37:36-0400] [MainThread] [I] [toil-rt] cactus_consolidated(Anc0): Attaching the sequence to the cactus root 1416, header chr1_GL456210_random with length 169725 and 2 total bases aligned and 0 bases aligned to other chromosome threads
[2024-08-26T05:52:10-0400] [MainThread] [I] [toil-rt] cactus_consolidated(Anc0): Ran cactus caf, 5064 seconds have elapsed
[2024-08-26T05:54:42-0400] [MainThread] [I] [toil-rt] cactus_consolidated(Anc0): Ran extended flowers ready for bar, 5216 seconds have elapsed
[2024-08-26T06:26:23-0400] [MainThread] [I] [toil.leader] 1 jobs are running, 0 jobs are issued and waiting to run
[2024-08-26T07:26:25-0400] [MainThread] [I] [toil.leader] 1 jobs are running, 0 jobs are issued and waiting to run
[2024-08-26T08:26:26-0400] [MainThread] [I] [toil.leader] 1 jobs are running, 0 jobs are issued and waiting to run
[2024-08-26T09:26:28-0400] [MainThread] [I] [toil.leader] 1 jobs are running, 0 jobs are issued and waiting to run
Following jobs do not exist or permissions are not sufficient: 
107883687
[2024-08-26T09:27:50-0400] [MainThread] [W] [toil.leader] Job failed with exit value 137: 'cactus_cons' kind-cactus_cons/instance-n5u4p7ma v8
Exit reason: None
[2024-08-26T09:27:50-0400] [MainThread] [W] [toil.leader] Job 'cactus_cons' kind-cactus_cons/instance-n5u4p7ma v8 has no new version available immediately. The batch system may have killed (or never started) the Toil worker.

product_consmem.txt

stxue1 commented 2 months ago

The job seems to have died or crashed for one reason or another. There seem to be 7 failed jobs in total, but since the workflow crashed on cactus_cons specifically, the other 6 were likely retried successfully. It's possible the job was killed by the batch system, as the error message:

Following jobs do not exist or permissions are not sufficient: 

appears quite often, and that string isn't output anywhere in Toil. Some quick googling suggests this message may come from qstat itself.

My suspicion is that the internal polling system in gridengine.py detected that the job was killed somehow https://github.com/DataBiosphere/toil/blob/f23b1d3f8cf74cb4382d94d6dca4180e423d79df/src/toil/batchSystems/gridengine.py#L91-L98 and propagated it up to the leader https://github.com/DataBiosphere/toil/blob/f23b1d3f8cf74cb4382d94d6dca4180e423d79df/src/toil/leader.py#L281-L294

I think the best clue is the exit code, as that was likely returned from qstat. An exit code of 137 is 128 + 9, i.e. the process was killed with SIGKILL, which usually means the batch system terminated the job. 137 is commonly linked with running out of memory (see the SGE suggestion of memory and the Java exit code 137 example), though I do find it odd that the batch system would complain about running out of memory when the Toil jobs themselves don't report insufficient memory.

If it's not a memory issue (as Toil seems to be fine), it's likely some other resource it ran out of. For example, this Grid Engine-like batch system records exit code 137 when a job exceeds its time limit. On our Slurm cluster we have a number of partitions with different time limits: jobs shorter than one hour go in the short partition, while longer jobs go in medium or long. Perhaps your batch system enforces some sort of time limit as well. The environment variables TOIL_GRIDENGINE_ARGS and TOIL_GRIDENGINE_PE may help in this case, with an example here.
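For example, something along these lines would pass a longer walltime request through to qsub and select the right parallel environment; the 24-hour value is just an illustration, and you'd use whatever limits your queues actually allow:

export TOIL_GRIDENGINE_PE='smp'
export TOIL_GRIDENGINE_ARGS='-l h_rt=24:00:00'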

I'm not familiar with SGE, but it seems like qacct -j is able to recover some information about killed jobs.
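For example, for the job ID that showed up in the log above, something like this should report the exit status, wallclock time, and peak memory that SGE's accounting recorded (how much of this is available depends on the site's accounting setup):

qacct -j 107883687 | grep -E 'failed|exit_status|ru_wallclock|maxvmem'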

stxue1 commented 2 months ago

If it's not a memory issue (as Toil seems to be fine), it likely is some resource it ran out of...

Building on this comment, both cactus_cons attempts in the log failed after almost exactly 5 hours. I'd lean more towards the batch system killing the job for running out of time rather than memory.

[2024-08-25T23:26:13-0400] [MainThread] [I] [toil.leader] Issued job 'cactus_cons' kind-cactus_cons/instance-n5u4p7ma v7 with job batch system ID: 1 and disk: 26.6 Gi, memory: 223.5 Gi, cores: 16, accelerators: [], preemptible: False
...
[2024-08-26T04:27:20-0400] [MainThread] [W] [toil.leader] Job failed with exit value 137: 'cactus_cons' kind-cactus_cons/instance-n5u4p7ma v7
Exit reason: None
...
...
...
[2024-08-26T04:27:21-0400] [MainThread] [I] [toil.leader] Issued job 'cactus_cons' kind-cactus_cons/instance-n5u4p7ma v8 with job batch system ID: 2 and disk: 26.6 Gi, memory: 223.5 Gi, cores: 16, accelerators: [], preemptible: False
...
[2024-08-26T09:27:50-0400] [MainThread] [W] [toil.leader] Job failed with exit value 137: 'cactus_cons' kind-cactus_cons/instance-n5u4p7ma v8
Exit reason: None

DustinSokolowski commented 2 months ago

Hey!

Thank you so much for this, it makes a lot of sense. I'm a bit surprised that cactus is running out of time since I'm testing this with two species, but that is independent of Toil. As you suggested, I'll boost the walltime to 24h via TOIL_GRIDENGINE_ARGS, and I'll look into whether cactus has had runtime issues at this step for previous SGE users (e.g., the job finishing but not moving on to the next step for some reason). I'll update if there are additional issues that may be related to Toil, and close this issue once I get a hal file.

Thanks again! Dustin

DustinSokolowski commented 1 month ago

Success! Thank you for all of your support.

Best, Dustin