Closed by drkennetz 3 years ago
I would be happy to work on this myself as it is something that I want implemented... haha.
The place to look for this might be here (although it's moving in #3250): https://github.com/DataBiosphere/toil/blob/cf2984e9019e8c59528e2f462236cf628dc0bf10/src/toil/jobGraph.py#L135

We already have code to increase memory to --defaultMemory when a job fails. The trouble might be getting the information that a job ran out of memory specifically into that function, so that it can do more aggressive memory scaling without wasting time when the problem is not memory. We might need to define a more specific BatchJobExitReason here and implement it in the batch systems that can recognize an out-of-memory error.
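As a hedged sketch of what that might look like (the enum members and the mapping helper here are illustrative assumptions, not Toil's actual definitions):

```python
from enum import IntEnum

# Illustrative sketch only; Toil's real BatchJobExitReason has its own members.
class BatchJobExitReason(IntEnum):
    FINISHED = 1   # job ran to completion
    FAILED = 2     # generic failure
    MEMLIMIT = 3   # proposed: the scheduler killed the job for exceeding memory

def classify_exit(scheduler_reason: str) -> BatchJobExitReason:
    """Map a scheduler-specific termination string to a generic exit reason.

    TERM_MEMLIMIT is LSF's out-of-memory termination reason; other batch
    systems would contribute their own equivalents here.
    """
    if "TERM_MEMLIMIT" in scheduler_reason:
        return BatchJobExitReason.MEMLIMIT
    return BatchJobExitReason.FAILED
```

With a dedicated MEMLIMIT value, the retry path can tell "ran out of memory" apart from "failed for some other reason" and only scale memory in the first case.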
I like the idea of implementing memlimit as a BatchJobExitReason for schedulers that can handle this. I have started working through the logic of implementing this in LSF, if you'd like me to pilot what that would look like. LSF has a well-defined exit reason, TERM_MEMLIMIT, that is already captured by your code. There should be a reasonably straightforward way to implement a new BatchJobExitReason for it, together with a --doublemem flag that would simply requeue jobs that failed with this BatchJobExitReason. We have something similar that monitors our BASH pipelines, so I could potentially port some of that logic over.
@adamnovak I have a working POC, although I think it could be implemented better. I added the --doubleMem flag in common.py in the appropriate section. In lsf.py (around line 159) I return a different exit code (117, chosen at random so as not to interfere with anything else) when the exit reason is "TERM_MEMLIMIT":

```python
if "TERM_MEMLIMIT" in exit_reason:
    return 117
```
Then I added another entry, MEMLIMIT, to BatchJobExitReason.
Then in leader.py, in the method _gatherUpdatedJobs, I say:

```python
if exitStatus == 117:  # sentinel value that won't be present on other systems
    exitReason = BatchJobExitReason.MEMLIMIT
```
Finally, in jobGraph.py I added some additional logic for the case where the exit reason is MEMLIMIT and the doubleMem flag has been set:

```python
if exitReason == BatchJobExitReason.MEMLIMIT and config.doubleMem:
    self._memory = self.memory * 2
    logger.warning("We have doubled the memory of failed job %s to %s due to the doubleMem flag",
                   self, self.memory)
```
I don't think this is the most elegant solution, but it works, and the same steps should be reproducible on any other batch system that reports an out-of-memory exit reason.
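The retry rule itself can be sketched independently of any particular scheduler. This is a hypothetical helper (the function name and cap parameter are assumptions, not Toil's API) showing the doubling logic, with a cap so repeated failures cannot request unbounded memory:

```python
MEMLIMIT_EXIT_CODE = 117  # the arbitrary sentinel value from the POC above

def next_memory_request(requested_bytes: int, exit_status: int,
                        double_mem: bool, cap_bytes: int) -> int:
    """Return the memory to request on retry: doubled after an
    out-of-memory failure (up to cap_bytes), unchanged otherwise."""
    if double_mem and exit_status == MEMLIMIT_EXIT_CODE:
        return min(requested_bytes * 2, cap_bytes)
    return requested_bytes
```

For example, a 4 GiB job killed for exceeding its limit would be retried at 8 GiB, while a job that failed for any other reason would keep its original request.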
If you think this is sufficient or just want to take a look I'm happy to PR.
Fixed in https://github.com/DataBiosphere/toil/pull/3313 ; thank you @drkennetz !
Referencing #2042 here. This is less of an issue and more of an enhancement. On batch systems such as HPC environments, it would be incredibly helpful to have a --doublemem flag for workflows. In short, the doublemem flag would be passed at runtime on the first instantiation of the workflow. It would tell Toil to capture the failure reasons of jobs, and if a failure reason corresponded to TERM_MEMLIMIT or TERM_ENOMEM, the specific step that failed would be retried with double the initially requested memory. This would be helpful for genomics workflows with steps that do not have well-defined memory profiles (I think most commonly seen with structural variant callers or even STAR).
An example of this would be something like:
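The original snippet was not preserved in this thread, but a hypothetical bwa step with a fixed memory request might look something like this (the job-spec fields and helper are illustrative, not Toil's actual API):

```python
# Hypothetical workflow step: bwa alignment with a fixed 5G memory request.
bwa_step = {
    "command": "bwa mem -t 4 ref.fa reads.fq > aln.sam",
    "memory": "5G",  # fixed request; actual bwa usage varies by sample
    "cores": 4,
}

def with_doubled_memory(step: dict) -> dict:
    """Return a copy of the step with its memory request doubled."""
    gigs = int(step["memory"].rstrip("G"))
    return {**step, "memory": f"{gigs * 2}G"}
```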
This would likely fail on about 50% of bwa jobs, because bwa can use anywhere from 4-10G depending on the samples and the version of bwa. By running this with --doublemem, failures due to TERM_MEMLIMIT (this might be LSF-specific) would be captured, so that 50% of failed jobs would automatically be restarted with double the requested memory (10G). Do you guys have thoughts on this?

Thanks, Dennis
Issue is synchronized with a Jira Task (Issue Number: TOIL-689).