DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

Double Mem functionality with batch schedulers toil-cwl-runner #3269

Closed. drkennetz closed this issue 3 years ago.

drkennetz commented 4 years ago

Referencing #2042 here. This is less of an issue and more of an enhancement. On batch systems such as those found in HPC environments, it would be incredibly helpful to have a --doublemem flag for workflows.

In short, the doublemem flag would be passed at runtime on the first instantiation of the workflow. It would tell Toil to capture the failure reasons of jobs, and if the failure reason corresponded to TERM_MEMLIMIT or TERM_ENOMEM, the specific step that failed would be retried with double the initially requested memory. This would be helpful for genomics workflows with specific steps that do not have well-defined memory profiles (I think most commonly seen with structural variant calling or even STAR).

An example of this would be something like:

steps:
  bwa:
    requirements:
      ResourceRequirement:
        ramMax: 5000
...

This would likely fail on about 50% of bwa jobs, because bwa can use anywhere from 4-10G depending on the samples and the version of bwa. By running this with --doublemem, failures due to TERM_MEMLIMIT (this might be LSF-specific) would be captured, so the 50% of failed jobs would automatically be restarted with 10G, effectively making the section look like:

steps:
  bwa:
    requirements:
      ResourceRequirement:
        ramMax: 10000
...
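
Since the flag would be passed when the workflow is first started, the invocation would hypothetically be just one extra option on the usual command line (the flag name and file names below are illustrative; the flag does not exist yet):

    toil-cwl-runner --doublemem workflow.cwl inputs.yml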

Do you guys have thoughts on this?

Thanks, Dennis


drkennetz commented 4 years ago

I would be happy to work on this myself as it is something that I want implemented... haha.

adamnovak commented 4 years ago

The place to look for this might be here (although it's moving in #3250): https://github.com/DataBiosphere/toil/blob/cf2984e9019e8c59528e2f462236cf628dc0bf10/src/toil/jobGraph.py#L135

We already have code to increase memory to --defaultMemory when a job fails.

The trouble might be getting the information that a job ran out of memory specifically into that function, so that it can do more aggressive memory scaling without wasting time when the problem is not memory. We might need to define a more specific BatchJobExitReason here and implement it in the batch systems that can recognize an out-of-memory error.
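
As a rough sketch of that idea (the existing member names and values here are illustrative, not copied from the Toil source), the change could be as small as one new enum member that each batch system maps its own out-of-memory signal onto:

    from enum import Enum

    class BatchJobExitReason(Enum):
        FINISHED = 1   # job ran to completion
        KILLED = 2     # the batch system killed the job
        ERROR = 3      # job failed for a generic reason
        LOST = 4       # the batch system lost track of the job
        MEMLIMIT = 5   # proposed: killed for exceeding its memory reservation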

drkennetz commented 4 years ago

I like the idea of implementing MEMLIMIT as a BatchJobExitReason for schedulers that can handle this. I have started working through the logic of implementing this in LSF, if you'd like me to pilot what that would look like. LSF has a well-defined exit reason, TERM_MEMLIMIT, that is already captured by your code. There should be a reasonably straightforward way to implement a new BatchJobExitReason for that termination reason together with a --doublemem flag that would simply requeue jobs that had this BatchJobExitReason. We have something similar that monitors our Bash pipelines, so I could potentially port some of that logic over.

drkennetz commented 4 years ago

@adamnovak I have a working POC, although I think it could be implemented better. I added the --doubleMem flag to common.py in the appropriate section. In lsf.py (around line 159) I return a different exit code (117, chosen arbitrarily so as not to interfere with real tool exit codes) when the exit_reason is "TERM_MEMLIMIT":

    if "TERM_MEMLIMIT" in exit_reason:
        return 117
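
In context, that check would sit wherever lsf.py already parses the scheduler's reported exit reason; something along these lines (the helper name and surrounding structure are illustrative, not the actual lsf.py code):

    # Arbitrary sentinel discussed above, chosen so it should not collide
    # with ordinary tool exit codes.
    MEMLIMIT_SENTINEL = 117

    def exit_code_from_lsf_status(exit_code, exit_reason):
        # LSF reports TERM_MEMLIMIT when it kills a job for exceeding its
        # memory reservation; surface that as the sentinel so the leader
        # can tell OOM kills apart from ordinary failures.
        if "TERM_MEMLIMIT" in exit_reason:
            return MEMLIMIT_SENTINEL
        return exit_code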

I have also added another entry, MEMLIMIT, to BatchJobExitReason. Then in leader.py, in the method _gatherUpdatedJobs, I map the sentinel exit code back to that reason:

        if exitStatus == 117:
            # 117 is the arbitrary sentinel returned by lsf.py for TERM_MEMLIMIT,
            # so it won't be produced by jobs on other systems.
            exitReason = BatchJobExitReason.MEMLIMIT

Then finally in jobGraph.py I added some additional logic for the case where the exit reason is MEMLIMIT and the doubleMem flag has been set:

        if exitReason == BatchJobExitReason.MEMLIMIT and config.doubleMem:
            self._memory = self.memory * 2
            logger.warning("We have doubled the memory of failed job %s to %s due to the doubleMem flag",
                           self, self._memory)
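
Putting that next to the existing behaviour Adam mentioned (retrying under-provisioned jobs at --defaultMemory), the combined retry logic might look roughly like the sketch below. It builds on the BatchJobExitReason sketch above, config is assumed to carry doubleMem and defaultMemory settings, and none of these names are the actual jobGraph.py code:

    import logging

    logger = logging.getLogger(__name__)

    def memory_for_retry(requested_memory, config, exit_reason=None):
        """Return the memory to request when retrying a failed job (sketch only)."""
        if exit_reason == BatchJobExitReason.MEMLIMIT and getattr(config, "doubleMem", False):
            # Proposed behaviour: the scheduler killed the job for exceeding
            # its memory reservation, so double the request for the retry.
            new_memory = requested_memory * 2
            logger.warning("Doubling memory of failed job to %s", new_memory)
            return new_memory
        if requested_memory < config.defaultMemory:
            # Existing behaviour: bump small requests up to --defaultMemory.
            return config.defaultMemory
        return requested_memory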

I don't think this is the most elegant solution, but it works and would be reproducible on other batch systems with a memory-limit feature. Basically, to reproduce it for another batch scheduler:

  1. Catch the scheduler's memory-limit failure.
  2. Return the 117 sentinel where that scheduler's exit code is normally returned, instead of the default exit code (see the Slurm sketch below for one possible analogue).
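
A hedged sketch of that analogue for a Slurm-backed batch system; the helper name and the state string are assumptions (Slurm's out-of-memory reporting varies by version and configuration), not actual Toil code:

    MEMLIMIT_SENTINEL = 117

    def exit_code_from_slurm_state(job_state, exit_code):
        # Recent Slurm versions report OUT_OF_MEMORY for jobs terminated by
        # the OOM killer; treat that like LSF's TERM_MEMLIMIT.
        if job_state == "OUT_OF_MEMORY":
            return MEMLIMIT_SENTINEL
        return exit_code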

If you think this is sufficient, or just want to take a look, I'm happy to open a PR.

mr-c commented 3 years ago

Fixed in https://github.com/DataBiosphere/toil/pull/3313; thank you @drkennetz!