ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

out-of-memory error with TrimAndRecurseOnOutgroups #76

Open lassancejm opened 5 years ago

lassancejm commented 5 years ago

Hi,

I am seeing many jobs associated with the TrimAndRecurseOnOutgroups step failing due to lack of memory. I am using Slurm and Toil v3.19.

INFO:toil.leader:Issued job 'TrimAndRecurseOnOutgroups' J/1/jobmfa436 with job batch system ID: 62 and cores: 1, disk: 2.0 G, and memory: 3.0 G
INFO:toil.leader:Job ended successfully: 'TrimAndRecurseOnOutgroups' J/1/jobmfa436
WARNING:toil.leader:The job seems to have left a log file, indicating failure: 'TrimAndRecurseOnOutgroups' J/1/jobmfa436
WARNING:toil.leader:J/1/jobmfa436    INFO:toil.worker:---TOIL WORKER OUTPUT LOG---
WARNING:toil.leader:J/1/jobmfa436    INFO:toil:Running Toil version 3.19.0-0feb1d4d1b4fc66062fc4dbc5d8f7fb046df39e6.
WARNING:toil.leader:J/1/jobmfa436    WARNING:toil.resource:'JTRES_487db7e35980364bd76c90413e988c94' may exist, but is not yet referenced by the worker (KeyError from os.environ[]).
WARNING:toil.leader:J/1/jobmfa436    INFO:toil.lib.bioio:Calculating coverage of cigar file /tmp/toil-540d735d-7b74-4320-89a3-3b8ea7c62d05-1e4ee729-0a33-477c-9099-d1a437c00af4/tmpcalaRv/cd8936ec-1b72-4a15-aa45-a1bdb3f592d0/tmpH6Bts9.tmp on /tmp/toil-540d735d-7b74-4320-89a3-3b8ea7c62d05-1e4ee729-0a33-477c-9099-d1a437c00af4/tmpcalaRv/cd8936ec-1b72-4a15-aa45-a1bdb3f592d0/tmpnVJ0ja.tmp, writing to /tmp/toil-540d735d-7b74-4320-89a3-3b8ea7c62d05-1e4ee729-0a33-477c-9099-d1a437c00af4/tmpcalaRv/cd8936ec-1b72-4a15-aa45-a1bdb3f592d0/tmpvNwxXG.tmp
WARNING:toil.leader:J/1/jobmfa436    INFO:cactus.shared.common:Running the command ['cactus_coverage', u'/tmp/toil-540d735d-7b74-4320-89a3-3b8ea7c62d05-1e4ee729-0a33-477c-9099-d1a437c00af4/tmpcalaRv/cd8936ec-1b72-4a15-aa45-a1bdb3f592d0/tmpnVJ0ja.tmp', u'/tmp/toil-540d735d-7b74-4320-89a3-3b8ea7c62d05-1e4ee729-0a33-477c-9099-d1a437c00af4/tmpcalaRv/cd8936ec-1b72-4a15-aa45-a1bdb3f592d0/tmpH6Bts9.tmp']
WARNING:toil.leader:J/1/jobmfa436    WARNING:toil.fileStore:LOG-TO-MASTER: Job used more disk than requested. Consider modifying the user script to avoid the chance of failure due to incorrectly requested resources. Job a/2/job9y9jN5/g/tmpBWyPnA-_serialiseJob-stream used 128.38% (2.6 GB [2756931584B] used, 2.0 GB [2147483648B] requested) at the end of its run.
WARNING:toil.leader:J/1/jobmfa436    Traceback (most recent call last):
WARNING:toil.leader:J/1/jobmfa436      File "/n/home01/lassance/.conda/envs/ENV_CACTUS/lib/python2.7/site-packages/toil/worker.py", line 324, in workerScript
WARNING:toil.leader:J/1/jobmfa436        job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
WARNING:toil.leader:J/1/jobmfa436      File "/n/home01/lassance/.conda/envs/ENV_CACTUS/lib/python2.7/site-packages/cactus/shared/common.py", line 1094, in _runner
WARNING:toil.leader:J/1/jobmfa436        super(RoundedJob, self)._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
WARNING:toil.leader:J/1/jobmfa436      File "/n/home01/lassance/.conda/envs/ENV_CACTUS/lib/python2.7/site-packages/toil/job.py", line 1351, in _runner
WARNING:toil.leader:J/1/jobmfa436        returnValues = self._run(jobGraph, fileStore)
WARNING:toil.leader:J/1/jobmfa436      File "/n/home01/lassance/.conda/envs/ENV_CACTUS/lib/python2.7/site-packages/toil/job.py", line 1296, in _run
WARNING:toil.leader:J/1/jobmfa436        return self.run(fileStore)
WARNING:toil.leader:J/1/jobmfa436      File "/n/home01/lassance/.conda/envs/ENV_CACTUS/lib/python2.7/site-packages/cactus/blast/blast.py", line 279, in run
WARNING:toil.leader:J/1/jobmfa436        mostRecentResultsFile, outgroupCoverage)
WARNING:toil.leader:J/1/jobmfa436      File "/n/home01/lassance/.conda/envs/ENV_CACTUS/lib/python2.7/site-packages/cactus/blast/blast.py", line 517, in calculateCoverage
WARNING:toil.leader:J/1/jobmfa436        parameters=["cactus_coverage"] + args)
WARNING:toil.leader:J/1/jobmfa436      File "/n/home01/lassance/.conda/envs/ENV_CACTUS/lib/python2.7/site-packages/cactus/shared/common.py", line 1038, in cactus_call
WARNING:toil.leader:J/1/jobmfa436        raise RuntimeError("Command %s failed with output: %s" % (call, output))
WARNING:toil.leader:J/1/jobmfa436    RuntimeError: Command ['cactus_coverage', u'/tmp/toil-540d735d-7b74-4320-89a3-3b8ea7c62d05-1e4ee729-0a33-477c-9099-d1a437c00af4/tmpcalaRv/cd8936ec-1b72-4a15-aa45-a1bdb3f592d0/tmpnVJ0ja.tmp', u'/tmp/toil-540d735d-7b74-4320-89a3-3b8ea7c62d05-1e4ee729-0a33-477c-9099-d1a437c00af4/tmpcalaRv/cd8936ec-1b72-4a15-aa45-a1bdb3f592d0/tmpH6Bts9.tmp'] failed with output: None
WARNING:toil.leader:J/1/jobmfa436    ERROR:toil.worker:Exiting the worker because of a failed job on host holy7b01214.rc.fas.harvard.edu
WARNING:toil.leader:J/1/jobmfa436    WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'TrimAndRecurseOnOutgroups' J/1/jobmfa436 with ID J/1/jobmfa436 to 0
WARNING:toil.leader:Job 'TrimAndRecurseOnOutgroups' J/1/jobmfa436 with ID J/1/jobmfa436 is completely failed
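
For anyone reading the log above: the only explicit resource figure in it is the disk warning, so it may be worth checking the numbers before changing anything. Below is a minimal sketch that pulls the two byte counts out of that warning and recomputes the percentage. The regular expression is my own guess at the message layout based on this single log line, not something taken from Toil.

```python
import re

# Tail of the disk warning copied from the log above.
WARNING = ("used 128.38% (2.6 GB [2756931584B] used, "
           "2.0 GB [2147483648B] requested) at the end of its run.")

# Assumed layout: "[<n>B] used, ... [<n>B] requested" (inferred from one log line).
PATTERN = re.compile(r"\[(\d+)B\] used, .*?\[(\d+)B\] requested")


def parse_usage(message):
    """Return (used_bytes, requested_bytes) found in a resource warning."""
    match = PATTERN.search(message)
    if match is None:
        raise ValueError("no usage figures found in message")
    return int(match.group(1)), int(match.group(2))


used, requested = parse_usage(WARNING)
# Recompute the percentage reported in the log from the raw byte counts.
print("used/requested = %.2f%%" % (100.0 * used / requested))
```

The byte counts also show that "2.0 GB" in this log means 2 * 2**30 bytes.
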
jasonsydes commented 5 years ago

Me too. I ended up altering the code here. I changed this:

        super(TrimAndRecurseOnOutgroups, self).__init__(preemptable=True)

to this:

        memory = 7900000000
        super(TrimAndRecurseOnOutgroups, self).__init__(memory=memory, preemptable=True)

and that has done the trick for me so far. No warranty of course.

lassancejm commented 5 years ago

Thanks, I will give this a try

esrice commented 5 years ago

Just wanted to confirm that I had the same problem and this fix worked. I had to give it about twice that much memory for a mammalian genome, though.
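To put the numbers in this thread side by side, here is a small sketch of the arithmetic. It assumes the "3.0 G" in the original log means 3 GiB (inferred from the disk figure in the same log, where "2.0 GB" is 2147483648 bytes), and it takes "about twice that much" literally as a factor of two.

```python
GIB = float(2 ** 30)  # bytes per gibibyte

hard_coded = 7900000000        # value from the patch above, in bytes
doubled = 2 * hard_coded       # "about twice that much", taken as exactly 2x
originally_issued = 3 * 2 ** 30  # assumption: "memory: 3.0 G" means 3 GiB

for label, value in [("originally issued", originally_issued),
                     ("hard-coded patch", hard_coded),
                     ("mammalian genome", doubled)]:
    # Show each request both as raw bytes and in GiB.
    print("%-18s %12d bytes = %5.2f GiB" % (label, value, value / GIB))
```

So the patched value is roughly 7.4 GiB, and the mammalian run would have needed somewhere around 14.7 GiB under these assumptions.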

diekhans commented 5 years ago

Would you create a pull request with this change?

esrice commented 5 years ago

Sure, just did.

diekhans commented 5 years ago

Thanks!

The new default memory is too high for Travis. Can you come up with a default closer to what was actually used before?

Mark
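
One possible way to keep a modest default for CI while still letting large runs ask for more is an optional override. The sketch below is only an illustration: the environment variable name `CACTUS_TRIM_MEMORY_BYTES` and the helper `trim_memory` are hypothetical and do not exist in Cactus, and the 3 GiB default is an assumption based on the "memory: 3.0 G" shown in the original log.

```python
import os

# Assumption: 3 GiB mirrors the "memory: 3.0 G" the job was issued with in the log.
DEFAULT_TRIM_MEMORY = 3 * 2 ** 30

# Hypothetical variable name; Cactus does not define it.
ENV_VAR = "CACTUS_TRIM_MEMORY_BYTES"


def trim_memory(environ=os.environ):
    """Return the memory request in bytes, honouring an optional override."""
    raw = environ.get(ENV_VAR)
    if raw is None:
        return DEFAULT_TRIM_MEMORY
    value = int(raw)
    if value <= 0:
        raise ValueError("%s must be a positive byte count" % ENV_VAR)
    return value


# The job constructor could then pass the result instead of a literal, e.g.
#   super(TrimAndRecurseOnOutgroups, self).__init__(memory=trim_memory(),
#                                                   preemptable=True)
print(trim_memory({}))                        # default, no override set
print(trim_memory({ENV_VAR: "7900000000"}))   # value used in the patch above
```

With something like this, Travis would see the small default and a cluster user could export the larger value before launching.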

esrice commented 5 years ago

Sure, just added a new commit. I actually did need all that memory to run it, so allocating memory dynamically rather than hard-coding it would be ideal, but I don't have enough experience with Toil to figure out how to do this easily. I'll try to figure it out at some point.
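
As a starting point for the dynamic idea, here is a rough sketch that scales the request with the total size of the input files, with a floor so small test runs still get a sane minimum. Everything in it is an assumption: the scaling factor is made up and not measured, the function name `estimate_memory` is hypothetical, and it presumes the input sizes are known as local files when the job is constructed, which may not match how Cactus passes files around.

```python
import os
import tempfile

FLOOR_BYTES = 3 * 2 ** 30   # assumption: never request less than 3 GiB
BYTES_PER_INPUT_BYTE = 4    # hypothetical scaling factor, not measured


def estimate_memory(paths, factor=BYTES_PER_INPUT_BYTE, floor=FLOOR_BYTES):
    """Estimate a memory request in bytes from the combined input size."""
    total = sum(os.path.getsize(path) for path in paths)
    return max(floor, factor * total)


# Tiny demonstration with a throwaway 1000-byte file.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"A" * 1000)
    demo_path = handle.name

print(estimate_memory([demo_path], floor=0))  # scales with input size
print(estimate_memory([demo_path]))           # small input falls back to the floor
os.remove(demo_path)
```

Since the job constructor already accepts a `memory` value, the result could be passed the same way as the hard-coded number, but the right factor would have to come from real measurements.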

jasonsydes commented 5 years ago

I spent half a day trying to figure this out (maybe 6 months ago; perhaps things have changed since then), but in the end retreated to editing the code. Perhaps the cactus/toil developers could direct us to the correct approach?