ComparativeGenomicsToolkit / cactus

Official home of the genome aligner based upon the notion of Cactus graphs

cactus_consolidated: Job used more disk than requested #1023

Open thekswenson opened 1 year ago

thekswenson commented 1 year ago

I'm running into issues in the cactus_consolidated step.

I ran cactus with the flags: --defaultCores 30 --defaultDisk 5G --maxCores 32 --maxMemory 27G --realTimeLogging.
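For reference, the full invocation was along these lines (the jobstore, sequence file, and output names here are placeholders, not my actual paths):

        # placeholder jobstore/seqfile/output names; flags as listed above
        cactus ./jobstore ./seqfile.txt ./ants.hal \
            --defaultCores 30 --defaultDisk 5G \
            --maxCores 32 --maxMemory 27G \
            --realTimeLogging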

I added the --defaultDisk 5G parameter because of the following "Job used more disk than requested" error, but it did not take care of it:

        [2023-05-24T15:40:24+0200] [MainThread] [W] [toil.fileStores.abstractFileStore] LOG-TO-MASTER: Job used more disk than requested. For CWL, consider increasing the outdirMin requirement, otherwise, consider increasing the disk requirement. Job files/for-job/kind-cactus_cons/instance-tgsljr24/cleanup/file-12ae577070b1401d9a1c2b076ab2d91a/stream used 111.97% disk (2.2 GiB [2404446208B] used, 2.0 GiB [2147483648B] requested).
        Traceback (most recent call last):                
          File "/home/kms/software/miniconda3/envs/cactus/lib/python3.8/site-packages/toil/worker.py", line 403, in workerScript
            job._runner(jobGraph=None, jobStore=jobStore, fileStore=fileStore, defer=defer)                         
          File "/home/kms/software/miniconda3/envs/cactus/lib/python3.8/site-packages/toil/job.py", line 2743, in _runner
            returnValues = self._run(jobGraph=None, fileStore=fileStore)                                            
          File "/home/kms/software/miniconda3/envs/cactus/lib/python3.8/site-packages/toil/job.py", line 2660, in _run
            return self.run(fileStore)                    
          File "/home/kms/software/miniconda3/envs/cactus/lib/python3.8/site-packages/toil/job.py", line 2888, in run
            rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)                                   
          File "/home/kms/software/miniconda3/envs/cactus/lib/python3.8/site-packages/cactus/pipeline/cactus_workflow.py", line 95, in cactus_cons
            messages = cactus_call(check_output=True, returnStdErr=True,                                            
          File "/home/kms/software/miniconda3/envs/cactus/lib/python3.8/site-packages/cactus/shared/common.py", line 845, in cactus_call
            raise RuntimeError("{}Command {} signaled {}: {}".format(sigill_msg, call, signal.Signals(-process.returncode).name, out))
        RuntimeError: Command ['cactus_consolidated', '--sequences', 'Amer /tmp/8541dece3ff95c3984233df072ac1590/fff5/0acd/tmpwxloeghw/Amer.fa Acol /tmp/8541dece3ff95c3984233df072ac1590/fff5/0acd/tmpwxloeghw/Acol.fa Aatr /tmp/8541dece3ff95c3984233df072ac1590/fff5/0acd/tmpwxloeghw/Aatr.fa Astel /tmp/8541dece3ff95c3984233df072ac1590/fff5/0acd/tmpwxloeghw/Astel.fa', '--speciesTree', '(Aatr:1.0,(Astel:1.0,(Amer:1.0,Acol:1.0)Anc3:1.0)Anc2:1.0)Anc1;', '--logLevel', 'INFO', '--alignments', '/tmp/8541dece3ff95c3984233df072ac1590/fff5/0acd/tmpwxloeghw/Anc3_primary.paf', '--params', '/tmp/8541dece3ff95c3984233df072ac1590/fff5/0acd/tmpwxloeghw/Anc3.config.xml', '--outputFile', '/tmp/8541dece3ff95c3984233df072ac1590/fff5/0acd/tmpwxloeghw/Anc3.c2h', '--outputHalFastaFile', '/tmp/8541dece3ff95c3984233df072ac1590/fff5/0acd/tmpwxloeghw/Anc3.c2h.fa', '--outputReferenceFile', '/tmp/8541dece3ff95c3984233df072ac1590/fff5/0acd/tmpwxloeghw/Anc3.ref', '--outgroupEvents', 'Astel Aatr', '--referenceEvent', 'Anc3', '--threads', '32', '--secondaryAlignments', '/tmp/8541dece3ff95c3984233df072ac1590/fff5/0acd/tmpwxloeghw/Anc3_secondary.paf'] signaled SIGKILL: 
        [2023-05-24T15:40:24+0200] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host winters
glennhickey commented 1 year ago

Are you running on a cluster? The "used more disk than requested" message is a warning, not an error. But if your cluster (maybe LSF) is very strict, it could indeed get your job killed. Likewise, using more memory than requested can be a problem on any cluster, including SLURM.

In any case, it does look like your job is getting killed (signaled SIGKILL), and that usually happens due to lack of memory. If you're not on a cluster, you will simply need to find a system with more memory. If you are on a cluster, the disk/memory requirements should be fixed in an upcoming release (we are presently testing on SLURM).
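If you want to confirm whether it was the kernel's OOM killer that sent the SIGKILL, checking the kernel log right after the crash is usually enough; something along these lines (exact commands and permissions vary by distro):

        # Look for OOM-killer entries around the time of the crash.
        dmesg -T | grep -i -E 'out of memory|killed process'
        # On systemd systems, the kernel journal works too:
        journalctl -k --since "1 hour ago" | grep -i -E 'out of memory|oom'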

In all cases, I don't expect any of the --defaultCores 30 --defaultDisk 5G --maxCores 32 --maxMemory 27G --realTimeLogging flags to do anything.

thekswenson commented 1 year ago

Hmmm... I'm running it on my desktop. I have 32G on this machine and am not experiencing any of the usual symptoms associated with lack of memory before the program crashes. That is, my swap does not fill up, and there are no paging/thrashing issues as far as I can tell. I tried to keep my eye on htop and, while I didn't stare at it the whole time, it looks like memory usage peaked at 22G.
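If it would help, I can rerun it under GNU time to get an actual peak-RSS number instead of eyeballing htop; something like this (the paths are placeholders):

        # GNU time's verbose report includes "Maximum resident set size" (peak RSS, in kB).
        /usr/bin/time -v cactus ./jobstore ./seqfile.txt ./ants.hal --maxCores 32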

Is there anything else it could be?

NOTE: I had to add maxCores=32 due to another bug (that is reported here, but I can't find it at the moment). I'll remove the other flags.

thekswenson commented 1 year ago

Is there a way to be sure that it is a memory issue? I kept one eye on the process; it got SIGKILLed after ~30 minutes of running, and I had >15G of memory free at the time.

glennhickey commented 1 year ago

Not sure what else it could be. You can export CACTUS_LOG_MEMORY=1 to track some of the memory usage. You can also check your ulimit -a limits.
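Roughly like this before re-running (the cactus arguments below are just placeholders):

        # Enable cactus's memory-usage logging and inspect current resource limits.
        export CACTUS_LOG_MEMORY=1
        ulimit -a
        cactus ./jobstore ./seqfile.txt ./ants.hal --maxCores 32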

thekswenson commented 1 year ago

Thanks for the tip. After setting export CACTUS_LOG_MEMORY=1, where will I find the log?