ComparativeGenomicsToolkit / cactus

Official home of the Cactus genome aligner, based on the notion of cactus graphs

KTServer connection fails (ST_KV_DATABASE_EXCEPTION) while running Cactus on an SGE cluster #63

Closed. amizeranschi closed this issue 5 years ago.

amizeranschi commented 5 years ago

Hello,

I'm trying to run the evolverMammals example on an SGE cluster (where Docker and Singularity aren't supported) using the latest version of Cactus installed through git. My problem seems similar to another recent issue report: https://github.com/ComparativeGenomicsToolkit/cactus/issues/57.

In case it's relevant, here are a few notes about how I installed Cactus. I first compiled the older version (progressiveCactus) from GitHub, because it automatically downloads and compiles the needed dependencies, including Kyoto Tycoon (the newest version of Cactus doesn't bundle this). I then sourced the environment from progressiveCactus, compiled the newer version of Cactus, and installed it via pip into a freshly created Conda environment.
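A rough sketch of that install sequence, for reproducibility. The URLs, paths, and environment name here are illustrative guesses, not the exact commands used:

```shell
# Illustrative reconstruction of the install steps described above;
# adjust URLs, paths, and environment names to your own setup.
git clone https://github.com/glennhickey/progressiveCactus.git
cd progressiveCactus
git submodule update --init
make                  # builds bundled dependencies, including Kyoto Tycoon
source environment    # puts the bundled tools on PATH / LD_LIBRARY_PATH
cd ..

conda create -n cactus-env python=2.7
source activate cactus-env
git clone https://github.com/ComparativeGenomicsToolkit/cactus.git
cd cactus
pip install --upgrade .
```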

The evolverMammals test works fine for me on a single node, i.e. when running the following through qsub:

cactus --binariesMode local cactusWork evolverMammals-offline.txt evolverMammals.hal --root mr

However, things fail when running distributed across multiple nodes of an SGE queue, as follows:

cactus --binariesMode local cactusWork evolverMammals-offline.txt evolverMammals.hal --root mr --batchSystem gridEngine --workDir /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp --logInfo --logFile cactus.log --maxCores 32 --disableCaching

Below are some of the errors I get. There are multiple retries, but the job never manages to continue successfully. The cluster nodes should be able to communicate with each other, so I'm not sure what could be causing the ST_KV_DATABASE_EXCEPTION messages.

How can I get Cactus and KTServer to work properly when running with --batchSystem gridEngine?
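For reference, a minimal TCP connectivity check that I would expect to succeed from a worker node. This is only a sketch; the host and port are copied from the ktserver log below and will differ between runs:

```python
import socket

def can_reach(host, port, timeout=5.0):
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Host/port copied from the ktserver log below; they change between runs.
print(can_reach("172.16.13.37", 29439, timeout=2.0))
```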

INFO:toil.leader:Issued job 'StartPrimaryDB' D/F/jobwjfXJl with job batch system ID: 150 and cores: 1, disk: 2.0 G, and memory: 3.3 G
INFO:toil.leader:Job ended successfully: 'StartPrimaryDB' D/F/jobwjfXJl
INFO:toil.leader:Issued job 'KtServerService' B/T/jobNX_Wvk with job batch system ID: 151 and cores: 0, disk: 2.0 G, and memory: 2.3 G
INFO:toil.leader:Issued job 'CactusSetupPhase' G/Y/jobmegNP3 with job batch system ID: 152 and cores: 1, disk: 2.0 G, and memory: 3.3 G
INFO:toil.leader:Job ended successfully: 'KtServerService' B/T/jobNX_Wvk
WARNING:toil.leader:The job seems to have left a log file, indicating failure: 'KtServerService' B/T/jobNX_Wvk
WARNING:toil.leader:B/T/jobNX_Wvk    INFO:toil.worker:---TOIL WORKER OUTPUT LOG---
WARNING:toil.leader:B/T/jobNX_Wvk    INFO:toil:Running Toil version 3.18.0-84239d802248a5f4a220e762b3b8ce5cc92af0be.
WARNING:toil.leader:B/T/jobNX_Wvk    WARNING:toil.resource:'JTRES_5d2f846cd67858267ed5af4717d96bda' may exist, but is not yet referenced by the worker (KeyError from os.environ[]).
WARNING:toil.leader:B/T/jobNX_Wvk    WARNING:toil.resource:'JTRES_5d2f846cd67858267ed5af4717d96bda' may exist, but is not yet referenced by the worker (KeyError from os.environ[]).
WARNING:toil.leader:B/T/jobNX_Wvk    INFO:cactus.shared.common:Running the command ['netstat', '-tuplen']
WARNING:toil.leader:B/T/jobNX_Wvk    (No info could be read for "-p": geteuid()=98354 but you should be root.)
WARNING:toil.leader:B/T/jobNX_Wvk    INFO:cactus.shared.common:Running the command ['ktserver', '-port', '29439', '-ls', '-tout', '200000', '-th', '64', '-bgs', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-6386a5c9-5d92-486a-9720-412b1ca610f6/tmpxersIz/e6b71d4f-cc17-405b-9945-bf74e2503b84/t7jtQpA/snapshot', '-bgsc', 'lzo', '-bgsi', '1000000', '-log', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-6386a5c9-5d92-486a-9720-412b1ca610f6/tmpxersIz/e6b71d4f-cc17-405b-9945-bf74e2503b84/tmpdU7AZM.tmp', ':#opts=ls#bnum=30m#msiz=50g#ktopts=p']
WARNING:toil.leader:B/T/jobNX_Wvk    terminate called after throwing an instance of 'std::runtime_error'
WARNING:toil.leader:B/T/jobNX_Wvk      what():  pthread_create
WARNING:toil.leader:B/T/jobNX_Wvk    INFO:toil.lib.bioio:Ktserver running.
WARNING:toil.leader:B/T/jobNX_Wvk    INFO:toil.lib.bioio:Ktserver running.
WARNING:toil.leader:B/T/jobNX_Wvk    INFO:toil.lib.bioio:Ktserver running.
WARNING:toil.leader:B/T/jobNX_Wvk    INFO:cactus.shared.common:Running the command ['ktremotemgr', 'get', '-port', '29439', '-host', '172.16.13.37', 'TERMINATE']
WARNING:toil.leader:B/T/jobNX_Wvk    Process ServerProcess-1:
WARNING:toil.leader:B/T/jobNX_Wvk    Traceback (most recent call last):
WARNING:toil.leader:B/T/jobNX_Wvk      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
WARNING:toil.leader:B/T/jobNX_Wvk        self.run()
WARNING:toil.leader:B/T/jobNX_Wvk      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/pipeline/ktserverControl.py", line 82, in run
WARNING:toil.leader:B/T/jobNX_Wvk        self.tryRun(*self.args, **self.kwargs)
WARNING:toil.leader:B/T/jobNX_Wvk      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/pipeline/ktserverControl.py", line 118, in tryRun
WARNING:toil.leader:B/T/jobNX_Wvk        raise RuntimeError("KTServer failed. Log: %s" % f.read())
WARNING:toil.leader:B/T/jobNX_Wvk    RuntimeError: KTServer failed. Log: 2019-03-08T10:07:26.636823+02:00: [SYSTEM]: ================ [START]: pid=20742
WARNING:toil.leader:B/T/jobNX_Wvk    2019-03-08T10:07:26.637007+02:00: [SYSTEM]: opening a database: path=:#opts=ls#bnum=30m#msiz=50g#ktopts=p
WARNING:toil.leader:B/T/jobNX_Wvk    2019-03-08T10:07:26.638447+02:00: [SYSTEM]: starting the server: expr=:29439
WARNING:toil.leader:B/T/jobNX_Wvk    2019-03-08T10:07:26.638549+02:00: [SYSTEM]: server socket opened: expr=:29439 timeout=200000.0
WARNING:toil.leader:B/T/jobNX_Wvk    2019-03-08T10:07:26.638575+02:00: [SYSTEM]: listening server socket started: fd=4
WARNING:toil.leader:B/T/jobNX_Wvk    
WARNING:toil.leader:B/T/jobNX_Wvk    INFO:cactus.shared.common:Running the command ['ktremotemgr', 'set', '-port', '29439', '-host', '172.16.13.37', 'TERMINATE', '1']
WARNING:toil.leader:B/T/jobNX_Wvk    ktremotemgr: DB::open failed: : 6: network error: connection failed
WARNING:toil.leader:B/T/jobNX_Wvk    Traceback (most recent call last):
WARNING:toil.leader:B/T/jobNX_Wvk      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/worker.py", line 314, in workerScript
WARNING:toil.leader:B/T/jobNX_Wvk        job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
WARNING:toil.leader:B/T/jobNX_Wvk      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/job.py", line 1351, in _runner
WARNING:toil.leader:B/T/jobNX_Wvk        returnValues = self._run(jobGraph, fileStore)
WARNING:toil.leader:B/T/jobNX_Wvk      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/job.py", line 1694, in _run
WARNING:toil.leader:B/T/jobNX_Wvk        returnValues = self.run(fileStore)
WARNING:toil.leader:B/T/jobNX_Wvk      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/job.py", line 1673, in run
WARNING:toil.leader:B/T/jobNX_Wvk        if not service.check():
WARNING:toil.leader:B/T/jobNX_Wvk      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/pipeline/ktserverToil.py", line 55, in check
WARNING:toil.leader:B/T/jobNX_Wvk        raise RuntimeError(msg)
WARNING:toil.leader:B/T/jobNX_Wvk    RuntimeError: Traceback (most recent call last):
WARNING:toil.leader:B/T/jobNX_Wvk      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/pipeline/ktserverControl.py", line 82, in run
WARNING:toil.leader:B/T/jobNX_Wvk        self.tryRun(*self.args, **self.kwargs)
WARNING:toil.leader:B/T/jobNX_Wvk      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/pipeline/ktserverControl.py", line 118, in tryRun
WARNING:toil.leader:B/T/jobNX_Wvk        raise RuntimeError("KTServer failed. Log: %s" % f.read())
WARNING:toil.leader:B/T/jobNX_Wvk    RuntimeError: KTServer failed. Log: 2019-03-08T10:07:26.636823+02:00: [SYSTEM]: ================ [START]: pid=20742
WARNING:toil.leader:B/T/jobNX_Wvk    2019-03-08T10:07:26.637007+02:00: [SYSTEM]: opening a database: path=:#opts=ls#bnum=30m#msiz=50g#ktopts=p
WARNING:toil.leader:B/T/jobNX_Wvk    2019-03-08T10:07:26.638447+02:00: [SYSTEM]: starting the server: expr=:29439
WARNING:toil.leader:B/T/jobNX_Wvk    2019-03-08T10:07:26.638549+02:00: [SYSTEM]: server socket opened: expr=:29439 timeout=200000.0
WARNING:toil.leader:B/T/jobNX_Wvk    2019-03-08T10:07:26.638575+02:00: [SYSTEM]: listening server socket started: fd=4
WARNING:toil.leader:B/T/jobNX_Wvk    
WARNING:toil.leader:B/T/jobNX_Wvk    
WARNING:toil.leader:B/T/jobNX_Wvk    ERROR:toil.worker:Exiting the worker because of a failed job on host haswell-wn37.grid.pub.ro
WARNING:toil.leader:B/T/jobNX_Wvk    WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'KtServerService' B/T/jobNX_Wvk with ID B/T/jobNX_Wvk to 5
INFO:toil.leader:Issued job 'KtServerService' B/T/jobNX_Wvk with job batch system ID: 153 and cores: 0, disk: 2.0 G, and memory: 2.3 G
INFO:toil.leader:Job ended successfully: 'CactusSetupPhase' G/Y/jobmegNP3
WARNING:toil.leader:The job seems to have left a log file, indicating failure: 'CactusSetupPhase' G/Y/jobmegNP3
WARNING:toil.leader:G/Y/jobmegNP3    INFO:toil.worker:---TOIL WORKER OUTPUT LOG---
WARNING:toil.leader:G/Y/jobmegNP3    INFO:toil:Running Toil version 3.18.0-84239d802248a5f4a220e762b3b8ce5cc92af0be.
WARNING:toil.leader:G/Y/jobmegNP3    WARNING:toil.resource:'JTRES_5d2f846cd67858267ed5af4717d96bda' may exist, but is not yet referenced by the worker (KeyError from os.environ[]).
WARNING:toil.leader:G/Y/jobmegNP3    INFO:toil.lib.bioio:Sequences in cactus setup: ['simHuman_chr6', 'simMouse_chr6', 'simRat_chr6', 'simCow_chr6', 'simDog_chr6']
WARNING:toil.leader:G/Y/jobmegNP3    INFO:toil.lib.bioio:Sequences in cactus setup filenames: ['>id=1|simHuman.chr6|0\n', '>id=0|simMouse.chr6\n', '>id=2|simRat.chr6\n', '>id=4|simCow.chr6|0\n', '>id=3|simDog.chr6|0\n']
WARNING:toil.leader:G/Y/jobmegNP3    INFO:cactus.shared.common:Running the command ['cactus_setup', '--speciesTree', '((simHuman_chr6:0.144018,(simMouse_chr6:0.084509,simRat_chr6:0.091589)mr:0.271974)Anc1:0.020593,(simCow_chr6:0.18908,simDog_chr6:0.16303)Anc2:0.032898)Anc0;', '--cactusDisk', '<st_kv_database_conf type="kyoto_tycoon">\n\t\t\t<kyoto_tycoon database_dir="fakepath" host="172.16.13.37" port="29439" />\n\t\t</st_kv_database_conf>\n\t', '--logLevel', 'INFO', '--outgroupEvents', 'simHuman_chr6 simDog_chr6 simCow_chr6', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmprNSNY8/9bdf7175-8ea1-4f43-a01a-815454f61b67/tmp3wGM7F.tmp', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmprNSNY8/9bdf7175-8ea1-4f43-a01a-815454f61b67/tmpqkriEI.tmp', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmprNSNY8/9bdf7175-8ea1-4f43-a01a-815454f61b67/tmpo20GAf.tmp', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmprNSNY8/9bdf7175-8ea1-4f43-a01a-815454f61b67/tmpL6ca4z.tmp', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmprNSNY8/9bdf7175-8ea1-4f43-a01a-815454f61b67/tmp5PbyOe.tmp']
WARNING:toil.leader:G/Y/jobmegNP3    Set log level to INFO
WARNING:toil.leader:G/Y/jobmegNP3    Flower disk name : <st_kv_database_conf type="kyoto_tycoon">
WARNING:toil.leader:G/Y/jobmegNP3               <kyoto_tycoon database_dir="fakepath" host="172.16.13.37" port="29439" />
WARNING:toil.leader:G/Y/jobmegNP3           </st_kv_database_conf>
WARNING:toil.leader:G/Y/jobmegNP3       
WARNING:toil.leader:G/Y/jobmegNP3    Sequence file/directory /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmprNSNY8/9bdf7175-8ea1-4f43-a01a-815454f61b67/tmp3wGM7F.tmp
WARNING:toil.leader:G/Y/jobmegNP3    Sequence file/directory /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmprNSNY8/9bdf7175-8ea1-4f43-a01a-815454f61b67/tmpqkriEI.tmp
WARNING:toil.leader:G/Y/jobmegNP3    Sequence file/directory /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmprNSNY8/9bdf7175-8ea1-4f43-a01a-815454f61b67/tmpo20GAf.tmp
WARNING:toil.leader:G/Y/jobmegNP3    Sequence file/directory /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmprNSNY8/9bdf7175-8ea1-4f43-a01a-815454f61b67/tmpL6ca4z.tmp
WARNING:toil.leader:G/Y/jobmegNP3    Sequence file/directory /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmprNSNY8/9bdf7175-8ea1-4f43-a01a-815454f61b67/tmp5PbyOe.tmp
WARNING:toil.leader:G/Y/jobmegNP3    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 172.16.13.37 with error: network error
WARNING:toil.leader:G/Y/jobmegNP3    Uncaught exception
WARNING:toil.leader:G/Y/jobmegNP3    Traceback (most recent call last):
WARNING:toil.leader:G/Y/jobmegNP3      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/worker.py", line 314, in workerScript
WARNING:toil.leader:G/Y/jobmegNP3        job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
WARNING:toil.leader:G/Y/jobmegNP3      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/shared/common.py", line 1096, in _runner
WARNING:toil.leader:G/Y/jobmegNP3        super(RoundedJob, self)._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
WARNING:toil.leader:G/Y/jobmegNP3      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/job.py", line 1351, in _runner
WARNING:toil.leader:G/Y/jobmegNP3        returnValues = self._run(jobGraph, fileStore)
WARNING:toil.leader:G/Y/jobmegNP3      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/job.py", line 1296, in _run
WARNING:toil.leader:G/Y/jobmegNP3        return self.run(fileStore)
WARNING:toil.leader:G/Y/jobmegNP3      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/pipeline/cactus_workflow.py", line 641, in run
WARNING:toil.leader:G/Y/jobmegNP3        makeEventHeadersAlphaNumeric=self.getOptionalPhaseAttrib("makeEventHeadersAlphaNumeric", bool, False))
WARNING:toil.leader:G/Y/jobmegNP3      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/shared/common.py", line 220, in runCactusSetup
WARNING:toil.leader:G/Y/jobmegNP3        parameters=["cactus_setup"] + args + sequences)
WARNING:toil.leader:G/Y/jobmegNP3      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/shared/common.py", line 1040, in cactus_call
WARNING:toil.leader:G/Y/jobmegNP3        raise RuntimeError("Command %s failed with output: %s" % (call, output))
WARNING:toil.leader:G/Y/jobmegNP3    RuntimeError: Command ['cactus_setup', '--speciesTree', '((simHuman_chr6:0.144018,(simMouse_chr6:0.084509,simRat_chr6:0.091589)mr:0.271974)Anc1:0.020593,(simCow_chr6:0.18908,simDog_chr6:0.16303)Anc2:0.032898)Anc0;', '--cactusDisk', '<st_kv_database_conf type="kyoto_tycoon">\n\t\t\t<kyoto_tycoon database_dir="fakepath" host="172.16.13.37" port="29439" />\n\t\t</st_kv_database_conf>\n\t', '--logLevel', 'INFO', '--outgroupEvents', 'simHuman_chr6 simDog_chr6 simCow_chr6', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmprNSNY8/9bdf7175-8ea1-4f43-a01a-815454f61b67/tmp3wGM7F.tmp', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmprNSNY8/9bdf7175-8ea1-4f43-a01a-815454f61b67/tmpqkriEI.tmp', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmprNSNY8/9bdf7175-8ea1-4f43-a01a-815454f61b67/tmpo20GAf.tmp', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmprNSNY8/9bdf7175-8ea1-4f43-a01a-815454f61b67/tmpL6ca4z.tmp', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmprNSNY8/9bdf7175-8ea1-4f43-a01a-815454f61b67/tmp5PbyOe.tmp'] failed with output: 
WARNING:toil.leader:G/Y/jobmegNP3    ERROR:toil.worker:Exiting the worker because of a failed job on host haswell-wn41.grid.pub.ro
WARNING:toil.leader:G/Y/jobmegNP3    WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'CactusSetupPhase' G/Y/jobmegNP3 with ID G/Y/jobmegNP3 to 5
INFO:toil.leader:Issued job 'CactusSetupPhase' G/Y/jobmegNP3 with job batch system ID: 154 and cores: 1, disk: 2.0 G, and memory: 3.3 G
INFO:toil.leader:Job ended successfully: 'CactusSetupPhase' G/Y/jobmegNP3
WARNING:toil.leader:The job seems to have left a log file, indicating failure: 'CactusSetupPhase' G/Y/jobmegNP3
WARNING:toil.leader:G/Y/jobmegNP3    INFO:toil.worker:---TOIL WORKER OUTPUT LOG---
WARNING:toil.leader:G/Y/jobmegNP3    INFO:toil:Running Toil version 3.18.0-84239d802248a5f4a220e762b3b8ce5cc92af0be.
WARNING:toil.leader:G/Y/jobmegNP3    WARNING:toil.resource:'JTRES_5d2f846cd67858267ed5af4717d96bda' may exist, but is not yet referenced by the worker (KeyError from os.environ[]).
WARNING:toil.leader:G/Y/jobmegNP3    INFO:toil.lib.bioio:Sequences in cactus setup: ['simHuman_chr6', 'simMouse_chr6', 'simRat_chr6', 'simCow_chr6', 'simDog_chr6']
WARNING:toil.leader:G/Y/jobmegNP3    INFO:toil.lib.bioio:Sequences in cactus setup filenames: ['>id=1|simHuman.chr6|0\n', '>id=0|simMouse.chr6\n', '>id=2|simRat.chr6\n', '>id=4|simCow.chr6|0\n', '>id=3|simDog.chr6|0\n']
WARNING:toil.leader:G/Y/jobmegNP3    INFO:cactus.shared.common:Running the command ['cactus_setup', '--speciesTree', '((simHuman_chr6:0.144018,(simMouse_chr6:0.084509,simRat_chr6:0.091589)mr:0.271974)Anc1:0.020593,(simCow_chr6:0.18908,simDog_chr6:0.16303)Anc2:0.032898)Anc0;', '--cactusDisk', '<st_kv_database_conf type="kyoto_tycoon">\n\t\t\t<kyoto_tycoon database_dir="fakepath" host="172.16.13.37" port="29439" />\n\t\t</st_kv_database_conf>\n\t', '--logLevel', 'INFO', '--outgroupEvents', 'simHuman_chr6 simDog_chr6 simCow_chr6', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-88711676-5948-47d0-acd3-569974301115/tmpKeNbt7/ef794177-3ebd-4c51-8ae5-971a58ac7d96/tmp6DYzJV.tmp', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-88711676-5948-47d0-acd3-569974301115/tmpKeNbt7/ef794177-3ebd-4c51-8ae5-971a58ac7d96/tmp4U7wPE.tmp', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-88711676-5948-47d0-acd3-569974301115/tmpKeNbt7/ef794177-3ebd-4c51-8ae5-971a58ac7d96/tmp2oM2za.tmp', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-88711676-5948-47d0-acd3-569974301115/tmpKeNbt7/ef794177-3ebd-4c51-8ae5-971a58ac7d96/tmpUgFVai.tmp', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-88711676-5948-47d0-acd3-569974301115/tmpKeNbt7/ef794177-3ebd-4c51-8ae5-971a58ac7d96/tmpWG2t9T.tmp']
WARNING:toil.leader:G/Y/jobmegNP3    Set log level to INFO
WARNING:toil.leader:G/Y/jobmegNP3    Flower disk name : <st_kv_database_conf type="kyoto_tycoon">
WARNING:toil.leader:G/Y/jobmegNP3               <kyoto_tycoon database_dir="fakepath" host="172.16.13.37" port="29439" />
WARNING:toil.leader:G/Y/jobmegNP3           </st_kv_database_conf>
WARNING:toil.leader:G/Y/jobmegNP3       
WARNING:toil.leader:G/Y/jobmegNP3    Sequence file/directory /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-88711676-5948-47d0-acd3-569974301115/tmpKeNbt7/ef794177-3ebd-4c51-8ae5-971a58ac7d96/tmp6DYzJV.tmp
WARNING:toil.leader:G/Y/jobmegNP3    Sequence file/directory /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-88711676-5948-47d0-acd3-569974301115/tmpKeNbt7/ef794177-3ebd-4c51-8ae5-971a58ac7d96/tmp4U7wPE.tmp
WARNING:toil.leader:G/Y/jobmegNP3    Sequence file/directory /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-88711676-5948-47d0-acd3-569974301115/tmpKeNbt7/ef794177-3ebd-4c51-8ae5-971a58ac7d96/tmp2oM2za.tmp
WARNING:toil.leader:G/Y/jobmegNP3    Sequence file/directory /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-88711676-5948-47d0-acd3-569974301115/tmpKeNbt7/ef794177-3ebd-4c51-8ae5-971a58ac7d96/tmpUgFVai.tmp
WARNING:toil.leader:G/Y/jobmegNP3    Sequence file/directory /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-88711676-5948-47d0-acd3-569974301115/tmpKeNbt7/ef794177-3ebd-4c51-8ae5-971a58ac7d96/tmpWG2t9T.tmp
WARNING:toil.leader:G/Y/jobmegNP3    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 172.16.13.37 with error: network error
WARNING:toil.leader:G/Y/jobmegNP3    Uncaught exception
WARNING:toil.leader:G/Y/jobmegNP3    Traceback (most recent call last):
WARNING:toil.leader:G/Y/jobmegNP3      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/worker.py", line 314, in workerScript
WARNING:toil.leader:G/Y/jobmegNP3        job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
WARNING:toil.leader:G/Y/jobmegNP3      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/shared/common.py", line 1096, in _runner
WARNING:toil.leader:G/Y/jobmegNP3        super(RoundedJob, self)._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
WARNING:toil.leader:G/Y/jobmegNP3      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/job.py", line 1351, in _runner
WARNING:toil.leader:G/Y/jobmegNP3        returnValues = self._run(jobGraph, fileStore)
WARNING:toil.leader:G/Y/jobmegNP3      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/job.py", line 1296, in _run
WARNING:toil.leader:G/Y/jobmegNP3        return self.run(fileStore)
WARNING:toil.leader:G/Y/jobmegNP3      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/pipeline/cactus_workflow.py", line 641, in run
WARNING:toil.leader:G/Y/jobmegNP3        makeEventHeadersAlphaNumeric=self.getOptionalPhaseAttrib("makeEventHeadersAlphaNumeric", bool, False))
WARNING:toil.leader:G/Y/jobmegNP3      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/shared/common.py", line 220, in runCactusSetup
WARNING:toil.leader:G/Y/jobmegNP3        parameters=["cactus_setup"] + args + sequences)
WARNING:toil.leader:G/Y/jobmegNP3      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/shared/common.py", line 1040, in cactus_call
WARNING:toil.leader:G/Y/jobmegNP3        raise RuntimeError("Command %s failed with output: %s" % (call, output))
WARNING:toil.leader:G/Y/jobmegNP3    RuntimeError: Command ['cactus_setup', '--speciesTree', '((simHuman_chr6:0.144018,(simMouse_chr6:0.084509,simRat_chr6:0.091589)mr:0.271974)Anc1:0.020593,(simCow_chr6:0.18908,simDog_chr6:0.16303)Anc2:0.032898)Anc0;', '--cactusDisk', '<st_kv_database_conf type="kyoto_tycoon">\n\t\t\t<kyoto_tycoon database_dir="fakepath" host="172.16.13.37" port="29439" />\n\t\t</st_kv_database_conf>\n\t', '--logLevel', 'INFO', '--outgroupEvents', 'simHuman_chr6 simDog_chr6 simCow_chr6', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-88711676-5948-47d0-acd3-569974301115/tmpKeNbt7/ef794177-3ebd-4c51-8ae5-971a58ac7d96/tmp6DYzJV.tmp', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-88711676-5948-47d0-acd3-569974301115/tmpKeNbt7/ef794177-3ebd-4c51-8ae5-971a58ac7d96/tmp4U7wPE.tmp', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-88711676-5948-47d0-acd3-569974301115/tmpKeNbt7/ef794177-3ebd-4c51-8ae5-971a58ac7d96/tmp2oM2za.tmp', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-88711676-5948-47d0-acd3-569974301115/tmpKeNbt7/ef794177-3ebd-4c51-8ae5-971a58ac7d96/tmpUgFVai.tmp', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-88711676-5948-47d0-acd3-569974301115/tmpKeNbt7/ef794177-3ebd-4c51-8ae5-971a58ac7d96/tmpWG2t9T.tmp'] failed with output: 
WARNING:toil.leader:G/Y/jobmegNP3    ERROR:toil.worker:Exiting the worker because of a failed job on host haswell-wn35.grid.pub.ro
WARNING:toil.leader:G/Y/jobmegNP3    WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'CactusSetupPhase' G/Y/jobmegNP3 with ID G/Y/jobmegNP3 to 4
INFO:toil.leader:Issued job 'CactusSetupPhase' G/Y/jobmegNP3 with job batch system ID: 155 and cores: 1, disk: 2.0 G, and memory: 3.3 G
INFO:toil.leader:Job ended successfully: 'KtServerService' B/T/jobNX_Wvk
WARNING:toil.leader:The job seems to have left a log file, indicating failure: 'KtServerService' B/T/jobNX_Wvk
WARNING:toil.leader:B/T/jobNX_Wvk    INFO:toil.worker:---TOIL WORKER OUTPUT LOG---
WARNING:toil.leader:B/T/jobNX_Wvk    INFO:toil:Running Toil version 3.18.0-84239d802248a5f4a220e762b3b8ce5cc92af0be.
WARNING:toil.leader:B/T/jobNX_Wvk    WARNING:toil.resource:'JTRES_5d2f846cd67858267ed5af4717d96bda' may exist, but is not yet referenced by the worker (KeyError from os.environ[]).
WARNING:toil.leader:B/T/jobNX_Wvk    WARNING:toil.resource:'JTRES_5d2f846cd67858267ed5af4717d96bda' may exist, but is not yet referenced by the worker (KeyError from os.environ[]).
WARNING:toil.leader:B/T/jobNX_Wvk    INFO:cactus.shared.common:Running the command ['netstat', '-tuplen']
WARNING:toil.leader:B/T/jobNX_Wvk    (No info could be read for "-p": geteuid()=98354 but you should be root.)
WARNING:toil.leader:B/T/jobNX_Wvk    INFO:cactus.shared.common:Running the command ['ktserver', '-port', '26666', '-ls', '-tout', '200000', '-th', '64', '-bgs', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-b02a3811-2b63-4208-851d-7815af46a62d/tmp2i3iEe/bbd7502d-1905-454c-8a42-2a91f1f28f96/tTaU1Kr/snapshot', '-bgsc', 'lzo', '-bgsi', '1000000', '-log', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-b02a3811-2b63-4208-851d-7815af46a62d/tmp2i3iEe/bbd7502d-1905-454c-8a42-2a91f1f28f96/tmpFZwVOe.tmp', ':#opts=ls#bnum=30m#msiz=50g#ktopts=p']
WARNING:toil.leader:B/T/jobNX_Wvk    terminate called after throwing an instance of 'std::runtime_error'
WARNING:toil.leader:B/T/jobNX_Wvk      what():  pthread_create
WARNING:toil.leader:B/T/jobNX_Wvk    INFO:toil.lib.bioio:Ktserver running.
WARNING:toil.leader:B/T/jobNX_Wvk    INFO:toil.lib.bioio:Ktserver running.
WARNING:toil.leader:B/T/jobNX_Wvk    INFO:toil.lib.bioio:Ktserver running.
WARNING:toil.leader:B/T/jobNX_Wvk    INFO:cactus.shared.common:Running the command ['ktremotemgr', 'get', '-port', '26666', '-host', '172.16.13.39', 'TERMINATE']
WARNING:toil.leader:B/T/jobNX_Wvk    Process ServerProcess-1:
WARNING:toil.leader:B/T/jobNX_Wvk    Traceback (most recent call last):
WARNING:toil.leader:B/T/jobNX_Wvk      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
WARNING:toil.leader:B/T/jobNX_Wvk        self.run()
WARNING:toil.leader:B/T/jobNX_Wvk      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/pipeline/ktserverControl.py", line 82, in run
WARNING:toil.leader:B/T/jobNX_Wvk        self.tryRun(*self.args, **self.kwargs)
WARNING:toil.leader:B/T/jobNX_Wvk      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/pipeline/ktserverControl.py", line 118, in tryRun
WARNING:toil.leader:B/T/jobNX_Wvk        raise RuntimeError("KTServer failed. Log: %s" % f.read())
WARNING:toil.leader:B/T/jobNX_Wvk    RuntimeError: KTServer failed. Log: 2019-03-08T10:11:46.189587+02:00: [SYSTEM]: ================ [START]: pid=9125
WARNING:toil.leader:B/T/jobNX_Wvk    2019-03-08T10:11:46.189773+02:00: [SYSTEM]: opening a database: path=:#opts=ls#bnum=30m#msiz=50g#ktopts=p
WARNING:toil.leader:B/T/jobNX_Wvk    2019-03-08T10:11:46.191313+02:00: [SYSTEM]: starting the server: expr=:26666
WARNING:toil.leader:B/T/jobNX_Wvk    2019-03-08T10:11:46.191411+02:00: [SYSTEM]: server socket opened: expr=:26666 timeout=200000.0
WARNING:toil.leader:B/T/jobNX_Wvk    2019-03-08T10:11:46.191438+02:00: [SYSTEM]: listening server socket started: fd=4
WARNING:toil.leader:B/T/jobNX_Wvk    
WARNING:toil.leader:B/T/jobNX_Wvk    INFO:cactus.shared.common:Running the command ['ktremotemgr', 'set', '-port', '26666', '-host', '172.16.13.39', 'TERMINATE', '1']
WARNING:toil.leader:B/T/jobNX_Wvk    ktremotemgr: DB::open failed: : 6: network error: connection failed
WARNING:toil.leader:B/T/jobNX_Wvk    Traceback (most recent call last):
WARNING:toil.leader:B/T/jobNX_Wvk      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/worker.py", line 314, in workerScript
WARNING:toil.leader:B/T/jobNX_Wvk        job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
WARNING:toil.leader:B/T/jobNX_Wvk      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/job.py", line 1351, in _runner
WARNING:toil.leader:B/T/jobNX_Wvk        returnValues = self._run(jobGraph, fileStore)
WARNING:toil.leader:B/T/jobNX_Wvk      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/job.py", line 1694, in _run
WARNING:toil.leader:B/T/jobNX_Wvk        returnValues = self.run(fileStore)
WARNING:toil.leader:B/T/jobNX_Wvk      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/job.py", line 1673, in run
WARNING:toil.leader:B/T/jobNX_Wvk        if not service.check():
WARNING:toil.leader:B/T/jobNX_Wvk      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/pipeline/ktserverToil.py", line 55, in check
WARNING:toil.leader:B/T/jobNX_Wvk        raise RuntimeError(msg)
WARNING:toil.leader:B/T/jobNX_Wvk    RuntimeError: Traceback (most recent call last):
WARNING:toil.leader:B/T/jobNX_Wvk      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/pipeline/ktserverControl.py", line 82, in run
WARNING:toil.leader:B/T/jobNX_Wvk        self.tryRun(*self.args, **self.kwargs)
WARNING:toil.leader:B/T/jobNX_Wvk      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/pipeline/ktserverControl.py", line 118, in tryRun
WARNING:toil.leader:B/T/jobNX_Wvk        raise RuntimeError("KTServer failed. Log: %s" % f.read())
WARNING:toil.leader:B/T/jobNX_Wvk    RuntimeError: KTServer failed. Log: 2019-03-08T10:11:46.189587+02:00: [SYSTEM]: ================ [START]: pid=9125
WARNING:toil.leader:B/T/jobNX_Wvk    2019-03-08T10:11:46.189773+02:00: [SYSTEM]: opening a database: path=:#opts=ls#bnum=30m#msiz=50g#ktopts=p
WARNING:toil.leader:B/T/jobNX_Wvk    2019-03-08T10:11:46.191313+02:00: [SYSTEM]: starting the server: expr=:26666
WARNING:toil.leader:B/T/jobNX_Wvk    2019-03-08T10:11:46.191411+02:00: [SYSTEM]: server socket opened: expr=:26666 timeout=200000.0
WARNING:toil.leader:B/T/jobNX_Wvk    2019-03-08T10:11:46.191438+02:00: [SYSTEM]: listening server socket started: fd=4
adamnovak commented 5 years ago

It looks like the Kyoto Tycoon server is starting up, trying to spin up a thread (maybe in response to a connection attempt), and then crashing:

WARNING:toil.leader:B/T/jobNX_Wvk    INFO:cactus.shared.common:Running the command ['ktserver', '-port', '29439', '-ls', '-tout', '200000', '-th', '64', '-bgs', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-6386a5c9-5d92-486a-9720-412b1ca610f6/tmpxersIz/e6b71d4f-cc17-405b-9945-bf74e2503b84/t7jtQpA/snapshot', '-bgsc', 'lzo', '-bgsi', '1000000', '-log', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-f97fff5e-27d1-4f96-a60e-e3618942fc1e-6386a5c9-5d92-486a-9720-412b1ca610f6/tmpxersIz/e6b71d4f-cc17-405b-9945-bf74e2503b84/tmpdU7AZM.tmp', ':#opts=ls#bnum=30m#msiz=50g#ktopts=p']
WARNING:toil.leader:B/T/jobNX_Wvk    terminate called after throwing an instance of 'std::runtime_error'
WARNING:toil.leader:B/T/jobNX_Wvk      what():  pthread_create

Can you manually start up Kyoto Tycoon on one of your cluster nodes and connect to it with ktremotemgr? Is there something weird with your ulimits (like a thread limit imposed by SGE?) that would prevent a thread from being created?

joelarmstrong commented 5 years ago

Yes, just to chime in: the pthread_create failure is definitely the problem here. Toil doesn't give us visibility into whether the service is alive or dead, so the jobs barrel on regardless. Sometimes the cause is a limit on the number of threads (RLIMIT_NPROC, which should be visible when running ulimit -a).
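
These limits can also be checked from inside a running job with Python's standard resource module, which rules out differences between a login shell and the job environment. A minimal sketch:

```python
import resource

# pthread_create fails once the per-user process/thread count hits
# RLIMIT_NPROC, so check it from inside the job environment itself.
nproc_soft, nproc_hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("RLIMIT_NPROC: soft=%s hard=%s" % (nproc_soft, nproc_hard))

# A large finite stack limit also matters: each new thread reserves a
# stack of this size, which can exhaust virtual memory quickly.
stack_soft, stack_hard = resource.getrlimit(resource.RLIMIT_STACK)
print("RLIMIT_STACK: soft=%s hard=%s" % (stack_soft, stack_hard))
```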

amizeranschi commented 5 years ago

@adamnovak and @joelarmstrong, thanks a lot for your input.

I connected to a couple of cluster nodes and manually ran ktserver and ktremotemgr. I used commands as similar as possible to the ones issued by Toil:

ktserver -port 29439 -ls -tout 200000 -th 64 -bgs /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-test-ktserver -bgsc lzo -bgsi 1000000 -log /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-test-ktserver/ktserver.log :#opts=ls#bnum=30m#msiz=50g#ktopts=p

and

ktremotemgr set -port 29439 -host 172.16.13.35 TERMINATE 1

I made sure to set the server host's IP and port correctly. After running ktremotemgr, I see the following in the log file:

$ cat /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-test-ktserver/ktserver.log
2019-03-09T11:35:17.877733+02:00: [SYSTEM]: ================ [START]: pid=9251
2019-03-09T11:35:17.877926+02:00: [SYSTEM]: opening a database: path=:#opts=ls#bnum=30m#msiz=50g#ktopts=p
2019-03-09T11:35:17.879101+02:00: [SYSTEM]: starting the server: expr=:29439
2019-03-09T11:35:17.879201+02:00: [SYSTEM]: server socket opened: expr=:29439 timeout=200000.0
2019-03-09T11:35:17.879226+02:00: [SYSTEM]: listening server socket started: fd=4
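
A raw TCP check, independent of the Kyoto Tycoon tools, can also confirm that the nodes reach each other (a minimal sketch; the host and port are just the ones from the manual test above):

```python
import socket

def can_connect(host, port, timeout=5.0):
    # Try to open a plain TCP connection; any failure (refused,
    # timed out, unreachable) is reported as False.
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except (socket.error, OSError):
        return False

# e.g. can_connect('172.16.13.35', 29439) run from another node should
# return True while the manually started ktserver is up.
```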

The limit on the number of threads seems to be set to around 500,000:

$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 511096
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 16777216
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 511096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
adamnovak commented 5 years ago

Hm. That's definitely weird.

Can you enable core dumps on the system it fails on, and load the core dump from the crashed ktserver into gdb to get a stack trace? Or maybe we could change Cactus to start ktserver under the debugger?

amizeranschi commented 5 years ago

Thanks for following up. I'm not an admin on the cluster that I use, so I can't really enable core dumps there. How can I get Cactus to start ktserver under a debugger?

P.S. I've never really used gdb, so any additional details and advice you'd have about this would be quite helpful.

adamnovak commented 5 years ago

To get the server under GDB you would have to modify the function that generates the server command.

It would have to come out as something like:

['gdb', original_list[0], '-ex', 'run ' + " ".join(original_list[1:]), '-ex', 'bt', '-ex', 'quit 1']

That would run the server with the arguments from the original list (with the command name clipped off) and, if it crashes, print a backtrace. (It also makes the command always appear to Toil to have failed, so you can see the backtrace.)
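
Spelled out as a helper function (a sketch; wrap_in_gdb and original_list are illustrative names, not Cactus code):

```python
def wrap_in_gdb(original_list):
    # Run the original command under gdb non-interactively:
    # 'run' replays the original arguments, 'bt' prints a backtrace
    # once the process dies, and 'quit 1' exits non-zero so Toil
    # treats the job as failed and keeps the log around.
    return ['gdb', original_list[0],
            '-ex', 'run ' + ' '.join(original_list[1:]),
            '-ex', 'bt',
            '-ex', 'quit 1']

# wrap_in_gdb(['ktserver', '-port', '29439', '-ls'])
# → ['gdb', 'ktserver', '-ex', 'run -port 29439 -ls',
#    '-ex', 'bt', '-ex', 'quit 1']
```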

This might be getting a bit off in the weeds, though. Maybe @joelarmstrong has an idea of why ktserver would fail under Cactus and succeed when run manually.

amizeranschi commented 5 years ago

I gave this a try and modified line 218 to

    return ['gdb', cmd[0], '-ex', 'run ' + " ".join(cmd[1:]), '-ex', 'bt', '-ex', 'quit 1']

in the Python scripts src/cactus/pipeline/ktserverControl.py and build/lib/cactus/pipeline/ktserverControl.py.

I'm not sure whether my change had the right effect, because I seem to be getting the same error messages as in the first post of this thread (i.e. I can't see any extra backtrace messages):

INFO:toil.leader:Issued job 'CactusSetupPhase' 4/p/jobO9rolI with job batch system ID: 152 and cores: 1, disk: 2.0 G, and memory: 3.3 G
INFO:toil.leader:Job ended successfully: 'CactusSetupPhase' 4/p/jobO9rolI
WARNING:toil.leader:The job seems to have left a log file, indicating failure: 'CactusSetupPhase' 4/p/jobO9rolI
WARNING:toil.leader:4/p/jobO9rolI    INFO:toil.worker:---TOIL WORKER OUTPUT LOG---
WARNING:toil.leader:4/p/jobO9rolI    INFO:toil:Running Toil version 3.18.0-84239d802248a5f4a220e762b3b8ce5cc92af0be.
WARNING:toil.leader:4/p/jobO9rolI    WARNING:toil.resource:'JTRES_5d2f846cd67858267ed5af4717d96bda' may exist, but is not yet referenced by the worker (KeyError from os.environ[]).
WARNING:toil.leader:4/p/jobO9rolI    INFO:toil.lib.bioio:Sequences in cactus setup: ['simHuman_chr6', 'simMouse_chr6', 'simRat_chr6', 'simCow_chr6', 'simDog_chr6']
WARNING:toil.leader:4/p/jobO9rolI    INFO:toil.lib.bioio:Sequences in cactus setup filenames: ['>id=1|simHuman.chr6|0\n', '>id=0|simMouse.chr6\n', '>id=2|simRat.chr6\n', '>id=4|simCow.chr6|0\n', '>id=3|simDog.chr6|0\n']
WARNING:toil.leader:4/p/jobO9rolI    INFO:cactus.shared.common:Running the command ['cactus_setup', '--speciesTree', '((simHuman_chr6:0.144018,(simMouse_chr6:0.084509,simRat_chr6:0.091589)mr:0.271974)Anc1:0.020593,(simCow_chr6:0.18908,simDog_chr6:0.16303)Anc2:0.032898)Anc0;', '--cactusDisk', '<st_kv_database_conf type="kyoto_tycoon">\n\t\t\t<kyoto_tycoon database_dir="fakepath" host="172.16.13.34" port="19304" />\n\t\t</st_kv_database_conf>\n\t', '--logLevel', 'INFO', '--outgroupEvents', 'simHuman_chr6 simDog_chr6 simCow_chr6', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-94ea01d0-17fb-4275-ab57-79c4fdfad9a5-6386a5c9-5d92-486a-9720-412b1ca610f6/tmpwnTfyQ/e21f7049-ce8e-4be4-ba86-8ff2cd1e7ff0/tmpm9LILA.tmp', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-94ea01d0-17fb-4275-ab57-79c4fdfad9a5-6386a5c9-5d92-486a-9720-412b1ca610f6/tmpwnTfyQ/e21f7049-ce8e-4be4-ba86-8ff2cd1e7ff0/tmpbDBCgg.tmp', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-94ea01d0-17fb-4275-ab57-79c4fdfad9a5-6386a5c9-5d92-486a-9720-412b1ca610f6/tmpwnTfyQ/e21f7049-ce8e-4be4-ba86-8ff2cd1e7ff0/tmpoT0Wya.tmp', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-94ea01d0-17fb-4275-ab57-79c4fdfad9a5-6386a5c9-5d92-486a-9720-412b1ca610f6/tmpwnTfyQ/e21f7049-ce8e-4be4-ba86-8ff2cd1e7ff0/tmpQcr9qd.tmp', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-94ea01d0-17fb-4275-ab57-79c4fdfad9a5-6386a5c9-5d92-486a-9720-412b1ca610f6/tmpwnTfyQ/e21f7049-ce8e-4be4-ba86-8ff2cd1e7ff0/tmp4i0fxN.tmp']
WARNING:toil.leader:4/p/jobO9rolI    Set log level to INFO
WARNING:toil.leader:4/p/jobO9rolI    Flower disk name : <st_kv_database_conf type="kyoto_tycoon">
WARNING:toil.leader:4/p/jobO9rolI               <kyoto_tycoon database_dir="fakepath" host="172.16.13.34" port="19304" />
WARNING:toil.leader:4/p/jobO9rolI           </st_kv_database_conf>
WARNING:toil.leader:4/p/jobO9rolI       
WARNING:toil.leader:4/p/jobO9rolI    Sequence file/directory /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-94ea01d0-17fb-4275-ab57-79c4fdfad9a5-6386a5c9-5d92-486a-9720-412b1ca610f6/tmpwnTfyQ/e21f7049-ce8e-4be4-ba86-8ff2cd1e7ff0/tmpm9LILA.tmp
WARNING:toil.leader:4/p/jobO9rolI    Sequence file/directory /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-94ea01d0-17fb-4275-ab57-79c4fdfad9a5-6386a5c9-5d92-486a-9720-412b1ca610f6/tmpwnTfyQ/e21f7049-ce8e-4be4-ba86-8ff2cd1e7ff0/tmpbDBCgg.tmp
WARNING:toil.leader:4/p/jobO9rolI    Sequence file/directory /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-94ea01d0-17fb-4275-ab57-79c4fdfad9a5-6386a5c9-5d92-486a-9720-412b1ca610f6/tmpwnTfyQ/e21f7049-ce8e-4be4-ba86-8ff2cd1e7ff0/tmpoT0Wya.tmp
WARNING:toil.leader:4/p/jobO9rolI    Sequence file/directory /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-94ea01d0-17fb-4275-ab57-79c4fdfad9a5-6386a5c9-5d92-486a-9720-412b1ca610f6/tmpwnTfyQ/e21f7049-ce8e-4be4-ba86-8ff2cd1e7ff0/tmpQcr9qd.tmp
WARNING:toil.leader:4/p/jobO9rolI    Sequence file/directory /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-94ea01d0-17fb-4275-ab57-79c4fdfad9a5-6386a5c9-5d92-486a-9720-412b1ca610f6/tmpwnTfyQ/e21f7049-ce8e-4be4-ba86-8ff2cd1e7ff0/tmp4i0fxN.tmp
WARNING:toil.leader:4/p/jobO9rolI    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 172.16.13.34 with error: network error
WARNING:toil.leader:4/p/jobO9rolI    Uncaught exception
WARNING:toil.leader:4/p/jobO9rolI    Traceback (most recent call last):
WARNING:toil.leader:4/p/jobO9rolI      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/worker.py", line 314, in workerScript
WARNING:toil.leader:4/p/jobO9rolI        job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
WARNING:toil.leader:4/p/jobO9rolI      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/shared/common.py", line 1096, in _runner
WARNING:toil.leader:4/p/jobO9rolI        super(RoundedJob, self)._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
WARNING:toil.leader:4/p/jobO9rolI      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/job.py", line 1351, in _runner
WARNING:toil.leader:4/p/jobO9rolI        returnValues = self._run(jobGraph, fileStore)
WARNING:toil.leader:4/p/jobO9rolI      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/job.py", line 1296, in _run
WARNING:toil.leader:4/p/jobO9rolI        return self.run(fileStore)
WARNING:toil.leader:4/p/jobO9rolI      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/pipeline/cactus_workflow.py", line 641, in run
WARNING:toil.leader:4/p/jobO9rolI        makeEventHeadersAlphaNumeric=self.getOptionalPhaseAttrib("makeEventHeadersAlphaNumeric", bool, False))
WARNING:toil.leader:4/p/jobO9rolI      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/shared/common.py", line 220, in runCactusSetup
WARNING:toil.leader:4/p/jobO9rolI        parameters=["cactus_setup"] + args + sequences)
WARNING:toil.leader:4/p/jobO9rolI      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/shared/common.py", line 1040, in cactus_call
WARNING:toil.leader:4/p/jobO9rolI        raise RuntimeError("Command %s failed with output: %s" % (call, output))
WARNING:toil.leader:4/p/jobO9rolI    RuntimeError: Command ['cactus_setup', '--speciesTree', '((simHuman_chr6:0.144018,(simMouse_chr6:0.084509,simRat_chr6:0.091589)mr:0.271974)Anc1:0.020593,(simCow_chr6:0.18908,simDog_chr6:0.16303)Anc2:0.032898)Anc0;', '--cactusDisk', '<st_kv_database_conf type="kyoto_tycoon">\n\t\t\t<kyoto_tycoon database_dir="fakepath" host="172.16.13.34" port="19304" />\n\t\t</st_kv_database_conf>\n\t', '--logLevel', 'INFO', '--outgroupEvents', 'simHuman_chr6 simDog_chr6 simCow_chr6', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-94ea01d0-17fb-4275-ab57-79c4fdfad9a5-6386a5c9-5d92-486a-9720-412b1ca610f6/tmpwnTfyQ/e21f7049-ce8e-4be4-ba86-8ff2cd1e7ff0/tmpm9LILA.tmp', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-94ea01d0-17fb-4275-ab57-79c4fdfad9a5-6386a5c9-5d92-486a-9720-412b1ca610f6/tmpwnTfyQ/e21f7049-ce8e-4be4-ba86-8ff2cd1e7ff0/tmpbDBCgg.tmp', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-94ea01d0-17fb-4275-ab57-79c4fdfad9a5-6386a5c9-5d92-486a-9720-412b1ca610f6/tmpwnTfyQ/e21f7049-ce8e-4be4-ba86-8ff2cd1e7ff0/tmpoT0Wya.tmp', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-94ea01d0-17fb-4275-ab57-79c4fdfad9a5-6386a5c9-5d92-486a-9720-412b1ca610f6/tmpwnTfyQ/e21f7049-ce8e-4be4-ba86-8ff2cd1e7ff0/tmpQcr9qd.tmp', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-94ea01d0-17fb-4275-ab57-79c4fdfad9a5-6386a5c9-5d92-486a-9720-412b1ca610f6/tmpwnTfyQ/e21f7049-ce8e-4be4-ba86-8ff2cd1e7ff0/tmp4i0fxN.tmp'] failed with output: 
WARNING:toil.leader:4/p/jobO9rolI    ERROR:toil.worker:Exiting the worker because of a failed job on host haswell-wn37.grid.pub.ro
WARNING:toil.leader:4/p/jobO9rolI    WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'CactusSetupPhase' 4/p/jobO9rolI with ID 4/p/jobO9rolI to 5
INFO:toil.leader:Issued job 'CactusSetupPhase' 4/p/jobO9rolI with job batch system ID: 153 and cores: 1, disk: 2.0 G, and memory: 3.3 G
INFO:toil.leader:Job ended successfully: 'KtServerService' 4/n/job2cy6_X
WARNING:toil.leader:The job seems to have left a log file, indicating failure: 'KtServerService' 4/n/job2cy6_X
WARNING:toil.leader:4/n/job2cy6_X    INFO:toil.worker:---TOIL WORKER OUTPUT LOG---
WARNING:toil.leader:4/n/job2cy6_X    INFO:toil:Running Toil version 3.18.0-84239d802248a5f4a220e762b3b8ce5cc92af0be.
WARNING:toil.leader:4/n/job2cy6_X    WARNING:toil.resource:'JTRES_5d2f846cd67858267ed5af4717d96bda' may exist, but is not yet referenced by the worker (KeyError from os.environ[]).
WARNING:toil.leader:4/n/job2cy6_X    WARNING:toil.resource:'JTRES_5d2f846cd67858267ed5af4717d96bda' may exist, but is not yet referenced by the worker (KeyError from os.environ[]).
WARNING:toil.leader:4/n/job2cy6_X    INFO:cactus.shared.common:Running the command ['netstat', '-tuplen']
WARNING:toil.leader:4/n/job2cy6_X    (No info could be read for "-p": geteuid()=98354 but you should be root.)
WARNING:toil.leader:4/n/job2cy6_X    INFO:cactus.shared.common:Running the command ['ktserver', '-port', '19304', '-ls', '-tout', '200000', '-th', '64', '-bgs', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-94ea01d0-17fb-4275-ab57-79c4fdfad9a5-906375ca-4612-4b9c-aace-ad64276eee7f/tmpgECj07/f55c3afe-5488-482b-aef1-ec94770f96a6/tMWAk0n/snapshot', '-bgsc', 'lzo', '-bgsi', '1000000', '-log', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-94ea01d0-17fb-4275-ab57-79c4fdfad9a5-906375ca-4612-4b9c-aace-ad64276eee7f/tmpgECj07/f55c3afe-5488-482b-aef1-ec94770f96a6/tmpBPs61G.tmp', ':#opts=ls#bnum=30m#msiz=50g#ktopts=p']
WARNING:toil.leader:4/n/job2cy6_X    terminate called after throwing an instance of 'std::runtime_error'
WARNING:toil.leader:4/n/job2cy6_X      what():  pthread_create
WARNING:toil.leader:4/n/job2cy6_X    INFO:toil.lib.bioio:Ktserver running.
WARNING:toil.leader:4/n/job2cy6_X    INFO:toil.lib.bioio:Ktserver running.
WARNING:toil.leader:4/n/job2cy6_X    INFO:toil.lib.bioio:Ktserver running.
WARNING:toil.leader:4/n/job2cy6_X    INFO:cactus.shared.common:Running the command ['ktremotemgr', 'get', '-port', '19304', '-host', '172.16.13.34', 'TERMINATE']
WARNING:toil.leader:4/n/job2cy6_X    Process ServerProcess-1:
WARNING:toil.leader:4/n/job2cy6_X    Traceback (most recent call last):
WARNING:toil.leader:4/n/job2cy6_X      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
WARNING:toil.leader:4/n/job2cy6_X        self.run()
WARNING:toil.leader:4/n/job2cy6_X      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/pipeline/ktserverControl.py", line 82, in run
WARNING:toil.leader:4/n/job2cy6_X        self.tryRun(*self.args, **self.kwargs)
WARNING:toil.leader:4/n/job2cy6_X      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/pipeline/ktserverControl.py", line 118, in tryRun
WARNING:toil.leader:4/n/job2cy6_X        raise RuntimeError("KTServer failed. Log: %s" % f.read())
WARNING:toil.leader:4/n/job2cy6_X    RuntimeError: KTServer failed. Log: 2019-03-12T17:15:39.542399+02:00: [SYSTEM]: ================ [START]: pid=2688
WARNING:toil.leader:4/n/job2cy6_X    2019-03-12T17:15:39.542587+02:00: [SYSTEM]: opening a database: path=:#opts=ls#bnum=30m#msiz=50g#ktopts=p
WARNING:toil.leader:4/n/job2cy6_X    2019-03-12T17:15:39.544186+02:00: [SYSTEM]: starting the server: expr=:19304
WARNING:toil.leader:4/n/job2cy6_X    2019-03-12T17:15:39.544289+02:00: [SYSTEM]: server socket opened: expr=:19304 timeout=200000.0
WARNING:toil.leader:4/n/job2cy6_X    2019-03-12T17:15:39.544316+02:00: [SYSTEM]: listening server socket started: fd=4
WARNING:toil.leader:4/n/job2cy6_X    
WARNING:toil.leader:4/n/job2cy6_X    INFO:cactus.shared.common:Running the command ['ktremotemgr', 'set', '-port', '19304', '-host', '172.16.13.34', 'TERMINATE', '1']
WARNING:toil.leader:4/n/job2cy6_X    ktremotemgr: DB::open failed: : 6: network error: connection failed
WARNING:toil.leader:4/n/job2cy6_X    Traceback (most recent call last):
WARNING:toil.leader:4/n/job2cy6_X      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/worker.py", line 314, in workerScript
WARNING:toil.leader:4/n/job2cy6_X        job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
WARNING:toil.leader:4/n/job2cy6_X      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/job.py", line 1351, in _runner
WARNING:toil.leader:4/n/job2cy6_X        returnValues = self._run(jobGraph, fileStore)
WARNING:toil.leader:4/n/job2cy6_X      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/job.py", line 1694, in _run
WARNING:toil.leader:4/n/job2cy6_X        returnValues = self.run(fileStore)
WARNING:toil.leader:4/n/job2cy6_X      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/job.py", line 1673, in run
WARNING:toil.leader:4/n/job2cy6_X        if not service.check():
WARNING:toil.leader:4/n/job2cy6_X      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/pipeline/ktserverToil.py", line 55, in check
WARNING:toil.leader:4/n/job2cy6_X        raise RuntimeError(msg)
WARNING:toil.leader:4/n/job2cy6_X    RuntimeError: Traceback (most recent call last):
WARNING:toil.leader:4/n/job2cy6_X      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/pipeline/ktserverControl.py", line 82, in run
WARNING:toil.leader:4/n/job2cy6_X        self.tryRun(*self.args, **self.kwargs)
WARNING:toil.leader:4/n/job2cy6_X      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/pipeline/ktserverControl.py", line 118, in tryRun
WARNING:toil.leader:4/n/job2cy6_X        raise RuntimeError("KTServer failed. Log: %s" % f.read())
WARNING:toil.leader:4/n/job2cy6_X    RuntimeError: KTServer failed. Log: 2019-03-12T17:15:39.542399+02:00: [SYSTEM]: ================ [START]: pid=2688
WARNING:toil.leader:4/n/job2cy6_X    2019-03-12T17:15:39.542587+02:00: [SYSTEM]: opening a database: path=:#opts=ls#bnum=30m#msiz=50g#ktopts=p
WARNING:toil.leader:4/n/job2cy6_X    2019-03-12T17:15:39.544186+02:00: [SYSTEM]: starting the server: expr=:19304
WARNING:toil.leader:4/n/job2cy6_X    2019-03-12T17:15:39.544289+02:00: [SYSTEM]: server socket opened: expr=:19304 timeout=200000.0
WARNING:toil.leader:4/n/job2cy6_X    2019-03-12T17:15:39.544316+02:00: [SYSTEM]: listening server socket started: fd=4
amizeranschi commented 5 years ago

I was also wondering: could these problems be related to the errors I got a few days ago (java.lang.OutOfMemoryError: unable to create new native thread) while running Bcbio-nextgen with Toil on the same SGE cluster?

https://github.com/bcbio/bcbio-nextgen/issues/2697#issuecomment-469224159

adamnovak commented 5 years ago

This issue might well be related to that other thread-creation failure. Apparently Java likes to throw OutOfMemoryError when it runs out of other OS resources, such as threads.

When you run ulimit -a to get the thread limits, is it running in exactly the same environment as Toil's jobs are? Can you schedule it (or a Bash script that runs it, since it is a Bash builtin) through the queue? It's possible that your Grid Engine admins have worked around https://stackoverflow.com/q/37386687 and are using ulimits, set from the scheduler, to restrict jobs to the number of cores they actually ask for. Since the CPU requirement for the KTServer jobs in Cactus seems to default to 0.1 (i.e. 10% of a core), that might not be working well here.
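
For example, a throwaway job script (hypothetical name check_limits.sh) submitted through qsub would show the limits a scheduled job actually gets:

```shell
#!/bin/bash
# check_limits.sh -- submit with: qsub check_limits.sh
# Print the limits as seen *inside* a scheduled job; these can differ
# from a login shell if the scheduler sets per-job rlimits.
ulimit -a

# The two values most relevant to pthread_create failures:
echo "max user processes: $(ulimit -u)"
echo "stack size: $(ulimit -s)"
```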

You should try upping the cpu attribute of the ktserver element in the Cactus XML config file (as accessed here). I think you have to copy and modify the file here and then pass it with the --configFile option to Cactus.

You could also maybe check if ulimit -a run via Grid Engine reports a set but large stack size limit. Apparently that can make pthreads over-allocate memory.

In terms of instrumenting Cactus, it looks like your modification to the command didn't take:

WARNING:toil.leader:4/n/job2cy6_X    INFO:cactus.shared.common:Running the command ['ktserver', '-port', '19304', '-ls', '-tout', '200000', '-th', '64', '-bgs', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-94ea01d0-17fb-4275-ab57-79c4fdfad9a5-906375ca-4612-4b9c-aace-ad64276eee7f/tmpgECj07/f55c3afe-5488-482b-aef1-ec94770f96a6/tMWAk0n/snapshot', '-bgsc', 'lzo', '-bgsi', '1000000', '-log', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-94ea01d0-17fb-4275-ab57-79c4fdfad9a5-906375ca-4612-4b9c-aace-ad64276eee7f/tmpgECj07/f55c3afe-5488-482b-aef1-ec94770f96a6/tmpBPs61G.tmp', ':#opts=ls#bnum=30m#msiz=50g#ktopts=p']

It should have worked; are you sure the modified Cactus sources were actually installed?

Also, you probably want to stick a catch throw in there, so GDB will stop where the exception is generated and not when it figures out that it won't be caught:

return ['gdb', cmd[0], '-ex', 'catch throw', '-ex', 'run ' + " ".join(cmd[1:]), '-ex', 'bt', '-ex', 'quit 1']

If you can get it to actually run the command, but still can't see the output in the Toil log, you could try passing outfile="somefilename.txt" here and then read that file's contents and cram them in the error message as with the server log here.
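
The pattern would look roughly like this (a sketch with hypothetical names; the real hook points are the cactus_call invocation and the server-log read linked above):

```python
import subprocess

def run_and_capture(cmd, outfile):
    # Redirect both stdout and stderr into outfile, then surface the
    # file's contents in the exception so the text reaches the Toil log.
    with open(outfile, 'w') as out:
        ret = subprocess.call(cmd, stdout=out, stderr=subprocess.STDOUT)
    if ret != 0:
        with open(outfile) as f:
            raise RuntimeError("Command %s failed. Output:\n%s"
                               % (cmd, f.read()))
```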

amizeranschi commented 5 years ago

The ulimit for stack size is set to unlimited, and the thread limit is around 500k. I checked via a script submitted with qsub to the Grid Engine queue that I'm using.

The gdb modifications not working was my fault: I forgot to recompile and reinstall Cactus via Pip after changing that line of code. I've inserted a catch throw and rerun, and this time I get different errors, including a gdb stack trace:

WARNING:toil.leader:3/X/job_e_HFr    INFO:cactus.shared.common:Running the command ['netstat', '-tuplen']
WARNING:toil.leader:3/X/job_e_HFr    (No info could be read for "-p": geteuid()=98354 but you should be root.)
WARNING:toil.leader:3/X/job_e_HFr    INFO:cactus.shared.common:Running the command ['gdb', 'ktserver', '-ex', 'catch throw', '-ex', u'run -port 4297 -ls -tout 200000 -th 64 -bgs /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-71bce680-8ad2-437e-84a1-02f7ed948ac7-2bc2432e-0437-4e28-8030-69bdeb17d728/tmpwAZsQI/b7208c4c-28f3-4ffa-a2e5-78588fb2ba76/t6aN4Io/snapshot -bgsc lzo -bgsi 1000000 -log /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-71bce680-8ad2-437e-84a1-02f7ed948ac7-2bc2432e-0437-4e28-8030-69bdeb17d728/tmpwAZsQI/b7208c4c-28f3-4ffa-a2e5-78588fb2ba76/tmp7SU8Ye.tmp :#opts=ls#bnum=30m#msiz=50g#ktopts=p', '-ex', 'bt', '-ex', 'quit 1']
WARNING:toil.leader:3/X/job_e_HFr    GNU gdb (GDB) 8.2
WARNING:toil.leader:3/X/job_e_HFr    Copyright (C) 2018 Free Software Foundation, Inc.
WARNING:toil.leader:3/X/job_e_HFr    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
WARNING:toil.leader:3/X/job_e_HFr    This is free software: you are free to change and redistribute it.
WARNING:toil.leader:3/X/job_e_HFr    There is NO WARRANTY, to the extent permitted by law.
WARNING:toil.leader:3/X/job_e_HFr    Type "show copying" and "show warranty" for details.
WARNING:toil.leader:3/X/job_e_HFr    This GDB was configured as "x86_64-pc-linux-gnu".
WARNING:toil.leader:3/X/job_e_HFr    Type "show configuration" for configuration details.
WARNING:toil.leader:3/X/job_e_HFr    For bug reporting instructions, please see:
WARNING:toil.leader:3/X/job_e_HFr    <http://www.gnu.org/software/gdb/bugs/>.
WARNING:toil.leader:3/X/job_e_HFr    Find the GDB manual and other documentation resources online at:
WARNING:toil.leader:3/X/job_e_HFr        <http://www.gnu.org/software/gdb/documentation/>.
WARNING:toil.leader:3/X/job_e_HFr    
WARNING:toil.leader:3/X/job_e_HFr    For help, type "help".
WARNING:toil.leader:3/X/job_e_HFr    Type "apropos word" to search for commands related to "word"...
WARNING:toil.leader:3/X/job_e_HFr    Reading symbols from ktserver...(no debugging symbols found)...done.
WARNING:toil.leader:3/X/job_e_HFr    Catchpoint 1 (throw)
WARNING:toil.leader:3/X/job_e_HFr    Starting program: /export/home/ncit/external/a.mizeranschi/progressiveCactus/submodules/kyototycoon/bin/ktserver -port 4297 -ls -tout 200000 -th 64 -bgs /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-71bce680-8ad2-437e-84a1-02f7ed948ac7-2bc2432e-0437-4e28-8030-69bdeb17d728/tmpwAZsQI/b7208c4c-28f3-4ffa-a2e5-78588fb2ba76/t6aN4Io/snapshot -bgsc lzo -bgsi 1000000 -log /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-71bce680-8ad2-437e-84a1-02f7ed948ac7-2bc2432e-0437-4e28-8030-69bdeb17d728/tmpwAZsQI/b7208c4c-28f3-4ffa-a2e5-78588fb2ba76/tmp7SU8Ye.tmp :#opts=ls#bnum=30m#msiz=50g#ktopts=p
WARNING:toil.leader:3/X/job_e_HFr    [Thread debugging using libthread_db enabled]
WARNING:toil.leader:3/X/job_e_HFr    Using host libthread_db library "/lib64/libthread_db.so.1".
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabc622700 (LWP 5989)]
WARNING:toil.leader:3/X/job_e_HFr    [Thread 0x2aaabc622700 (LWP 5989) exited]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabc823700 (LWP 5990)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabca24700 (LWP 5991)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabcc25700 (LWP 5992)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabce26700 (LWP 5993)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabd027700 (LWP 5995)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabd228700 (LWP 5996)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabd429700 (LWP 5997)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabd62a700 (LWP 5998)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabd82b700 (LWP 5999)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabda2c700 (LWP 6000)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabdc2d700 (LWP 6001)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabde2e700 (LWP 6002)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabe02f700 (LWP 6003)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabe230700 (LWP 6004)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabe431700 (LWP 6005)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabe632700 (LWP 6006)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabe833700 (LWP 6007)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabea34700 (LWP 6008)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabec35700 (LWP 6009)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabee36700 (LWP 6010)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabf037700 (LWP 6011)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabf238700 (LWP 6012)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabf439700 (LWP 6013)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabf63a700 (LWP 6014)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabf83b700 (LWP 6015)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabfa3c700 (LWP 6016)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabfc3d700 (LWP 6017)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aaabfe3e700 (LWP 6018)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab30200700 (LWP 6019)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab30401700 (LWP 6020)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab30602700 (LWP 6021)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab30803700 (LWP 6022)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab30a04700 (LWP 6023)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab30c05700 (LWP 6024)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab30e06700 (LWP 6025)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab31007700 (LWP 6026)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab31208700 (LWP 6027)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab31409700 (LWP 6028)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab3160a700 (LWP 6029)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab3180b700 (LWP 6030)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab31a0c700 (LWP 6031)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab31c0d700 (LWP 6032)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab31e0e700 (LWP 6033)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab3200f700 (LWP 6034)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab32210700 (LWP 6035)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab32411700 (LWP 6036)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab32612700 (LWP 6037)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab32813700 (LWP 6038)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab32a14700 (LWP 6039)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab32c15700 (LWP 6040)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab32e16700 (LWP 6041)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab33017700 (LWP 6042)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab33218700 (LWP 6043)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab33419700 (LWP 6044)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab3361a700 (LWP 6045)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab3381b700 (LWP 6046)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab33a1c700 (LWP 6047)]
WARNING:toil.leader:3/X/job_e_HFr    [New Thread 0x2aab33c1d700 (LWP 6048)]
WARNING:toil.leader:3/X/job_e_HFr    
WARNING:toil.leader:3/X/job_e_HFr    Thread 1 "ktserver" hit Catchpoint 1 (exception thrown), 0x00002aaaab264b2d in __cxa_throw () from /lib64/libstdc++.so.6
WARNING:toil.leader:3/X/job_e_HFr    #0  0x00002aaaab264b2d in __cxa_throw () from /lib64/libstdc++.so.6
WARNING:toil.leader:3/X/job_e_HFr    #1  0x00002aaaaad14257 in kyotocabinet::Thread::start() ()
WARNING:toil.leader:3/X/job_e_HFr       from /export/home/ncit/external/a.mizeranschi/progressiveCactus/submodules/kyotocabinet/lib/libkyotocabinet.so.16
WARNING:toil.leader:3/X/job_e_HFr    #2  0x000000000044f29e in kyotocabinet::TaskQueue::start(unsigned long) ()
WARNING:toil.leader:3/X/job_e_HFr    #3  0x00000000004ea977 in kyototycoon::ThreadedServer::start() ()
WARNING:toil.leader:3/X/job_e_HFr    #4  0x00000000004322dc in proc(std::vector<std::string, std::allocator<std::string> > const&, char const*, int, double, int, char const*, unsigned int, char const*, long, double, int, int, double, bool, char const*, double, kyotocabinet::Compressor*, bool, char const*, char const*, char const*, char const*, int, char const*, double, char const*, char const*, char const*) ()
WARNING:toil.leader:3/X/job_e_HFr    #5  0x000000000042a050 in main ()
WARNING:toil.leader:3/X/job_e_HFr    A debugging session is active.
WARNING:toil.leader:3/X/job_e_HFr    
WARNING:toil.leader:3/X/job_e_HFr       Inferior 1 [process 5985] will be killed.
WARNING:toil.leader:3/X/job_e_HFr    
WARNING:toil.leader:3/X/job_e_HFr    Quit anyway? (y or n) [answered Y; input not from terminal]
WARNING:toil.leader:3/X/job_e_HFr    INFO:toil.lib.bioio:Ktserver running.
WARNING:toil.leader:3/X/job_e_HFr    INFO:toil.lib.bioio:Ktserver running.
WARNING:toil.leader:3/X/job_e_HFr    INFO:toil.lib.bioio:Ktserver running.
WARNING:toil.leader:3/X/job_e_HFr    INFO:cactus.shared.common:Running the command ['ktremotemgr', 'get', '-port', '4297', '-host', '172.16.13.38', 'TERMINATE']
WARNING:toil.leader:3/X/job_e_HFr    Process ServerProcess-1:
WARNING:toil.leader:3/X/job_e_HFr    Traceback (most recent call last):
WARNING:toil.leader:3/X/job_e_HFr      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
WARNING:toil.leader:3/X/job_e_HFr        self.run()
WARNING:toil.leader:3/X/job_e_HFr      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/pipeline/ktserverControl.py", line 82, in run
WARNING:toil.leader:3/X/job_e_HFr        self.tryRun(*self.args, **self.kwargs)
WARNING:toil.leader:3/X/job_e_HFr      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/pipeline/ktserverControl.py", line 118, in tryRun
WARNING:toil.leader:3/X/job_e_HFr        raise RuntimeError("KTServer failed. Log: %s" % f.read())
WARNING:toil.leader:3/X/job_e_HFr    RuntimeError: KTServer failed. Log: 2019-03-15T01:49:41.766586+02:00: [SYSTEM]: ================ [START]: pid=5985
WARNING:toil.leader:3/X/job_e_HFr    2019-03-15T01:49:41.766769+02:00: [SYSTEM]: opening a database: path=:#opts=ls#bnum=30m#msiz=50g#ktopts=p
WARNING:toil.leader:3/X/job_e_HFr    2019-03-15T01:49:41.768716+02:00: [SYSTEM]: starting the server: expr=:4297
WARNING:toil.leader:3/X/job_e_HFr    2019-03-15T01:49:41.768834+02:00: [SYSTEM]: server socket opened: expr=:4297 timeout=200000.0
WARNING:toil.leader:3/X/job_e_HFr    2019-03-15T01:49:41.768861+02:00: [SYSTEM]: listening server socket started: fd=4
WARNING:toil.leader:3/X/job_e_HFr    
WARNING:toil.leader:3/X/job_e_HFr    INFO:cactus.shared.common:Running the command ['ktremotemgr', 'set', '-port', '4297', '-host', '172.16.13.38', 'TERMINATE', '1']
WARNING:toil.leader:3/X/job_e_HFr    ktremotemgr: DB::open failed: : 6: network error: connection failed
WARNING:toil.leader:3/X/job_e_HFr    Traceback (most recent call last):
WARNING:toil.leader:3/X/job_e_HFr      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/worker.py", line 314, in workerScript
WARNING:toil.leader:3/X/job_e_HFr        job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
WARNING:toil.leader:3/X/job_e_HFr      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/job.py", line 1351, in _runner
WARNING:toil.leader:3/X/job_e_HFr        returnValues = self._run(jobGraph, fileStore)
WARNING:toil.leader:3/X/job_e_HFr      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/job.py", line 1694, in _run
WARNING:toil.leader:3/X/job_e_HFr        returnValues = self.run(fileStore)
WARNING:toil.leader:3/X/job_e_HFr      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/job.py", line 1673, in run
WARNING:toil.leader:3/X/job_e_HFr        if not service.check():
WARNING:toil.leader:3/X/job_e_HFr      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/pipeline/ktserverToil.py", line 55, in check
WARNING:toil.leader:3/X/job_e_HFr        raise RuntimeError(msg)
WARNING:toil.leader:3/X/job_e_HFr    RuntimeError: Traceback (most recent call last):
WARNING:toil.leader:3/X/job_e_HFr      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/pipeline/ktserverControl.py", line 82, in run
WARNING:toil.leader:3/X/job_e_HFr        self.tryRun(*self.args, **self.kwargs)
WARNING:toil.leader:3/X/job_e_HFr      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/pipeline/ktserverControl.py", line 118, in tryRun
WARNING:toil.leader:3/X/job_e_HFr        raise RuntimeError("KTServer failed. Log: %s" % f.read())
WARNING:toil.leader:3/X/job_e_HFr    RuntimeError: KTServer failed. Log: 2019-03-15T01:49:41.766586+02:00: [SYSTEM]: ================ [START]: pid=5985
WARNING:toil.leader:3/X/job_e_HFr    2019-03-15T01:49:41.766769+02:00: [SYSTEM]: opening a database: path=:#opts=ls#bnum=30m#msiz=50g#ktopts=p
WARNING:toil.leader:3/X/job_e_HFr    2019-03-15T01:49:41.768716+02:00: [SYSTEM]: starting the server: expr=:4297
WARNING:toil.leader:3/X/job_e_HFr    2019-03-15T01:49:41.768834+02:00: [SYSTEM]: server socket opened: expr=:4297 timeout=200000.0
WARNING:toil.leader:3/X/job_e_HFr    2019-03-15T01:49:41.768861+02:00: [SYSTEM]: listening server socket started: fd=4
WARNING:toil.leader:3/X/job_e_HFr    
WARNING:toil.leader:3/X/job_e_HFr    
WARNING:toil.leader:3/X/job_e_HFr    ERROR:toil.worker:Exiting the worker because of a failed job on host haswell-wn38.grid.pub.ro
WARNING:toil.leader:3/X/job_e_HFr    WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'KtServerService' 3/X/job_e_HFr with ID 3/X/job_e_HFr to 0
WARNING:toil.leader:Job 'KtServerService' 3/X/job_e_HFr with ID 3/X/job_e_HFr is completely failed
INFO:toil.leader:Finished toil run with 10 failed jobs.
INFO:toil.leader:Failed jobs at end of the run: 'ProgressiveDown' q/x/jobCV4JEu 'CactusSetupPhase' T/z/jobBha6Yy 'ProgressiveUp' 0/d/jobklSqTp 'CactusSetupCheckpoint' z/p/jobKUHwGl 'CactusTrimmingBlastPhase' g/z/joboWXHKb 'KtServerService' 3/X/job_e_HFr 'RunCactusPreprocessorThenProgressiveDown2' v/3/jobIgI2VQ 'RunCactusPreprocessorThenProgressiveDown' M/m/job_ULHfN 'ProgressiveNext' H/5/jobLjT87d 'StartPrimaryDB' G/j/jobzdnldr
Traceback (most recent call last):
  File "/export/home/ncit/external/a.mizeranschi/toil_conda/bin/cactus", line 11, in <module>
    sys.exit(main())
  File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/progressive/cactus_progressive.py", line 520, in main
    halID = toil.start(RunCactusPreprocessorThenProgressiveDown(options, project, memory=configWrapper.getDefaultMemory()))
  File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/common.py", line 784, in start
    return self._runMainLoop(rootJobGraph)
  File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/common.py", line 1059, in _runMainLoop
    jobCache=self._jobCache).run()
  File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/leader.py", line 237, in run
    raise FailedJobsException(self.config.jobStore, self.toilState.totalFailedJobs, self.jobStore)
toil.leader.FailedJobsException
adamnovak commented 5 years ago

OK, I've instrumented the failing function in a fork of Kyoto Cabinet: https://github.com/adamnovak/kyotocabinet

Can you go back to the original Cactus (so the exception message will appear) and replace your Kyoto Cabinet library with the version I patched? It should get loaded by Kyoto Tycoon, and do some investigating of the environment when threads are created, and report a bunch of output like:

Stack limits: 8388608 Infinity
Process limits: 4096 4128203
Kernel thread cap: 8256406
Current total threads running: 2479

It should also explain more clearly what goes wrong when pthread_create fails. Also, it sets a stack size for the created threads, so if the default is somehow implausibly large, the threads will still end up with reasonable stack sizes.
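The same idea can be sketched in Python, where the interpreter exposes an explicit per-thread stack-size knob (illustrative only; the actual patch is in the C++ Kyoto Cabinet code):

```python
import threading

# Cap the stack reserved for each new thread. With `ulimit -s unlimited`,
# pthreads may otherwise pick a huge default stack per thread, which can
# exhaust resource limits long before the kernel thread cap is reached.
threading.stack_size(8 * 1024 * 1024)  # 8 MiB, the usual Linux default

t = threading.Thread(target=lambda: print("thread ran"))
t.start()
t.join()
```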

amizeranschi commented 5 years ago

Thanks for still looking into this.

I recompiled Cactus with the modified version of Kyoto Cabinet and I could then see those output messages (Stack limits, Process limits etc.) in the logs.

There are two occurrences of the following:

WARNING:toil.leader:3/R/jobQbZW4w    Current total threads running: 572
WARNING:toil.leader:3/R/jobQbZW4w    terminate called after throwing an instance of 'std::runtime_error'
WARNING:toil.leader:3/R/jobQbZW4w      what():  pthread_create: EAGAIN: Insufficient resources to create thread, or thread limit reached

Any idea what could cause this? The value from /proc/sys/kernel/threads-max seems like it should allow for a lot more threads to be running.

$ cat /proc/sys/kernel/threads-max
1022192
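For what it's worth, the kernel-wide threads-max is usually not the binding limit: EAGAIN from pthread_create more often reflects the per-user RLIMIT_NPROC (or address-space exhaustion) than the kernel cap. A quick way to check both values from Python (Linux-oriented sketch):

```python
import resource

# Per-user cap on processes + threads; hitting this yields EAGAIN from
# pthread_create even when kernel.threads-max is nowhere near exhausted.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("RLIMIT_NPROC (soft, hard):", soft, hard)

# Kernel-wide ceiling, the same number as `cat /proc/sys/kernel/threads-max`.
try:
    with open("/proc/sys/kernel/threads-max") as f:
        print("kernel threads-max:", f.read().strip())
except FileNotFoundError:  # non-Linux systems have no /proc
    pass
```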

Here's the full output of this first test that I've run: https://pastebin.com/MG1skiRG.

I saw people mentioning here and here that it could be worth setting a stack size limit (ulimit -s) to something like 8192. It was originally set to unlimited.

I gave this a try, but it didn't help. After setting the stack size limit, I could no longer see the Kyoto Cabinet debug messages or the pthread_create error.

Here is the full output of this second test that I've run: https://pastebin.com/nEeQqzPB.

adamnovak commented 5 years ago

Setting the stack ulimit shouldn't help with my patched library, since I (try to) set the thread stack sizes in it, and I don't think it should behave badly with unlimited either.

It looks like when you limit the stack size it breaks cPecanLastz, which I think runs before the step that has been failing. So that's why there's no debug output.

It looks like, when Kyoto Tycoon does run, it is always managing to create 60 threads before running out of resources to create more (whatever those resources are). The server is being started with -th 64, which is hardcoded here. Also, you are passing --maxCores 32 to the workflow as a whole, and the server appears to only be asking for 0.1 cores as its job requirements, so 64 threads is probably excessive.

Try setting -th 32 on that line in the Cactus code. We've seen it successfully start 60 threads before hitting whatever limit it's hitting in your environment, so it should be able to start 32 just fine. Hopefully that will work, and then you can start taking out the other workarounds to get a minimal change.

amizeranschi commented 5 years ago

OK, I've changed line 201 in the file cactus/src/cactus/pipeline/ktserverControl.py to

 serverOptions = "-ls -tout 200000 -th 32"

and recompiled and reinstalled Cactus via pip. I did have to make one extra change compared to before, and that was to download and compile zlib-1.2.11 and add it to LIBRARY_PATH, as this wasn't available on the cluster node where I could do the compiling (and the nodes that I initially used for compiling stuff aren't available right now).

I then ran Cactus, but none of the Kyoto Cabinet debug messages got printed anymore, and the 'KtServerService' jobs kept failing. I also couldn't see the line INFO:cactus.shared.common:Running the command ['ktserver', '-port', ...] anywhere.

Could the zlib version have made such a big difference here, or is there something else entirely going on?

The full output is here: https://pastebin.com/Kmrzg3nR.

amizeranschi commented 5 years ago

To see whether the compilation environment (incl. the zlib version) makes a difference, I've recompiled both progressiveCactus and Cactus from the same cluster node as before, using the local zlib install I mentioned above. The queue nodes that I was previously using for compiling seem to be in heavy use, so I'm not likely to get access to reproduce things there any time soon.

I got the progressiveCactus sources from Git, installed the Kyoto Cabinet version that you modified into the submodules directory, compiled everything and then sourced the progressiveCactus/environment file. I then downloaded and compiled the newer version of Cactus, without any modifications to the code.

I ran a test with this and, again, I'm not seeing the Kyoto Cabinet output messages (Stack limits, Process limits etc.) in the logs. What could have happened to cause this? Would the compilation environment have such an impact?

The full output is here: https://pastebin.com/uajrJYkt.

adamnovak commented 5 years ago

It looks like both the logs you posted show the step where cPecanLastz gets run failing, something like:

WARNING:toil.leader:x/i/job8eXgYy    INFO:cactus.shared.common:Running the command ['cPecanLastz', '--format=cigar', '--notrivial', '--step=1', '--ambiguous=iupac,100,100', '--ydrop=3000', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-c4cbf429-5780-4537-8bef-fbab73d6b48a-88711676-5948-47d0-acd3-569974301115/tmp7ZhAPX/41e45cc5-a82e-463d-b76f-de9aea6b94a8/tmpXS08uE.tmp[multiple][nameparse=darkspace]', u'/export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-c4cbf429-5780-4537-8bef-fbab73d6b48a-88711676-5948-47d0-acd3-569974301115/tmp7ZhAPX/41e45cc5-a82e-463d-b76f-de9aea6b94a8/tmprfGrSm.tmp[nameparse=darkspace]']
WARNING:toil.leader:x/i/job8eXgYy    FAILURE: call to malloc failed to allocate 72,242,864 bytes, for new_position_table

The job isn't being given enough memory, either because Cactus is not asking for enough memory for it when scheduling the job in Toil, or because Toil is not properly sending the memory requirement to SGE, or because when SGE schedules the job the system doesn't actually have the RAM available that SGE promised.

I think this is all happening before Kyoto Tycoon gets started at all, which explains why you're seeing the same results no matter what zlib version/Kyoto Cabinet/whatever you install.

Can you go back to exactly your initial configuration from when you first reported the bug, and reproduce the original KtServerService failure message? If so, you should be able to apply the -th 32 change on top of that to solve the problem. If, with exactly your initial configuration, you instead get errors about cPecanLastz not being able to allocate memory, then something out of your control must have changed in your environment (perhaps cluster load?).

To actually attack cPecanLastz running out of memory, you could try passing --defaultMemory 32G or some other large value to Cactus. I think the alignment jobs here and here are not successfully working out their input file sizes, so they never set a required memory amount and fall back on the "default" memory requirement for jobs that don't specify one, which is 2 GB unless you change it. Perhaps scheduling the jobs with 2 GB memory in Grid Engine used to work, but now the cluster is under higher load and there's no extra memory to be had beyond what is actually requested.
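The fallback behaviour described here can be sketched like this (hypothetical helper name; Toil's real scheduling code is more involved):

```python
DEFAULT_MEMORY = 2 * 1024**3  # Toil's --defaultMemory fallback, 2 GiB

def scheduled_memory(requested=None, default=DEFAULT_MEMORY):
    """Memory a job is submitted with: its own request if set, else the default."""
    return requested if requested is not None else default

# A job whose memory requirement was never computed gets only the fallback...
assert scheduled_memory() == 2 * 1024**3
# ...which is why raising --defaultMemory lifts the floor for exactly those jobs.
assert scheduled_memory(default=32 * 1024**3) == 32 * 1024**3
print("explicit requests always win:", scheduled_memory(30 * 1024**3))
```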

amizeranschi commented 5 years ago

The cluster queue that I'm using to run the test doesn't seem to be under load. Only 8 cores out of the total of 448 (14 nodes x 32 cores) have been used in the last couple of days. That should leave 13 nodes completely unused.

For now, at least, I can't go back to the original configuration, because I can't access the cluster nodes that I used previously for compiling, as the corresponding queue has been unavailable (under use) for the past couple of days. To be more precise, the cluster that I use has multiple queues with different sets of nodes, and only two sets of them (two queues) have the necessary packages for compiling stuff. With the other queues I got errors such as /usr/bin/ld: cannot find -lstdc++ during compiling.

The last test that I posted before this, a few hours ago, involved recompiling everything from scratch, with no other changes apart from the ones mentioned there:

  1. different node used for compiling compared to the one I used when getting the ktserver errors
  2. zlib-1.2.11 compiled and linked locally (ld was complaining about not finding -lz on the new nodes, but didn't on the ones I used previously)
  3. the Kyoto Cabinet version with your modifications was used for compiling progressiveCactus.

Do you have a clue about how (or if) any of those changes could be causing these new cPecanLastz memory errors?

I'll try increasing the --defaultMemory to see if this helps and I'll report back about this tomorrow.

adamnovak commented 5 years ago

None of those changes sound like they should make cPecanLastz run out of memory to me. It might be that the compiler version you are using now is different, and it is producing less/differently optimized code that ends up using just enough more memory to make it run out now when it didn't before, but that's a long shot.


amizeranschi commented 5 years ago

I added the option --defaultMemory 30G and it looks like it did make a difference. I still saw lots of memory errors, which I assume are related to this other issue: https://github.com/ComparativeGenomicsToolkit/cactus/issues/52, where the --defaultMemory option seems to get ignored for some job submissions including RunBlast.

However, things progressed further this time and the ktserver process managed to create 64 threads. In the end, the whole thing still crashed and I also saw some new error messages such as:

WARNING:toil.leader:j/V/jobshfOQv    INFO:cactus.shared.common:Running the command ['ktremotemgr', 'set', '-port', '32149', '-host', '172.16.13.41', 'TERMINATE', '1']
WARNING:toil.leader:j/V/jobshfOQv    INFO:cactus.shared.common:Running the command ['ktremotemgr', 'get', '-port', '32149', '-host', '172.16.13.41', 'TERMINATE']
WARNING:toil.leader:j/V/jobshfOQv    1
WARNING:toil.leader:j/V/jobshfOQv    Traceback (most recent call last):
WARNING:toil.leader:j/V/jobshfOQv      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/worker.py", line 314, in workerScript
WARNING:toil.leader:j/V/jobshfOQv        job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
WARNING:toil.leader:j/V/jobshfOQv      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/job.py", line 1351, in _runner
WARNING:toil.leader:j/V/jobshfOQv        returnValues = self._run(jobGraph, fileStore)
WARNING:toil.leader:j/V/jobshfOQv      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/job.py", line 1694, in _run
WARNING:toil.leader:j/V/jobshfOQv        returnValues = self.run(fileStore)
WARNING:toil.leader:j/V/jobshfOQv      File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/job.py", line 1668, in run
WARNING:toil.leader:j/V/jobshfOQv        raise RuntimeError("Detected the error jobStoreID has been removed so exiting with an error")
WARNING:toil.leader:j/V/jobshfOQv    RuntimeError: Detected the error jobStoreID has been removed so exiting with an error
WARNING:toil.leader:j/V/jobshfOQv    ERROR:toil.worker:Exiting the worker because of a failed job on host haswell-wn41.grid.pub.ro

The full output is here: https://pastebin.com/hRSnKUQB.

adamnovak commented 5 years ago

You definitely do seem to have problems with jobs being kicked off with 100 MB of memory. Maybe the code at https://github.com/ComparativeGenomicsToolkit/cactus/blob/450da744ae7375dd2b5cfb4b304fdf582beb62e3/src/cactus/blast/blast.py#L426 actually is running, but the input files are very small, so you get a small size that is rounded up to 100 MB, which is insufficient?

Try patching here and here to add a few billion bytes of memory to be required independent of the input file sizes. Also try upping the ~2.5GB minimum memory value in here for the KTServer to something larger.
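The hypothesis and the suggested patch together amount to something like this (illustrative numbers and names; the real expressions live in blast.py):

```python
GiB = 1024**3

def blast_memory(input_size, scale=3.5, floor=100 * 1024**2, headroom=0):
    # Hypothetical model of the requirement: scale with input size, round up
    # to a floor, and optionally add fixed headroom (the suggested patch).
    return max(int(scale * input_size), floor) + headroom

tiny_fasta = 600 * 1024  # the evolverMammals inputs are ~600 KB each

print(blast_memory(tiny_fasta))                    # floored to 100 MB: too small
print(blast_memory(tiny_fasta, headroom=5 * GiB))  # with a few GiB added on top
```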

You also have:

INFO:toil.leader:Issued job 'KtServerService' l/4/jobKX6VB_ with job batch system ID: 155 and cores: 0, disk: 2.0 G, and memory: 2.3 G
INFO:toil.leader:Issued job 'CactusSetupPhase' 4/h/jobVNCJKg with job batch system ID: 156 and cores: 1, disk: 2.0 G, and memory: 3.3 G
WARNING:toil.leader:Job failed with exit value 137: 'KtServerService' l/4/jobKX6VB_
WARNING:toil.leader:No log file is present, despite job failing: 'KtServerService' l/4/jobKX6VB_

That looks like your job is getting a SIGKILL sent to it somehow (causing exit code 137). That would explain the absence of a log file (because the Toil worker itself is getting killed before uploading the log). I suspect you're triggering the OOM killer somehow. Maybe the OOM killer (or something similar) starts killing your Toil processes when they hit their memory limits according to Grid Engine?
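The 137 can be decoded mechanically: batch systems conventionally report 128 + signal number for signal-killed jobs, which Python's standard signal module makes easy to check:

```python
import signal

status = 137  # exit value reported by Toil for the KtServerService job

# Exit codes above 128 mean "killed by signal (status - 128)".
sig = signal.Signals(status - 128)
print(sig.name)  # SIGKILL: consistent with the OOM killer or the batch
                 # system enforcing a memory limit; SIGKILL can't be caught,
                 # which is why no log file was uploaded.
```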

The bit of the log you pulled out looks to be the KTServerJob shutting down because all of Toil is shutting down, because of an error encountered in some other job.

amizeranschi commented 5 years ago

Thanks for still looking into this. I'll give those patches a try and see if they help.

Regarding the messages RuntimeError: Detected the error jobStoreID has been removed so exiting with an error, I'm seeing several of these during the run, on lines 1171, 1672, 2168 etc. in the output. They appear every time after the ktserver and ktremotemgr failures.

What did you mean by "when all of Toil is shutting down"? Is Toil also shutting down and restarting multiple times during the Cactus run?

adamnovak commented 5 years ago

I think what is happening is that your server job is failing to start, getting more memory assigned to it, and then starting successfully on the second try. But other jobs in the workflow are failing, so the Toil master decides to stop the whole workflow, which means it has to shut down the running server job. It does this by sending a signal to it by removing a file from the job store, which the Toil worker responsible for the server job sees, which makes it throw that error in order to shut itself down.
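The shutdown mechanism described here, a flag file whose removal tells the running service to stop, can be sketched minimally (hypothetical paths; Toil's actual job-store layout differs):

```python
import os
import tempfile

# The leader creates a flag file when it starts the service...
store = tempfile.mkdtemp()
flag = os.path.join(store, "keep-running")
open(flag, "w").close()

def service_should_keep_running():
    # ...and the worker's check() loop polls for it; absence means shut down.
    return os.path.exists(flag)

assert service_should_keep_running()
os.remove(flag)  # the leader signals shutdown by deleting the file
assert not service_should_keep_running()
print("service saw shutdown signal")
```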


amizeranschi commented 5 years ago

Regarding the small file inputs for RunBlast, that sounds spot on. I'm testing things with Cactus' evolverMammals example, which uses 5 small FASTA files (simCow.chr6, simDog.chr6 etc.). Each file takes around 600 KB of space.

I got access to the old nodes where I used to compile things, so I went back and recompiled progressiveCactus (with the modified Kyoto Cabinet) and Cactus from there. Those machines have the same GCC versions as the ones I've used in the past couple of days for compiling:

$ gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)

I've appended the following bit of code, + 5 * 1024 * 1024 * 1024, to the end of the lines:

https://github.com/ComparativeGenomicsToolkit/cactus/blob/450da744ae7375dd2b5cfb4b304fdf582beb62e3/src/cactus/blast/blast.py#L426

and

https://github.com/ComparativeGenomicsToolkit/cactus/blob/450da744ae7375dd2b5cfb4b304fdf582beb62e3/src/cactus/blast/blast.py#L393

I've also changed the value 2500000000 to 10000000000 (10 GB) on the following line:

https://github.com/ComparativeGenomicsToolkit/cactus/blob/466b5bb8727576d2f58c82222e71f1f2da65404e/src/cactus/pipeline/cactus_workflow.py#L188

After recompiling Cactus, I ran another test with the evolverMammals input files.

This particular test got stuck in some sort of a message loop. This is a relatively rare occurrence, which is why I haven't reported it yet, but I've had it happen a few times before, ever since I started testing Cactus.

Basically, at some stage during the analysis (I think I've seen it happen at different stages), some Toil worker appears to get stuck in a message spree (several messages per second) and the overall job stops progressing from that moment on.

I canceled the job the moment I noticed the message spree. Here's part of the output (trimmed to keep it under 512 KB for pastebin): https://pastebin.com/Dn7ZZwPk.

I also saw some other messages scattered between the never-ending ones, which suggests that other jobs may have kept going. However, I wouldn't want to leave the job running once those messages start appearing over and over.

[...]
INFO:toil.leader:Job ended successfully: 'logAssemblyStats' C/h/jobixfkF9
INFO:toil.statsAndLogging:Got message from job at time 03-21-2019 15:00:49: After preprocessing, got assembly stats for genome simMouse_chr6: Input-sample: /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-59304c1f-da00-4705-8393-061318b6effa-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmpJR1ln1/26a8498d-ad6f-404f-925a-83d8589e5868/tmpuVeBAB.tmp Total-sequences: 1 Total-length: 636262 Proportion-repeat-masked: 0.056477 ProportionNs: 0.000000 Total-Ns: 0 N50: 636262 Median-sequence-length: 636262 Max-sequence-length: 636262 Min-sequence-length: 636262

INFO:toil.statsAndLogging:Got message from job at time 03-21-2019 15:00:49: After preprocessing, got assembly stats for genome simMouse_chr6: Input-sample: /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-59304c1f-da00-4705-8393-061318b6effa-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmpJR1ln1/26a8498d-ad6f-404f-925a-83d8589e5868/tmpuVeBAB.tmp Total-sequences: 1 Total-length: 636262 Proportion-repeat-masked: 0.056477 ProportionNs: 0.000000 Total-Ns: 0 N50: 636262 Median-sequence-length: 636262 Max-sequence-length: 636262 Min-sequence-length: 636262

INFO:toil.statsAndLogging:Got message from job at time 03-21-2019 15:00:50: After preprocessing, got assembly stats for genome simMouse_chr6: Input-sample: /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-59304c1f-da00-4705-8393-061318b6effa-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmpJR1ln1/26a8498d-ad6f-404f-925a-83d8589e5868/tmpuVeBAB.tmp Total-sequences: 1 Total-length: 636262 Proportion-repeat-masked: 0.056477 ProportionNs: 0.000000 Total-Ns: 0 N50: 636262 Median-sequence-length: 636262 Max-sequence-length: 636262 Min-sequence-length: 636262

INFO:toil.statsAndLogging:Got message from job at time 03-21-2019 15:00:50: After preprocessing, got assembly stats for genome simMouse_chr6: Input-sample: /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-59304c1f-da00-4705-8393-061318b6effa-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmpJR1ln1/26a8498d-ad6f-404f-925a-83d8589e5868/tmpuVeBAB.tmp Total-sequences: 1 Total-length: 636262 Proportion-repeat-masked: 0.056477 ProportionNs: 0.000000 Total-Ns: 0 N50: 636262 Median-sequence-length: 636262 Max-sequence-length: 636262 Min-sequence-length: 636262

INFO:toil.statsAndLogging:Got message from job at time 03-21-2019 15:00:50: After preprocessing, got assembly stats for genome simMouse_chr6: Input-sample: /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-59304c1f-da00-4705-8393-061318b6effa-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmpJR1ln1/26a8498d-ad6f-404f-925a-83d8589e5868/tmpuVeBAB.tmp Total-sequences: 1 Total-length: 636262 Proportion-repeat-masked: 0.056477 ProportionNs: 0.000000 Total-Ns: 0 N50: 636262 Median-sequence-length: 636262 Max-sequence-length: 636262 Min-sequence-length: 636262

INFO:toil.leader:Job ended successfully: 'logAssemblyStats' d/u/job0MEcOB
INFO:toil.statsAndLogging:Got message from job at time 03-21-2019 15:00:50: After preprocessing, got assembly stats for genome simMouse_chr6: Input-sample: /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-59304c1f-da00-4705-8393-061318b6effa-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmpJR1ln1/26a8498d-ad6f-404f-925a-83d8589e5868/tmpuVeBAB.tmp Total-sequences: 1 Total-length: 636262 Proportion-repeat-masked: 0.056477 ProportionNs: 0.000000 Total-Ns: 0 N50: 636262 Median-sequence-length: 636262 Max-sequence-length: 636262 Min-sequence-length: 636262

INFO:toil.statsAndLogging:Got message from job at time 03-21-2019 15:00:50: After preprocessing, got assembly stats for genome simMouse_chr6: Input-sample: /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-59304c1f-da00-4705-8393-061318b6effa-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmpJR1ln1/26a8498d-ad6f-404f-925a-83d8589e5868/tmpuVeBAB.tmp Total-sequences: 1 Total-length: 636262 Proportion-repeat-masked: 0.056477 ProportionNs: 0.000000 Total-Ns: 0 N50: 636262 Median-sequence-length: 636262 Max-sequence-length: 636262 Min-sequence-length: 636262

INFO:toil.statsAndLogging:Got message from job at time 03-21-2019 15:00:50: After preprocessing, got assembly stats for genome simMouse_chr6: Input-sample: /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-59304c1f-da00-4705-8393-061318b6effa-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmpJR1ln1/26a8498d-ad6f-404f-925a-83d8589e5868/tmpuVeBAB.tmp Total-sequences: 1 Total-length: 636262 Proportion-repeat-masked: 0.056477 ProportionNs: 0.000000 Total-Ns: 0 N50: 636262 Median-sequence-length: 636262 Max-sequence-length: 636262 Min-sequence-length: 636262

INFO:toil.statsAndLogging:Got message from job at time 03-21-2019 15:00:50: After preprocessing, got assembly stats for genome simMouse_chr6: Input-sample: /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-59304c1f-da00-4705-8393-061318b6effa-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmpJR1ln1/26a8498d-ad6f-404f-925a-83d8589e5868/tmpuVeBAB.tmp Total-sequences: 1 Total-length: 636262 Proportion-repeat-masked: 0.056477 ProportionNs: 0.000000 Total-Ns: 0 N50: 636262 Median-sequence-length: 636262 Max-sequence-length: 636262 Min-sequence-length: 636262

INFO:toil.statsAndLogging:Got message from job at time 03-21-2019 15:00:50: After preprocessing, got assembly stats for genome simMouse_chr6: Input-sample: /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-59304c1f-da00-4705-8393-061318b6effa-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmpJR1ln1/26a8498d-ad6f-404f-925a-83d8589e5868/tmpuVeBAB.tmp Total-sequences: 1 Total-length: 636262 Proportion-repeat-masked: 0.056477 ProportionNs: 0.000000 Total-Ns: 0 N50: 636262 Median-sequence-length: 636262 Max-sequence-length: 636262 Min-sequence-length: 636262

INFO:toil.statsAndLogging:Got message from job at time 03-21-2019 15:00:50: After preprocessing, got assembly stats for genome simMouse_chr6: Input-sample: /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-59304c1f-da00-4705-8393-061318b6effa-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmpJR1ln1/26a8498d-ad6f-404f-925a-83d8589e5868/tmpuVeBAB.tmp Total-sequences: 1 Total-length: 636262 Proportion-repeat-masked: 0.056477 ProportionNs: 0.000000 Total-Ns: 0 N50: 636262 Median-sequence-length: 636262 Max-sequence-length: 636262 Min-sequence-length: 636262

INFO:toil.leader:Job ended successfully: 'ProgressiveDown' a/7/jobxNaBa2
INFO:toil.statsAndLogging:Got message from job at time 03-21-2019 15:00:51: After preprocessing, got assembly stats for genome simMouse_chr6: Input-sample: /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-59304c1f-da00-4705-8393-061318b6effa-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmpJR1ln1/26a8498d-ad6f-404f-925a-83d8589e5868/tmpuVeBAB.tmp Total-sequences: 1 Total-length: 636262 Proportion-repeat-masked: 0.056477 ProportionNs: 0.000000 Total-Ns: 0 N50: 636262 Median-sequence-length: 636262 Max-sequence-length: 636262 Min-sequence-length: 636262

INFO:toil.leader:Issued job 'ProgressiveNext' D/L/jobtHeuaH with job batch system ID: 90 and cores: 1, disk: 2.0 G, and memory: 3.3 G
INFO:toil.statsAndLogging:Got message from job at time 03-21-2019 15:00:51: After preprocessing, got assembly stats for genome simMouse_chr6: Input-sample: /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-59304c1f-da00-4705-8393-061318b6effa-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmpJR1ln1/26a8498d-ad6f-404f-925a-83d8589e5868/tmpuVeBAB.tmp Total-sequences: 1 Total-length: 636262 Proportion-repeat-masked: 0.056477 ProportionNs: 0.000000 Total-Ns: 0 N50: 636262 Median-sequence-length: 636262 Max-sequence-length: 636262 Min-sequence-length: 636262

I resubmitted the job, and the message loop happened again at the same step; the only difference was that this time the message was referring to a different input file (simRat_chr6):

INFO:toil.statsAndLogging:Got message from job at time 03-21-2019 16:03:58: Before preprocessing, got assembly stats for genome simRat_chr6: Input-sample: /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp/toil-80892356-ece5-40db-aab7-a884cafc14d1-ee32135c-bc45-4f9b-bd5f-12666414cf0b/tmpyfsLD3/9c8f168d-8d75-495c-b4d5-5c77f3f4896b/tmpIMAbGv.tmp Total-sequences: 1 Total-length: 647215 Proportion-repeat-masked: 0.075276 ProportionNs: 0.000000 Total-Ns: 0 N50: 647215 Median-sequence-length: 647215 Max-sequence-length: 647215 Min-sequence-length: 647215

After a third attempt, the job managed to pass that stage successfully, without starting the message spree.

This time, the past job failures didn't occur, for the most part. One exception was the following:

INFO:toil.leader:Issued job 'Job' 6/Y/joba1RVOo with job batch system ID: 144 and cores: 0, disk: 1.0 M, and memory: 32.0 M
WARNING:toil.leader:Job failed with exit value 1: 'Job' 6/Y/joba1RVOo
WARNING:toil.leader:No log file is present, despite job failing: 'Job' 6/Y/joba1RVOo
WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'Job' 6/Y/joba1RVOo with ID 6/Y/joba1RVOo to 5
WARNING:toil.jobGraph:We have increased the default memory of the failed job 'Job' 6/Y/joba1RVOo to 32212254720 bytes
INFO:toil.leader:Issued job 'Job' 6/Y/joba1RVOo with job batch system ID: 145 and cores: 0, disk: 1.0 M, and memory: 30.0 G
INFO:toil.leader:Job ended successfully: 'Job' 6/Y/joba1RVOo

I'm guessing this job has the same problem as RunBlast had earlier -- it first gets submitted with a 32M memory request, due to the small size of the input files.

I still got errors with jobs failing with exit code 137 and Toil workers shutting down:

$ grep "137" /export/home/ncit/external/a.mizeranschi/temp/cactus-test/runCactusTest.sh.e1073141
INFO:toil.leader:Issued job 'BlastFirstOutgroup' b/M/jobsmlT6L with job batch system ID: 137 and cores: 1, disk: 2.0 G, and memory: 2.0 G
WARNING:toil.leader:Job failed with exit value 137: 'KtServerService' y/w/jobl33AQY
WARNING:toil.leader:Job failed with exit value 137: 'CactusCafWrapper' k/3/jobfOZz59
WARNING:toil.leader:Job failed with exit value 137: 'KtServerService' W/K/jobv65jYM
WARNING:toil.leader:Job failed with exit value 137: 'KtServerService' w/5/jobiOcu5l

However, those jobs seemed to end successfully after getting reissued:

$ grep "KtServerService" /export/home/ncit/external/a.mizeranschi/temp/cactus-test/runCactusTest.sh.e1073141
INFO:toil.leader:Issued job 'KtServerService' y/w/jobl33AQY with job batch system ID: 151 and cores: 0, disk: 2.0 G, and memory: 2.3 G
WARNING:toil.leader:Job failed with exit value 137: 'KtServerService' y/w/jobl33AQY
WARNING:toil.leader:No log file is present, despite job failing: 'KtServerService' y/w/jobl33AQY
WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'KtServerService' y/w/jobl33AQY with ID y/w/jobl33AQY to 5
WARNING:toil.jobGraph:We have increased the default memory of the failed job 'KtServerService' y/w/jobl33AQY to 32212254720 bytes
INFO:toil.leader:Issued job 'KtServerService' y/w/jobl33AQY with job batch system ID: 153 and cores: 0, disk: 2.0 G, and memory: 30.0 G
INFO:toil.leader:Job ended successfully: 'KtServerService' y/w/jobl33AQY
$ grep "CactusCafWrapper" /export/home/ncit/external/a.mizeranschi/temp/cactus-test/runCactusTest.sh.e1073141
        <CactusCafWrapper maxFlowerGroupSize="25000000" minFlowerSize="1" />
        <CactusCafWrapperLarge2 overlargeMemory="5000000000" />
INFO:toil.leader:Issued job 'CactusCafWrapper' k/3/jobfOZz59 with job batch system ID: 156 and cores: 1, disk: 2.0 G, and memory: 200.0 M
WARNING:toil.leader:Job failed with exit value 137: 'CactusCafWrapper' k/3/jobfOZz59
WARNING:toil.leader:No log file is present, despite job failing: 'CactusCafWrapper' k/3/jobfOZz59
WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'CactusCafWrapper' k/3/jobfOZz59 with ID k/3/jobfOZz59 to 5
WARNING:toil.jobGraph:We have increased the default memory of the failed job 'CactusCafWrapper' k/3/jobfOZz59 to 32212254720 bytes
INFO:toil.leader:Issued job 'CactusCafWrapper' k/3/jobfOZz59 with job batch system ID: 157 and cores: 1, disk: 2.0 G, and memory: 30.0 G
INFO:toil.leader:Job ended successfully: 'CactusCafWrapper' k/3/jobfOZz59

What worries me is that I still see those Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 172.16.13.35 with error: network error messages, which I mentioned when I first started this thread.
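As a sanity check on the "network error" part, a small TCP probe run from a worker node can at least tell whether the database host/port is reachable at all. This is just a sketch: the host comes from the error message above, and the port is only illustrative, since the actual port is assigned when the server starts.

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g., from a worker node (1978 is Kyoto Tycoon's default port; the
# port actually used by the KtServerService job may differ):
# print(can_connect("172.16.13.35", 1978))
```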

This time, I'm also seeing errors in the CactusBarRecursion phase. I think this is the first time any of my tests got this far. The job is still running, but I'm guessing it will eventually crash:

$ grep "CactusBarRecursion" /export/home/ncit/external/a.mizeranschi/temp/cactus-test/runCactusTest.sh.e1073141
        <CactusBarRecursion maxFlowerGroupSize="100000000" />
INFO:toil.leader:Issued job 'CactusBarRecursion' f/j/jobTGunlv with job batch system ID: 167 and cores: 1, disk: 2.0 G, and memory: 2.0 G
INFO:toil.leader:Job ended successfully: 'CactusBarRecursion' f/j/jobTGunlv
WARNING:toil.leader:The job seems to have left a log file, indicating failure: 'CactusBarRecursion' f/j/jobTGunlv
WARNING:toil.leader:f/j/jobTGunlv    WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'CactusBarRecursion' f/j/jobTGunlv with ID f/j/jobTGunlv to 5
WARNING:toil.leader:f/j/jobTGunlv    WARNING:toil.jobGraph:We have increased the default memory of the failed job 'CactusBarRecursion' f/j/jobTGunlv to 32212254720 bytes
INFO:toil.leader:Issued job 'CactusBarRecursion' f/j/jobTGunlv with job batch system ID: 168 and cores: 1, disk: 2.0 G, and memory: 30.0 G
INFO:toil.leader:Job ended successfully: 'CactusBarRecursion' f/j/jobTGunlv
WARNING:toil.leader:The job seems to have left a log file, indicating failure: 'CactusBarRecursion' f/j/jobTGunlv
WARNING:toil.leader:f/j/jobTGunlv    WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'CactusBarRecursion' f/j/jobTGunlv with ID f/j/jobTGunlv to 4
INFO:toil.leader:Issued job 'CactusBarRecursion' f/j/jobTGunlv with job batch system ID: 169 and cores: 1, disk: 2.0 G, and memory: 30.0 G
INFO:toil.leader:Job ended successfully: 'CactusBarRecursion' f/j/jobTGunlv
WARNING:toil.leader:The job seems to have left a log file, indicating failure: 'CactusBarRecursion' f/j/jobTGunlv
WARNING:toil.leader:f/j/jobTGunlv    WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'CactusBarRecursion' f/j/jobTGunlv with ID f/j/jobTGunlv to 3
INFO:toil.leader:Issued job 'CactusBarRecursion' f/j/jobTGunlv with job batch system ID: 170 and cores: 1, disk: 2.0 G, and memory: 30.0 G
INFO:toil.leader:Job ended successfully: 'CactusBarRecursion' f/j/jobTGunlv
WARNING:toil.leader:The job seems to have left a log file, indicating failure: 'CactusBarRecursion' f/j/jobTGunlv
WARNING:toil.leader:f/j/jobTGunlv    WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'CactusBarRecursion' f/j/jobTGunlv with ID f/j/jobTGunlv to 2
INFO:toil.leader:Issued job 'CactusBarRecursion' f/j/jobTGunlv with job batch system ID: 171 and cores: 1, disk: 2.0 G, and memory: 30.0 G
INFO:toil.leader:Job ended successfully: 'CactusBarRecursion' f/j/jobTGunlv
WARNING:toil.leader:The job seems to have left a log file, indicating failure: 'CactusBarRecursion' f/j/jobTGunlv
WARNING:toil.leader:f/j/jobTGunlv    WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'CactusBarRecursion' f/j/jobTGunlv with ID f/j/jobTGunlv to 1
INFO:toil.leader:Issued job 'CactusBarRecursion' f/j/jobTGunlv with job batch system ID: 172 and cores: 1, disk: 2.0 G, and memory: 30.0 G
INFO:toil.leader:Job ended successfully: 'CactusBarRecursion' f/j/jobTGunlv
WARNING:toil.leader:The job seems to have left a log file, indicating failure: 'CactusBarRecursion' f/j/jobTGunlv
WARNING:toil.leader:f/j/jobTGunlv    WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'CactusBarRecursion' f/j/jobTGunlv with ID f/j/jobTGunlv to 0
WARNING:toil.leader:Job 'CactusBarRecursion' f/j/jobTGunlv with ID f/j/jobTGunlv is completely failed
WARNING:toil.leader:Job 'KtServerService' W/K/jobv65jYM with ID W/K/jobv65jYM is completely failed
WARNING:toil.leader:Job: F/N/jobjE7L7P is being restarted as a checkpoint after the total failure of jobs in its subtree.
INFO:toil.leader:Issued job 'StartPrimaryDB' F/N/jobjE7L7P with job batch system ID: 173 and cores: 1, disk: 2.0 G, and memory: 3.3 G
amizeranschi commented 5 years ago

It looks like I was right and the job eventually crashed. The full output is here: https://pastebin.com/FREN7eXE.

INFO:toil.leader:Finished toil run with 12 failed jobs.
INFO:toil.leader:Failed jobs at end of the run: 'CactusBarCheckpoint' W/3/job2iTJmv 'ProgressiveUp' x/7/jobZa6Q4d 'CactusBarPhase' a/L/jobFubjwb 'StartPrimaryDB' F/N/jobjE7L7P 'CactusSetupCheckpoint' 0/n/job78s4Sf 'ProgressiveNext' I/Z/jobcxEu4q 'ProgressiveDown' A/O/jobcUnhLZ 'CactusBarRecursion' C/U/jobAqFEMj 'RunCactusPreprocessorThenProgressiveDown' 2/3/jobIvGmLa 'RunCactusPreprocessorThenProgressiveDown2' K/E/jobZCdRhw 'CactusTrimmingBlastPhase' 3/j/job5LKEDg 'KtServerService' D/6/jobs_kAaE

Could those jobs also be failing because of too little memory being allocated?

adamnovak commented 5 years ago

Jobs with failed descendants will also show up as failed jobs at the end; it's not clear what the lowest-level failed job is here.

@glennhickey or maybe @joelarmstrong might have some better advice than me about how to adapt the whole pipeline to deal with such tiny files as you are using. There might be other places where you have to impose a minimum memory requirement.

I'm not sure why jobs are unable to contact the database server even though it is getting re-started. Probably it is because it comes up fine, but crashes when it starts getting sent data to store, due to hitting the memory limit and being killed. Then it gets restarted, but either dependent jobs fail in its absence, or it gets restarted on a different host and port and isn't accessible, so it doesn't actually fix anything to restart it. You have to make sure it doesn't fail in the first place.

Your edits to the KTServer memory calculation must not have taken; it's still being started with 2.3GB of memory, and not the minimum 10 GB you assigned:

INFO:toil.leader:Issued job 'KtServerService' y/w/jobl33AQY with job batch system ID: 151 and cores: 0, disk: 2.0 G, and memory: 2.3 G
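The intended change amounts to clamping the computed requirement to a floor before handing it to Toil. A minimal sketch of that idea, where the function name and the 10 GiB floor are illustrative rather than the actual cactus_workflow.py code:

```python
def ktserver_memory(estimated_bytes, floor_bytes=10 * 1024**3):
    """Clamp a size-derived memory estimate to a minimum, so tiny test
    inputs don't produce unrealistically small KTServer requests."""
    return max(estimated_bytes, floor_bytes)

# A 2.3 GB estimate would be raised to the 10 GiB floor, while larger
# estimates pass through unchanged.
```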

As for the message loop, can you report that as a separate bug? I think it's coming from here: https://github.com/ComparativeGenomicsToolkit/cactus/blob/157ed0cca83ff56b42fd216d9f95011620253df2/src/cactus/progressive/cactus_progressive.py#L202-L205

I guess that job is getting issued once for each something in the assembly, of which there are a lot, and is being artificially inflated in size by the default requirements we're specifying, which is why the whole workflow stalls out while all these jobs reserve loads of resources to do very little. If you don't want the stats it is computing, I think you can just knock it out here.
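A sketch of what "knocking it out" could look like: gate the logging call behind a flag so the stats messages are only forwarded when wanted. The flag and helper names here are illustrative, not the actual cactus_progressive.py code.

```python
def maybe_log_stats(log_fn, message, want_assembly_stats=False):
    """Forward an assembly-stats message to the leader only on request.

    log_fn stands in for fileStore.logToMaster at the linked call site.
    """
    if want_assembly_stats:
        log_fn(message)
```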

amizeranschi commented 5 years ago

Regarding the message loop, I opened a new thread here: https://github.com/ComparativeGenomicsToolkit/cactus/issues/66.

I'm not sure if commenting out the bit of code here will solve this, because I vaguely remember seeing these loops for other messages as well, at other stages of the analysis. I'm guessing that this kind of issue can occur whenever a message is being passed (INFO:toil.statsAndLogging:Got message from job at time [...]). The fact that it doesn't happen on every run makes it harder to investigate.

Regarding adapting the pipeline to work with small files in the distributed scenario, I'm not sure if this should be a priority, but I did expect it to work "out of the box". I also ran the test with the small files locally, i.e. WITHOUT the --batchSystem gridEngine option, so that Cactus runs on a single SGE cluster node, as such:

cactus --binariesMode local cactusWork evolverMammals-offline.txt evolverMammals.hal --root mr

In this case, the run went perfectly. Toil ran 350 jobs locally, without any memory errors like I'm seeing here, and it only took 15 minutes of wall time. The full output of such a successful job is here: https://pastebin.com/WBJPegaf.

When running WITH the --batchSystem gridEngine option (as in all the previous attempts), i.e. like this:

cactus --binariesMode local cactusWork evolverMammals-offline.txt evolverMammals.hal --root mr --batchSystem gridEngine --workDir /export/home/ncit/external/a.mizeranschi/temp/cactus-test/cactusTemp --logInfo --logFile cactus.log --maxCores 32 --defaultMemory 30G --disableCaching

it takes around 3-4 hours from submission until it crashes, after around 200 jobs (a little over half-way through in terms of number of jobs), which crudely suggests an approx. 20x increase in wall time over the local, single-node run. This difference seems staggering.

How can this be explained? Is the overhead due to SGE's job scheduling so high? Or is Toil taking a longer time to manage the jobs when running over the scheduler? Whenever a job actually gets submitted (i.e. when it appears in the output of qstat -u <my_username>), I see it only takes a couple of seconds before it changes its state to running. Most of the time when running Cactus distributed, I only see the main job running, and none of the workers, even though 13 out of 14 cluster nodes are completely unused (i.e. it's not due to a lack of resources).

If there are 350 Toil jobs used to analyze these small input files, would this number get larger for plant- or mammal-sized genomes? Or does the number of Toil jobs only depend on the number of genomes that are being aligned?

It would be great if @glennhickey or @joelarmstrong could drop by with some advice about all of this, but in the meantime, do you have any idea for how to prevent the database server from failing in the first place? I did make the previous modification you suggested, in:

https://github.com/ComparativeGenomicsToolkit/cactus/blob/466b5bb8727576d2f58c82222e71f1f2da65404e/src/cactus/pipeline/cactus_workflow.py#L188

I'm not sure why that didn't have the right effect.

amizeranschi commented 5 years ago

I also tried setting up a local job using 5 yeast genomes, of around 12-14 megabase-pairs each (around 12-14 MB of space per each file, uncompressed).

Here's the exact list of FASTA files that I used for this test:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/146/045/GCA_000146045.2_R64/GCA_000146045.2_R64_genomic.fna.gz
mv GCA_000146045.2_R64_genomic.fna.gz SC_sacCer3.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/976/185/GCA_000976185.2_Sc_YJM555_v1/GCA_000976185.2_Sc_YJM555_v1_genomic.fna.gz
mv GCA_000976185.2_Sc_YJM555_v1_genomic.fna.gz SC_YJM555.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/766/175/GCA_000766175.2_ASM76617v2/GCA_000766175.2_ASM76617v2_genomic.fna.gz
mv GCA_000766175.2_ASM76617v2_genomic.fna.gz SC_YPS163.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/182/965/GCA_000182965.3_ASM18296v3/GCA_000182965.3_ASM18296v3_genomic.fna.gz
mv GCA_000182965.3_ASM18296v3_genomic.fna.gz CA_SC5314.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/002/545/GCA_000002545.2_ASM254v2/GCA_000002545.2_ASM254v2_genomic.fna.gz
mv GCA_000002545.2_ASM254v2_genomic.fna.gz CG_CBS138.fna.gz

I edited the FASTA headers to keep only one letter/word per header:

sed -i "s/>.* chromosome />/g" SC_sacCer3.fna
sed -i "s/, complete sequence//g" SC_sacCer3.fna

sed -i "s/>.* chromosome />/g" SC_YJM555.fna
sed -i "s/ genomic sequence//g" SC_YJM555.fna
sed -i "s/ sequence//g" SC_YJM555.fna
sed -i "s/>.*plasmid.*/>plasmid/g" SC_YJM555.fna
sed -i "s/>.*mitochondrion.*/>mitochondrion/g" SC_YJM555.fna

sed -i "s/>.*scaffold/>scaffold/g" SC_YPS163.fna
sed -i "s/, whole genome shotgun sequence//g" SC_YPS163.fna
sed -i "s/>.*unplaced/>unplaced/g" SC_YPS163.fna
sed -i "s/ mitochondrial//g" SC_YPS163.fna

sed -i "s/>.* chromosome />/g" CA_SC5314.fna
sed -i "s/ sequence//g" CA_SC5314.fna

sed -i "s/>.* chromosome />/g" CG_CBS138.fna
sed -i "s/ complete sequence//g" CG_CBS138.fna

I've then set up a Cactus run based on these five FASTA files, similar to the evolverMammals example, excluding the Newick tree.

Note: As I didn't have a phylogenetic (Newick) tree for these files to supply as input, this run was expected, according to the documentation, to use a higher number of Toil jobs than the evolverMammals example.
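For reference, the seqFile for this run looked roughly like this (paths illustrative): one genome-name/path pair per line, with the leading Newick-tree line left out:

```
SC_sacCer3 ./SC_sacCer3.fna
SC_YJM555 ./SC_YJM555.fna
SC_YPS163 ./SC_YPS163.fna
CA_SC5314 ./CA_SC5314.fna
CG_CBS138 ./CG_CBS138.fna
```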

The local run took 1.5 hours (88 minutes, to be more precise) and used 890 Toil jobs. Its full output is here: https://pastebin.com/ZFz5QsNX.

I'll also try running the same thing with the option --batchSystem gridEngine.

adamnovak commented 5 years ago

Unfortunately the only thing I can think of to stop the database server dying with exit code 137 (which indicates that somebody killed it) is to make sure that when it first starts it has enough memory, which you are trying to do already.

You could try just manually hardcoding the memory you want for the KtServerService here, or injecting a bunch of real-time log statements for debugging and running with --realTimeLogging to see if you can work out exactly why your passed memory requirement isn't taking effect:

    >>> from toil.realtimeLogger import RealtimeLogger
    >>> RealtimeLogger.info("This logging message goes straight to the leader")
joelarmstrong commented 5 years ago

The number of Toil jobs will rise (roughly) quadratically with the genome size, and (roughly) linearly in the number of genomes. We usually see ~10-30k blast jobs when you get to mammal-sized genomes.

The KyotoTycoon server will use much, much more virtual memory (i.e. total address space) than it does in resident set size. IIRC it will use somewhere on the order of 10x more. Not usually a problem, since unused address space doesn't actually take up any physical memory, however some batch systems, like Parasol, limit the virtual memory of a process rather than the RSS. Maybe your version of GridEngine does this too? If so, you may need to provide a really, really massive amount of memory to the KTServerService job or change the batch system to work on the basis of RSS.
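One way to see that gap is to compare VSZ against RSS for the running server process. The command below is shown against the current shell's PID, but the same works pointed at the ktserver PID:

```shell
# Columns: PID, virtual size (KB), resident set size (KB).
# Substitute the ktserver PID for $$ to inspect the database server.
ps -o pid,vsz,rss -p $$
```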

I can easily imagine the gridengine version taking over 20x the wall time on small examples. To be frank, Toil's gridengine batch system is (depending on the version) either suboptimal or totally broken. You could try running with --maxLocalJobs 10000 -- IIRC it limits the number of submitted jobs to something embarrassingly small by default.

amizeranschi commented 5 years ago

The test based on yeast genomes failed in exactly the same way as the tests based on the evolverMammals small input files:

INFO:toil.leader:Finished toil run with 12 failed jobs.
INFO:toil.leader:Failed jobs at end of the run: 'RunCactusPreprocessorThenProgressiveDown' S/o/job51DlDr 'CactusBarRecursion' 2/R/jobsDixkf 'ProgressiveNext' P/3/jobUyelVB 'CactusBarPhase' C/J/jobXK2yjT 'KtServerService' N/g/job9uS5DS 'CactusSetupCheckpoint' y/U/jobCVFeXz 'ProgressiveUp' x/L/jobMNjZYG 'CactusTrimmingBlastPhase' u/S/jobhfwRMO 'StartPrimaryDB' o/H/jobK044oZ 'ProgressiveDown' g/l/jobFGVVE1 'CactusBarCheckpoint' J/O/jobgSECJU 'RunCactusPreprocessorThenProgressiveDown2' t/h/jobMDYFld
Traceback (most recent call last):
  File "/export/home/ncit/external/a.mizeranschi/toil_conda/bin/cactus", line 11, in <module>
    sys.exit(main())
  File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/cactus/progressive/cactus_progressive.py", line 520, in main
    halID = toil.start(RunCactusPreprocessorThenProgressiveDown(options, project, memory=configWrapper.getDefaultMemory()))
  File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/common.py", line 784, in start
    return self._runMainLoop(rootJobGraph)
  File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/common.py", line 1059, in _runMainLoop
    jobCache=self._jobCache).run()
  File "/export/home/ncit/external/a.mizeranschi/toil_conda/lib/python2.7/site-packages/toil/leader.py", line 237, in run
    raise FailedJobsException(self.config.jobStore, self.toilState.totalFailedJobs, self.jobStore)
toil.leader.FailedJobsException

The full output is here: https://pastebin.com/eXS1SzbW.

I found the following page about resident set size and other memory settings in SGE: https://grid.ifca.es/wiki/Cluster/Usage/GridEngine#Memory_management. It looks like the highmem and h_rss settings could be helpful.

How can I get Toil to pass these options to the KyotoTycoon server and any other relevant jobs that it submits?

adamnovak commented 5 years ago

There is a TOIL_GRIDENGINE_ARGS environment variable documented here, if you need to pass more arguments along with the grid engine jobs.
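For example, to ask SGE for a hard resident-set limit rather than a virtual-memory limit (the h_rss resource name and the 30G value are illustrative; which resources are available depends on how your cluster is configured):

```shell
# Extra arguments Toil passes through to qsub for every submitted job.
export TOIL_GRIDENGINE_ARGS="-l h_rss=30G"
# ...then run cactus with --batchSystem gridEngine as before.
```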

amizeranschi commented 5 years ago

Closing this, as I got things working without distributing jobs on multiple cluster nodes. Thanks for your help so far.