I just tried Python 3.7 by making the cluster with
TOIL_APPLIANCE_SELF=quay.io/ucsc_cgl/toil:4.1.0a1-727b7e1c30938c575717b46f0bc9f88758e3b6e9-py3.7
Then on the cluster (identical to https://github.com/DataBiosphere/toil/issues/3032#issuecomment-611324292)
virtualenv -p python3.7 --system-site-packages v3nv && . v3nv/bin/activate
git clone https://github.com/ComparativeGenomicsToolkit/cactus.git --recursive && cd cactus && pip install -U .
cactus --nodeTypes t2.2xlarge --minNodes 0 --maxNodes 4 --provisioner aws --batchSystem mesos --logDebug aws:us-west-2:deleteme ./examples/evolverMammals.txt output.hal --logFile evolverMammalsLogfile
And the problem reproduces
Log from job 19e5628b-7713-46da-8c93-9ee72a0ce9a4 follows:
=========>
INFO:toil.worker:---TOIL WORKER OUTPUT LOG---
INFO:toil:Running Toil version 4.1.0a1-727b7e1c30938c575717b46f0bc9f88758e3b6e9.
DEBUG:toil:Configuration: {'workflowID': '56faec38-4a89-42a6-81e8-697f900ddc0b', 'workflowAttemptNumber': 0, 'jobStore': 'aws:us-west-2:glennhickey-jobstore3', 'logLevel': 'DEBUG', 'workDir': None, 'noStdOutErr': False, 'stats': False, 'clean': 'onSuccess', 'cleanWorkDir': 'always', 'clusterStats': None, 'restart': False, 'batchSystem': 'mesos', 'disableAutoDeployment': False, 'environment': {}, 'statePollingWait': 1, 'maxLocalJobs': 2, 'manualMemArgs': False, 'parasolCommand': 'parasol', 'parasolMaxBatches': 10000, 'scale': 1, 'linkImports': False, 'moveExports': False, 'mesosMasterAddress': '172.31.17.218:5050', 'kubernetesHostPath': None, 'provisioner': 'aws', 'nodeTypes': ['t2.2xlarge'], 'nodeOptions': None, 'minNodes': [0], 'maxNodes': [4], 'targetTime': 1800, 'betaInertia': 0.1, 'scaleInterval': 60, 'preemptableCompensation': 0.0, 'nodeStorage': 50, 'metrics': False, 'maxPreemptableServiceJobs': 9223372036854775807, 'maxServiceJobs': 9223372036854775807, 'deadlockWait': 3600, 'defaultMemory': 2147483648, 'defaultCores': 1, 'defaultDisk': 2147483648, 'readGlobalFileMutableByDefault': False, 'defaultPreemptable': False, 'maxCores': 9223372036854775807, 'maxMemory': 9223372036854775807, 'maxDisk': 9223372036854775807, 'retryCount': 5, 'enableUnlimitedPreemptableRetries': False, 'maxJobDuration': 9223372036854775807, 'rescueJobsFrequency': 3600, 'disableCaching': True, 'disableChaining': True, 'disableJobStoreChecksumVerification': False, 'maxLogFileSize': 64000, 'writeLogs': None, 'writeLogsGzip': None, 'writeLogsFromAllJobs': False, 'sseKey': None, 'servicePollingInterval': 60, 'useAsync': True, 'forceDockerAppliance': False, 'runCwlInternalJobsOnWorkers': False, 'debugWorker': False, 'disableWorkerOutputCapture': False, 'badWorker': 0.0, 'badWorkerFailInterval': 0.01, 'cwl': False}
DEBUG:toil.deferred:Running for file /var/lib/toil/node-56faec38-4a89-42a6-81e8-697f900ddc0b-e7fcf2eb18e05686d238ad5f5e8ea6a5/deferred/funcb_x30pns
DEBUG:toil.jobStores.aws.jobStore:Loaded job 19e5628b-7713-46da-8c93-9ee72a0ce9a4
DEBUG:toil.worker:Parsed job wrapper
DEBUG:toil.jobStores.aws.jobStore:Updating job 19e5628b-7713-46da-8c93-9ee72a0ce9a4
DEBUG:toil.worker:Got a command to run: _toil 1608ea10-3ce3-4159-9d76-364ca577f340 //v3nv/lib/python3.7/site-packages cactus.progressive.cactus_progressive False
DEBUG:toil.job:Loading user module ModuleDescriptor(dirPath='//v3nv/lib/python3.7/site-packages', name='cactus.progressive.cactus_progressive', fromVirtualEnv=False).
ERROR:toil.resource:Failed to import user module ModuleDescriptor(dirPath='/tmp/tmpa62_zv04/35de369a1fb88fb032a016ffb3481123', name='cactus.progressive.cactus_progressive', fromVirtualEnv=False) from sys.path (['/usr/local/bin', '/usr/lib/python37.zip', '/usr/lib/python3.7', '/usr/lib/python3.7/lib-dynload', '/usr/local/lib/python3.7/dist-packages', '/usr/lib/python3/dist-packages', '/tmp/tmpa62_zv04/35de369a1fb88fb032a016ffb3481123']).
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/toil/worker.py", line 350, in workerScript
job = Job._loadJob(jobGraph.command, jobStore)
File "/usr/local/lib/python3.7/dist-packages/toil/job.py", line 914, in _loadJob
userModule = cls._loadUserModule(userModule)
File "/usr/local/lib/python3.7/dist-packages/toil/job.py", line 891, in _loadUserModule
return userModule.load()
File "/usr/local/lib/python3.7/dist-packages/toil/resource.py", line 609, in load
return importlib.import_module(module.name)
File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 728, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/tmp/tmpa62_zv04/35de369a1fb88fb032a016ffb3481123/cactus/progressive/cactus_progressive.py", line 27, in <module>
from cactus.shared.common import getOptionalAttrib
File "/tmp/tmpa62_zv04/35de369a1fb88fb032a016ffb3481123/cactus/shared/common.py", line 38, in <module>
from sonLib.bioio import popenCatch
ModuleNotFoundError: No module named 'sonLib'
ERROR:toil.worker:Exiting the worker because of a failed job on host ip-172-31-25-89.us-west-2.compute.internal
DEBUG:toil.jobStores.aws.jobStore:Loaded job 19e5628b-7713-46da-8c93-9ee72a0ce9a4
WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4 with ID 19e5628b-7713-46da-8c93-9ee72a0ce9a4 to 3
<=========
Added job: 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4 to active jobs
Built the jobs list, currently have 1 jobs to update and 0 jobs issued
Updating status of job 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4 with ID 19e5628b-7713-46da-8c93-9ee72a0ce9a4: with result status: 0
Queueing the job command: _toil_worker RunCactusPreprocessorThenProgressiveDown aws:us-west-2:glennhickey-jobstore3 19e5628b-7713-46da-8c93-9ee72a0ce9a4 with job id: 3 ...
... queued
Issued job 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4 with job batch system ID: 3 and cores: 1, disk: 2.0 G, and memory: 3.3 G
Got offer 1a42bc92-8952-40e5-9ffb-a4fe8c9e7a0e-O3 for a non-preemptable agent with 31145.00 MiB memory, 8.00 core(s) and 42121.00 MiB of disk.
Preparing to launch Mesos task 3 with 1.00 cores, 3400.00 MiB memory, and 2048.00 MiB disk using offer 1a42bc92-8952-40e5-9ffb-a4fe8c9e7a0e-O3 ...
Launched Mesos task 3.
Job 3 is in state 'TASK_RUNNING' due to reason '{}'.
Job 3 is in state 'TASK_FINISHED' due to reason '{}'.
Job 3 ended with status 0, took 1.6182279586791992 seconds.
Job ended: 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4
Job 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4 continues to exist (i.e. has more to do)
Loaded job 19e5628b-7713-46da-8c93-9ee72a0ce9a4
Reading FileInfo(fileID='d90b771e-3d31-4b85-9aaf-cd3f5c66ef34', ownerID='19e5628b-7713-46da-8c93-9ee72a0ce9a4', encrypted=False, version='Gz00.mzCG8SdXkx8qoumKwRGEsLVZ.W9', previousVersion='Gz00.mzCG8SdXkx8qoumKwRGEsLVZ.W9', content=None, checksum='', _numContentChunks=0) bound to <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7fcb00706250> into stream.
The job seems to have left a log file, indicating failure: 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4
Log from job 19e5628b-7713-46da-8c93-9ee72a0ce9a4 follows:
=========>
INFO:toil.worker:---TOIL WORKER OUTPUT LOG---
INFO:toil:Running Toil version 4.1.0a1-727b7e1c30938c575717b46f0bc9f88758e3b6e9.
DEBUG:toil:Configuration: {'workflowID': '56faec38-4a89-42a6-81e8-697f900ddc0b', 'workflowAttemptNumber': 0, 'jobStore': 'aws:us-west-2:glennhickey-jobstore3', 'logLevel': 'DEBUG', 'workDir': None, 'noStdOutErr': False, 'stats': False, 'clean': 'onSuccess', 'cleanWorkDir': 'always', 'clusterStats': None, 'restart': False, 'batchSystem': 'mesos', 'disableAutoDeployment': False, 'environment': {}, 'statePollingWait': 1, 'maxLocalJobs': 2, 'manualMemArgs': False, 'parasolCommand': 'parasol', 'parasolMaxBatches': 10000, 'scale': 1, 'linkImports': False, 'moveExports': False, 'mesosMasterAddress': '172.31.17.218:5050', 'kubernetesHostPath': None, 'provisioner': 'aws', 'nodeTypes': ['t2.2xlarge'], 'nodeOptions': None, 'minNodes': [0], 'maxNodes': [4], 'targetTime': 1800, 'betaInertia': 0.1, 'scaleInterval': 60, 'preemptableCompensation': 0.0, 'nodeStorage': 50, 'metrics': False, 'maxPreemptableServiceJobs': 9223372036854775807, 'maxServiceJobs': 9223372036854775807, 'deadlockWait': 3600, 'defaultMemory': 2147483648, 'defaultCores': 1, 'defaultDisk': 2147483648, 'readGlobalFileMutableByDefault': False, 'defaultPreemptable': False, 'maxCores': 9223372036854775807, 'maxMemory': 9223372036854775807, 'maxDisk': 9223372036854775807, 'retryCount': 5, 'enableUnlimitedPreemptableRetries': False, 'maxJobDuration': 9223372036854775807, 'rescueJobsFrequency': 3600, 'disableCaching': True, 'disableChaining': True, 'disableJobStoreChecksumVerification': False, 'maxLogFileSize': 64000, 'writeLogs': None, 'writeLogsGzip': None, 'writeLogsFromAllJobs': False, 'sseKey': None, 'servicePollingInterval': 60, 'useAsync': True, 'forceDockerAppliance': False, 'runCwlInternalJobsOnWorkers': False, 'debugWorker': False, 'disableWorkerOutputCapture': False, 'badWorker': 0.0, 'badWorkerFailInterval': 0.01, 'cwl': False}
DEBUG:toil.deferred:Running for file /var/lib/toil/node-56faec38-4a89-42a6-81e8-697f900ddc0b-e7fcf2eb18e05686d238ad5f5e8ea6a5/deferred/funcwu25s7ef
DEBUG:toil.jobStores.aws.jobStore:Loaded job 19e5628b-7713-46da-8c93-9ee72a0ce9a4
DEBUG:toil.worker:Parsed job wrapper
DEBUG:toil.jobStores.aws.jobStore:Updating job 19e5628b-7713-46da-8c93-9ee72a0ce9a4
DEBUG:toil.worker:Got a command to run: _toil 1608ea10-3ce3-4159-9d76-364ca577f340 //v3nv/lib/python3.7/site-packages cactus.progressive.cactus_progressive False
DEBUG:toil.job:Loading user module ModuleDescriptor(dirPath='//v3nv/lib/python3.7/site-packages', name='cactus.progressive.cactus_progressive', fromVirtualEnv=False).
ERROR:toil.resource:Failed to import user module ModuleDescriptor(dirPath='/tmp/tmpa62_zv04/35de369a1fb88fb032a016ffb3481123', name='cactus.progressive.cactus_progressive', fromVirtualEnv=False) from sys.path (['/usr/local/bin', '/usr/lib/python37.zip', '/usr/lib/python3.7', '/usr/lib/python3.7/lib-dynload', '/usr/local/lib/python3.7/dist-packages', '/usr/lib/python3/dist-packages', '/tmp/tmpa62_zv04/35de369a1fb88fb032a016ffb3481123']).
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/toil/worker.py", line 350, in workerScript
job = Job._loadJob(jobGraph.command, jobStore)
File "/usr/local/lib/python3.7/dist-packages/toil/job.py", line 914, in _loadJob
userModule = cls._loadUserModule(userModule)
File "/usr/local/lib/python3.7/dist-packages/toil/job.py", line 891, in _loadUserModule
return userModule.load()
File "/usr/local/lib/python3.7/dist-packages/toil/resource.py", line 609, in load
return importlib.import_module(module.name)
File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 728, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/tmp/tmpa62_zv04/35de369a1fb88fb032a016ffb3481123/cactus/progressive/cactus_progressive.py", line 27, in <module>
from cactus.shared.common import getOptionalAttrib
File "/tmp/tmpa62_zv04/35de369a1fb88fb032a016ffb3481123/cactus/shared/common.py", line 38, in <module>
from sonLib.bioio import popenCatch
ModuleNotFoundError: No module named 'sonLib'
ERROR:toil.worker:Exiting the worker because of a failed job on host ip-172-31-25-89.us-west-2.compute.internal
DEBUG:toil.jobStores.aws.jobStore:Loaded job 19e5628b-7713-46da-8c93-9ee72a0ce9a4
WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4 with ID 19e5628b-7713-46da-8c93-9ee72a0ce9a4 to 2
<=========
Added job: 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4 to active jobs
Built the jobs list, currently have 1 jobs to update and 0 jobs issued
Updating status of job 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4 with ID 19e5628b-7713-46da-8c93-9ee72a0ce9a4: with result status: 0
Queueing the job command: _toil_worker RunCactusPreprocessorThenProgressiveDown aws:us-west-2:glennhickey-jobstore3 19e5628b-7713-46da-8c93-9ee72a0ce9a4 with job id: 4 ...
... queued
Issued job 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4 with job batch system ID: 4 and cores: 1, disk: 2.0 G, and memory: 3.3 G
Got offer 1a42bc92-8952-40e5-9ffb-a4fe8c9e7a0e-O4 for a non-preemptable agent with 31145.00 MiB memory, 8.00 core(s) and 42121.00 MiB of disk.
Preparing to launch Mesos task 4 with 1.00 cores, 3400.00 MiB memory, and 2048.00 MiB disk using offer 1a42bc92-8952-40e5-9ffb-a4fe8c9e7a0e-O4 ...
Launched Mesos task 4.
Job 4 is in state 'TASK_RUNNING' due to reason '{}'.
Job 4 is in state 'TASK_FINISHED' due to reason '{}'.
Job 4 ended with status 0, took 1.557612419128418 seconds.
Job ended: 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4
Job 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4 continues to exist (i.e. has more to do)
Loaded job 19e5628b-7713-46da-8c93-9ee72a0ce9a4
Reading FileInfo(fileID='ffd2f094-59e6-4f4d-844c-79d28d4e21eb', ownerID='19e5628b-7713-46da-8c93-9ee72a0ce9a4', encrypted=False, version='RbZPR4AeifqzFS46NhPEOEhLHzWjOhIu', previousVersion='RbZPR4AeifqzFS46NhPEOEhLHzWjOhIu', content=None, checksum='', _numContentChunks=0) bound to <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7fcb00706250> into stream.
The job seems to have left a log file, indicating failure: 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4
Log from job 19e5628b-7713-46da-8c93-9ee72a0ce9a4 follows:
=========>
@glennhickey I used virtualenv -p python3.7 --system-site-packages v3nv && . v3nv/bin/activate instead of virtualenv -p python3 --system-site-packages v3nv && . v3nv/bin/activate, so that might be the problem?
For me it ran, but it seems to be running indefinitely: I expected it to finish hours ago, but it has been going since last night.
Current output is something like:
All nodes in cluster: [Instance:i-079b1a8428c700227]
All workers found in cluster: [Instance:i-079b1a8428c700227]
non-preemptable workers found in cluster: [Instance:i-079b1a8428c700227]
Running bin packing for node shapes [Shape(wallTime=3600, memory=34252364185.6, cores=8, disk=53687091200, preemptable=False)] and 1 job(s).
Nodes of type t2.2xlarge to run queued jobs = 1
Estimating 1 nodes of shape Shape(wallTime=3600, memory=34252364185.6, cores=8, disk=53687091200, preemptable=False)
Currently 1 nodes of type t2.2xlarge in cluster
All nodes in cluster: [Instance:i-0e14e4864cd031f99, Instance:i-079b1a8428c700227, Instance:i-04a8d7efd6a04c99d]
All workers found in cluster: [Instance:i-079b1a8428c700227, Instance:i-04a8d7efd6a04c99d]
non-preemptable workers found in cluster: [Instance:i-079b1a8428c700227]
Consolidating state between mesos and provisioner
Worker node at 172.31.27.15 is not reporting executor information
Cluster contains 1 instances
Cluster contains 1 instances of type t2.2xlarge (0 ignored and draining jobs until they can be safely terminated)
Cluster (minus ignored nodes) already at desired size of 1. Nothing to do.
All nodes in cluster: [Instance:i-0e14e4864cd031f99, Instance:i-079b1a8428c700227, Instance:i-04a8d7efd6a04c99d]
All workers found in cluster: [Instance:i-079b1a8428c700227, Instance:i-04a8d7efd6a04c99d]
non-preemptable workers found in cluster: [Instance:i-079b1a8428c700227, Instance:i-04a8d7efd6a04c99d]
Consolidating state between mesos and provisioner
Worker node at 172.31.27.15 is not reporting executor information
Worker node at 172.31.19.225 is not reporting executor information
There are 0 nodes being ignored by the batch system, checking if they can be terminated
I also tried -p python3.7 -- see above comment: https://github.com/DataBiosphere/toil/issues/3034#issuecomment-611573974
My error only happens once a worker starts running jobs, which seems to be further along than you've gotten?
@glennhickey That is quite possible. Either mine also has a sonLib error but is unable to report back and is stopping earlier, or it didn't have a sonLib error and something further down the line kept it from reporting back. I'm not sure why my particular error is occurring. I'll post once I dig into the worker logs.
Hm. I would suspect the resource system of not working:
DEBUG:toil.job:Loading user module ModuleDescriptor(dirPath='//v3nv/lib/python3.7/site-packages', name='cactus.progressive.cactus_progressive', fromVirtualEnv=False).
ERROR:toil.resource:Failed to import user module ModuleDescriptor(dirPath='/tmp/tmpa62_zv04/35de369a1fb88fb032a016ffb3481123', name='cactus.progressive.cactus_progressive', fromVirtualEnv=False) from sys.path (['/usr/local/bin', '/usr/lib/python37.zip', '/usr/lib/python3.7', '/usr/lib/python3.7/lib-dynload', '/usr/local/lib/python3.7/dist-packages', '/usr/lib/python3/dist-packages', '/tmp/tmpa62_zv04/35de369a1fb88fb032a016ffb3481123']).
But it looks like this is the only recent change to it, just removing a warning:
Maybe it has something to do with the workdir refactoring I did? But the resource system just uses mkdtemp to find a place to put things:
@DailyDreaming When Cactus gets stuck for you, are you just seeing https://github.com/ComparativeGenomicsToolkit/cactus/issues/122#issuecomment-610129025 maybe?
@glennhickey I think your problem is that extra leading slash. The Resources it is downloading don't think they come from a virtualenv (fromVirtualEnv=False), because when we decide if a file is coming from a virtualenv we just look for a path prefix:
But for some reason (maybe because the virtualenv is directly under /? Maybe because we do some brittle-looking string manipulation? Maybe because we have a relative import somewhere?) we're getting files that think they come from //v3nv and don't match the prefix check.
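For illustration only (this is not Toil's code, just a minimal sketch of the suspected failure mode, with hypothetical paths mirroring the log above), a doubled leading slash is enough to defeat a plain string-prefix check, and os.path.normpath() deliberately keeps it:

```python
import os.path

# Hypothetical paths mirroring the log above: the virtualenv really lives at
# /v3nv, but the recorded module directory carries a doubled leading slash.
venv_site_packages = "/v3nv/lib/python3.7/site-packages"
module_dir = "//v3nv/lib/python3.7/site-packages"

# A plain string-prefix check fails even though both strings name the same
# directory on Linux:
print(module_dir.startswith(venv_site_packages))   # False

# os.path.normpath() does not help: POSIX allows '//' to be special, so the
# double leading slash is deliberately preserved.
print(os.path.normpath(module_dir))                # //v3nv/lib/python3.7/site-packages

# Resolving both sides (e.g. with os.path.realpath()) collapses the extra
# slash, and then the prefix check passes.
print(os.path.realpath(module_dir)
      .startswith(os.path.realpath(venv_site_packages)))   # True
```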
Try adding a cd after you connect to the cluster, so you put your virtualenv in /root/v3nv instead.
@adamnovak Edit: Mesos might actually be broken? Launching a new cluster and running:
from toil.common import Toil
from toil.job import Job

def trivial(job):
    return True

def main():
    options = Job.Runner.getDefaultOptions('here')
    options.logLevel = 'debug'
    options.clean = 'always'
    options.retryCount = 1
    options.batchSystem = 'mesos'
    options.mesosMasterAddress = 'localhost:5050'
    options.realTimeLogging = True
    # options.caching = False

    with Toil(options) as workflow:
        workflow.start(Job.wrapJobFn(trivial))

if __name__ == '__main__':
    main()
Trying to sanity check myself, but the workflow seems to hang forever. If I can, I'll go in with a debugger.
@adamnovak Changing the path did not help. I get the same error message, but with a single leading slash (using the exact same steps as https://github.com/DataBiosphere/toil/issues/3034#issuecomment-611573974, but in /root/work instead of /):
Job 'RunCactusPreprocessorThenProgressiveDown' 43c15340-2253-4e4a-8921-ef7cf442fc3a continues to exist (i.e. has more to do)
Loaded job 43c15340-2253-4e4a-8921-ef7cf442fc3a
Reading FileInfo(fileID='94a896c6-aa7b-4154-ba64-0a62ce63afb7', ownerID='43c15340-2253-4e4a-8921-ef7cf442fc3a', encrypted=False, version='XbQgvNUfyUSbHLBP_8F5aNAPWRunjlGj', previousVersion='XbQgvNUfyUSbHLBP_8F5aNAPWRunjlGj', content=None, checksum='', _numContentChunks=0) bound to <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7f74410ba910> into stream.
The job seems to have left a log file, indicating failure: 'RunCactusPreprocessorThenProgressiveDown' 43c15340-2253-4e4a-8921-ef7cf442fc3a
Log from job 43c15340-2253-4e4a-8921-ef7cf442fc3a follows:
=========>
INFO:toil.worker:---TOIL WORKER OUTPUT LOG---
INFO:toil:Running Toil version 4.1.0a1-727b7e1c30938c575717b46f0bc9f88758e3b6e9.
DEBUG:toil:Configuration: {'workflowID': '0042506c-86dd-4dcc-8bae-5eb280ff9428', 'workflowAttemptNumber': 0, 'jobStore': 'aws:us-west-2:glennhickey-jobstore5', 'logLevel': 'DEBUG', 'workDir': None, 'noStdOutErr': False, 'stats': False, 'clean': 'onSuccess', 'cleanWorkDir': 'always', 'clusterStats': None, 'restart': False, 'batchSystem': 'mesos', 'disableAutoDeployment': False, 'environment': {}, 'statePollingWait': 1, 'maxLocalJobs': 2, 'manualMemArgs': False, 'parasolCommand': 'parasol', 'parasolMaxBatches': 10000, 'scale': 1, 'linkImports': False, 'moveExports': False, 'mesosMasterAddress': '172.31.24.53:5050', 'kubernetesHostPath': None, 'provisioner': 'aws', 'nodeTypes': ['t2.2xlarge'], 'nodeOptions': None, 'minNodes': [0], 'maxNodes': [4], 'targetTime': 1800, 'betaInertia': 0.1, 'scaleInterval': 60, 'preemptableCompensation': 0.0, 'nodeStorage': 50, 'metrics': False, 'maxPreemptableServiceJobs': 9223372036854775807, 'maxServiceJobs': 9223372036854775807, 'deadlockWait': 3600, 'defaultMemory': 2147483648, 'defaultCores': 1, 'defaultDisk': 2147483648, 'readGlobalFileMutableByDefault': False, 'defaultPreemptable': False, 'maxCores': 9223372036854775807, 'maxMemory': 9223372036854775807, 'maxDisk': 9223372036854775807, 'retryCount': 5, 'enableUnlimitedPreemptableRetries': False, 'maxJobDuration': 9223372036854775807, 'rescueJobsFrequency': 3600, 'disableCaching': True, 'disableChaining': True, 'disableJobStoreChecksumVerification': False, 'maxLogFileSize': 64000, 'writeLogs': None, 'writeLogsGzip': None, 'writeLogsFromAllJobs': False, 'sseKey': None, 'servicePollingInterval': 60, 'useAsync': True, 'forceDockerAppliance': False, 'runCwlInternalJobsOnWorkers': False, 'debugWorker': False, 'disableWorkerOutputCapture': False, 'badWorker': 0.0, 'badWorkerFailInterval': 0.01, 'cwl': False}
DEBUG:toil.deferred:Running for file /var/lib/toil/node-0042506c-86dd-4dcc-8bae-5eb280ff9428-e7fcf2eb18e05686d238ad5f5e8ea6a5/deferred/func8_9shsgf
DEBUG:toil.jobStores.aws.jobStore:Loaded job 43c15340-2253-4e4a-8921-ef7cf442fc3a
DEBUG:toil.worker:Parsed job wrapper
DEBUG:toil.worker:Got a command to run: _toil db6cbec9-88fa-472d-b001-f16155489a5e /root/work/v3nv/lib/python3.7/site-packages cactus.progressive.cactus_progressive False
DEBUG:toil.job:Loading user module ModuleDescriptor(dirPath='/root/work/v3nv/lib/python3.7/site-packages', name='cactus.progressive.cactus_progressive', fromVirtualEnv=False).
ERROR:toil.resource:Failed to import user module ModuleDescriptor(dirPath='/tmp/tmpvg9o9__t/d25453bc830b5c88aa29be6f11b0b58f', name='cactus.progressive.cactus_progressive', fromVirtualEnv=False) from sys.path (['/usr/local/bin', '/usr/lib/python37.zip', '/usr/lib/python3.7', '/usr/lib/python3.7/lib-dynload', '/usr/local/lib/python3.7/dist-packages', '/usr/lib/python3/dist-packages', '/tmp/tmpvg9o9__t/d25453bc830b5c88aa29be6f11b0b58f']).
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/toil/worker.py", line 350, in workerScript
job = Job._loadJob(jobGraph.command, jobStore)
File "/usr/local/lib/python3.7/dist-packages/toil/job.py", line 914, in _loadJob
userModule = cls._loadUserModule(userModule)
File "/usr/local/lib/python3.7/dist-packages/toil/job.py", line 891, in _loadUserModule
return userModule.load()
File "/usr/local/lib/python3.7/dist-packages/toil/resource.py", line 609, in load
return importlib.import_module(module.name)
File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 728, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/tmp/tmpvg9o9__t/d25453bc830b5c88aa29be6f11b0b58f/cactus/progressive/cactus_progressive.py", line 27, in <module>
from cactus.shared.common import getOptionalAttrib
File "/tmp/tmpvg9o9__t/d25453bc830b5c88aa29be6f11b0b58f/cactus/shared/common.py", line 38, in <module>
from sonLib.bioio import popenCatch
ModuleNotFoundError: No module named 'sonLib'
ERROR:toil.worker:Exiting the worker because of a failed job on host ip-172-31-23-150.us-west-2.compute.internal
DEBUG:toil.jobStores.aws.jobStore:Loaded job 43c15340-2253-4e4a-8921-ef7cf442fc3a
WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'RunCactusPreprocessorThenProgressiveDown' 43c15340-2253-4e4a-8921-ef7cf442fc3a with ID 43c15340-2253-4e4a-8921-ef7cf442fc3a to 5
@DailyDreaming Did you actually put any nodes in your cluster? Because that test workflow isn't running a scaler.
I'll see if I can shake the bugs out with current master.
@adamnovak Not explicitly. I'd based it on how sort.py was tested, and tested on a launch-cluster instance. I'll see if I can spin up docker mesos-worker images to act as nodes.
I can run a trivial workflow on Mesos successfully, with current Toil master:
# Don't steal my 8089. Get your own port.
# But with this setup you can get to the Mesos web UI at http://localhost:5050, once you set your browser to use the SOCKS proxy on localhost:8080.
ssh -L8080:localhost:8089 courtyard.gi.ucsc.edu
export TOIL_APPLIANCE_SELF=quay.io/ucsc_cgl/toil:4.1.0a1-aa9812b27d175e60b1f8073ed745a5df9340fcf2-py3.7
toil launch-cluster anovak-test --leaderNodeType=t2.medium --keyPairName anovak@kolossus --zone us-west-2a
toil ssh-cluster --sshOption=-D8089 --zone us-west-2a anovak-test
# No virtualenv, just use system python3.7
cd
cat >workflow.py <<EOF
import argparse
import sys

from toil.common import Toil
from toil.job import Job
from toil.realtimeLogger import RealtimeLogger

def main():
    parser = argparse.ArgumentParser(description=main.__doc__)
    Job.Runner.addToilOptions(parser)
    options = parser.parse_args(sys.argv[1:])

    hello_job = Job.wrapJobFn(succeed)

    with Toil(options) as toil:
        toil.start(hello_job)

def succeed(job):
    RealtimeLogger.info('It me')
    return True

if __name__=="__main__":
    main()
EOF
# Run the workflow. It spends most of its time waiting around for the worker to start; it does take like 5 minutes
python3.7 workflow.py --nodeTypes t2.medium --minNodes 0 --maxNodes 4 --provisioner aws --batchSystem mesos --logDebug --realTimeLogging aws:us-west-2:anovak-toil-test
I can also replicate the trouble with Cactus, in that it can't find sonLib because I guess it isn't shipping it. But hacking Toil on the leader here to just say fromVirtualEnv=True gets Cactus running for me. So something must be wrong there about how we determine where a module comes from.
The problem is that Toil doesn't know it is in a virtualenv:
(inenv) root@ip-172-31-28-10:~/cactus# python3.7 -i
Python 3.7.7 (default, Mar 10 2020, 17:25:08)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import toil
>>> import toil.resource
>>> toil.resource.inVirtualEnv()
False
>>> import sys
>>> sys.executable
'/root/cactus/inenv/bin/python3.7'
>>> sys.real_prefix
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: module 'sys' has no attribute 'real_prefix'
>>> sys.prefix
'/root/cactus/inenv'
I think someone may have swapped out our virtualenv with something that makes a pyvenv instead of a "real" virtualenv, somehow. We should follow https://stackoverflow.com/a/58026969 and look at sys.base_prefix == sys.prefix or the VIRTUAL_ENV environment variable.
As a workaround, setting CONDA_DEFAULT_ENV=literallyanything in the environment might make Cactus work without hacking the Toil code.
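A rough sketch of what the combined check proposed above could look like (illustrative only, with a hypothetical function name; not Toil's actual inVirtualEnv() implementation):

```python
import os
import sys

def in_some_virtualenv():
    """Best-effort check for running inside a virtual environment.

    Illustrative sketch: classic virtualenv (<20) sets sys.real_prefix, while
    venv/pyvenv (and virtualenv >=20) make sys.base_prefix differ from
    sys.prefix; an activated environment of either kind usually exports
    VIRTUAL_ENV.
    """
    return (
        hasattr(sys, "real_prefix")                                # old-style virtualenv
        or getattr(sys, "base_prefix", sys.prefix) != sys.prefix   # venv / pyvenv
        or "VIRTUAL_ENV" in os.environ                             # activated environment
    )

if __name__ == "__main__":
    print(in_some_virtualenv())
```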
(Following up from #3032)
Using the current master: https://github.com/DataBiosphere/toil/commit/727b7e1c30938c575717b46f0bc9f88758e3b6e9
When I do (from python 3.6.9):
then
I get
The sonLib dependency is a bit sketchy: https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/setup.py#L14-L41 But it seems to work fine in Toil 3.24.0.
I'm curious why this is coming up for me but apparently not for @DailyDreaming, who seemed to be able to run it: https://github.com/DataBiosphere/toil/issues/3032#issuecomment-611324292