DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

Toil doesn't work with Cactus on Mesos #3034

Closed. glennhickey closed this issue 4 years ago.

glennhickey commented 4 years ago

(Following up from #3032)

Using the current master: https://github.com/DataBiosphere/toil/commit/727b7e1c30938c575717b46f0bc9f88758e3b6e9

When I do the following (with Python 3.6.9):

virtualenv -p python3 v3nv && . v3nv/bin/activate && pip install ~/toil.master
toil launch-cluster glenn --leaderNodeType=t2.medium --keyPairName glenn --zone us-west-2a
toil ssh-cluster --zone us-west-2a glenn

then

screen
virtualenv -p python3 --system-site-packages v3nv && . v3nv/bin/activate
git clone https://github.com/ComparativeGenomicsToolkit/cactus.git --recursive && cd cactus && pip install -U .
cactus --nodeTypes t2.2xlarge --minNodes 0 --maxNodes 4 --provisioner aws --batchSystem mesos --logDebug aws:us-west-2:deleteme ./examples/evolverMammals.txt output.hal --logFile evolverMammalsLogfile

I get

    from cactus.shared.common import getOptionalAttrib
      File "/tmp/tmpg4385wbo/2bc6b9a586696b09ef20af951e39264d/cactus/shared/common.py", line 38, in <module>
        from sonLib.bioio import popenCatch
    ModuleNotFoundError: No module named 'sonLib'
    ERROR:toil.worker:Exiting the worker because of a failed job on host ip-172-31-31-44.us-west-2.compute.internal

The sonLib dependency is a bit sketchy (https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/setup.py#L14-L41), but it seems to work fine with Toil 3.24.0.

I'm curious why this comes up for me but apparently not for @DailyDreaming, who seemed to be able to run it: https://github.com/DataBiosphere/toil/issues/3032#issuecomment-611324292


glennhickey commented 4 years ago

I just tried Python 3.7 by making the cluster with

TOIL_APPLIANCE_SELF=quay.io/ucsc_cgl/toil:4.1.0a1-727b7e1c30938c575717b46f0bc9f88758e3b6e9-py3.7

Then on the cluster (identical to https://github.com/DataBiosphere/toil/issues/3032#issuecomment-611324292)

virtualenv -p python3.7 --system-site-packages v3nv && . v3nv/bin/activate
git clone https://github.com/ComparativeGenomicsToolkit/cactus.git --recursive && cd cactus && pip install -U .
cactus --nodeTypes t2.2xlarge --minNodes 0 --maxNodes 4 --provisioner aws --batchSystem mesos --logDebug aws:us-west-2:deleteme ./examples/evolverMammals.txt output.hal --logFile evolverMammalsLogfile

And the problem reproduces:

Log from job 19e5628b-7713-46da-8c93-9ee72a0ce9a4 follows:
=========>
    INFO:toil.worker:---TOIL WORKER OUTPUT LOG---
    INFO:toil:Running Toil version 4.1.0a1-727b7e1c30938c575717b46f0bc9f88758e3b6e9.
    DEBUG:toil:Configuration: {'workflowID': '56faec38-4a89-42a6-81e8-697f900ddc0b', 'workflowAttemptNumber': 0, 'jobStore': 'aws:us-west-2:glennhickey-jobstore3', 'logLevel': 'DEBUG', 'workDir': None, 'noStdOutErr': False, 'stats': False, 'clean': 'onSuccess', 'cleanWorkDir': 'always', 'clusterStats': None, 'restart': False, 'batchSystem': 'mesos', 'disableAutoDeployment': False, 'environment': {}, 'statePollingWait': 1, 'maxLocalJobs': 2, 'manualMemArgs': False, 'parasolCommand': 'parasol', 'parasolMaxBatches': 10000, 'scale': 1, 'linkImports': False, 'moveExports': False, 'mesosMasterAddress': '172.31.17.218:5050', 'kubernetesHostPath': None, 'provisioner': 'aws', 'nodeTypes': ['t2.2xlarge'], 'nodeOptions': None, 'minNodes': [0], 'maxNodes': [4], 'targetTime': 1800, 'betaInertia': 0.1, 'scaleInterval': 60, 'preemptableCompensation': 0.0, 'nodeStorage': 50, 'metrics': False, 'maxPreemptableServiceJobs': 9223372036854775807, 'maxServiceJobs': 9223372036854775807, 'deadlockWait': 3600, 'defaultMemory': 2147483648, 'defaultCores': 1, 'defaultDisk': 2147483648, 'readGlobalFileMutableByDefault': False, 'defaultPreemptable': False, 'maxCores': 9223372036854775807, 'maxMemory': 9223372036854775807, 'maxDisk': 9223372036854775807, 'retryCount': 5, 'enableUnlimitedPreemptableRetries': False, 'maxJobDuration': 9223372036854775807, 'rescueJobsFrequency': 3600, 'disableCaching': True, 'disableChaining': True, 'disableJobStoreChecksumVerification': False, 'maxLogFileSize': 64000, 'writeLogs': None, 'writeLogsGzip': None, 'writeLogsFromAllJobs': False, 'sseKey': None, 'servicePollingInterval': 60, 'useAsync': True, 'forceDockerAppliance': False, 'runCwlInternalJobsOnWorkers': False, 'debugWorker': False, 'disableWorkerOutputCapture': False, 'badWorker': 0.0, 'badWorkerFailInterval': 0.01, 'cwl': False}
    DEBUG:toil.deferred:Running for file /var/lib/toil/node-56faec38-4a89-42a6-81e8-697f900ddc0b-e7fcf2eb18e05686d238ad5f5e8ea6a5/deferred/funcb_x30pns
    DEBUG:toil.jobStores.aws.jobStore:Loaded job 19e5628b-7713-46da-8c93-9ee72a0ce9a4
    DEBUG:toil.worker:Parsed job wrapper
    DEBUG:toil.jobStores.aws.jobStore:Updating job 19e5628b-7713-46da-8c93-9ee72a0ce9a4
    DEBUG:toil.worker:Got a command to run: _toil 1608ea10-3ce3-4159-9d76-364ca577f340 //v3nv/lib/python3.7/site-packages cactus.progressive.cactus_progressive False
    DEBUG:toil.job:Loading user module ModuleDescriptor(dirPath='//v3nv/lib/python3.7/site-packages', name='cactus.progressive.cactus_progressive', fromVirtualEnv=False).
    ERROR:toil.resource:Failed to import user module ModuleDescriptor(dirPath='/tmp/tmpa62_zv04/35de369a1fb88fb032a016ffb3481123', name='cactus.progressive.cactus_progressive', fromVirtualEnv=False) from sys.path (['/usr/local/bin', '/usr/lib/python37.zip', '/usr/lib/python3.7', '/usr/lib/python3.7/lib-dynload', '/usr/local/lib/python3.7/dist-packages', '/usr/lib/python3/dist-packages', '/tmp/tmpa62_zv04/35de369a1fb88fb032a016ffb3481123']).
    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/dist-packages/toil/worker.py", line 350, in workerScript
        job = Job._loadJob(jobGraph.command, jobStore)
      File "/usr/local/lib/python3.7/dist-packages/toil/job.py", line 914, in _loadJob
        userModule = cls._loadUserModule(userModule)
      File "/usr/local/lib/python3.7/dist-packages/toil/job.py", line 891, in _loadUserModule
        return userModule.load()
      File "/usr/local/lib/python3.7/dist-packages/toil/resource.py", line 609, in load
        return importlib.import_module(module.name)
      File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
      File "<frozen importlib._bootstrap>", line 983, in _find_and_load
      File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
      File "<frozen importlib._bootstrap_external>", line 728, in exec_module
      File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
      File "/tmp/tmpa62_zv04/35de369a1fb88fb032a016ffb3481123/cactus/progressive/cactus_progressive.py", line 27, in <module>
        from cactus.shared.common import getOptionalAttrib
      File "/tmp/tmpa62_zv04/35de369a1fb88fb032a016ffb3481123/cactus/shared/common.py", line 38, in <module>
        from sonLib.bioio import popenCatch
    ModuleNotFoundError: No module named 'sonLib'
    ERROR:toil.worker:Exiting the worker because of a failed job on host ip-172-31-25-89.us-west-2.compute.internal
    DEBUG:toil.jobStores.aws.jobStore:Loaded job 19e5628b-7713-46da-8c93-9ee72a0ce9a4
    WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4 with ID 19e5628b-7713-46da-8c93-9ee72a0ce9a4 to 3
<=========
Added job: 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4 to active jobs
Built the jobs list, currently have 1 jobs to update and 0 jobs issued
Updating status of job 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4 with ID 19e5628b-7713-46da-8c93-9ee72a0ce9a4: with result status: 0
Queueing the job command: _toil_worker RunCactusPreprocessorThenProgressiveDown aws:us-west-2:glennhickey-jobstore3 19e5628b-7713-46da-8c93-9ee72a0ce9a4 with job id: 3 ...
... queued
Issued job 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4 with job batch system ID: 3 and cores: 1, disk: 2.0 G, and memory: 3.3 G
Got offer 1a42bc92-8952-40e5-9ffb-a4fe8c9e7a0e-O3 for a non-preemptable agent with 31145.00 MiB memory, 8.00 core(s) and 42121.00 MiB of disk.
Preparing to launch Mesos task 3 with 1.00 cores, 3400.00 MiB memory, and 2048.00 MiB disk using offer 1a42bc92-8952-40e5-9ffb-a4fe8c9e7a0e-O3 ...
Launched Mesos task 3.
Job 3 is in state 'TASK_RUNNING' due to reason '{}'.
Job 3 is in state 'TASK_FINISHED' due to reason '{}'.
Job 3 ended with status 0, took 1.6182279586791992 seconds.
Job ended: 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4
Job 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4 continues to exist (i.e. has more to do)
Loaded job 19e5628b-7713-46da-8c93-9ee72a0ce9a4
Reading FileInfo(fileID='d90b771e-3d31-4b85-9aaf-cd3f5c66ef34', ownerID='19e5628b-7713-46da-8c93-9ee72a0ce9a4', encrypted=False, version='Gz00.mzCG8SdXkx8qoumKwRGEsLVZ.W9', previousVersion='Gz00.mzCG8SdXkx8qoumKwRGEsLVZ.W9', content=None, checksum='', _numContentChunks=0) bound to <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7fcb00706250> into stream.
The job seems to have left a log file, indicating failure: 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4
Log from job 19e5628b-7713-46da-8c93-9ee72a0ce9a4 follows:
=========>
    INFO:toil.worker:---TOIL WORKER OUTPUT LOG---
    INFO:toil:Running Toil version 4.1.0a1-727b7e1c30938c575717b46f0bc9f88758e3b6e9.
    DEBUG:toil:Configuration: {'workflowID': '56faec38-4a89-42a6-81e8-697f900ddc0b', 'workflowAttemptNumber': 0, 'jobStore': 'aws:us-west-2:glennhickey-jobstore3', 'logLevel': 'DEBUG', 'workDir': None, 'noStdOutErr': False, 'stats': False, 'clean': 'onSuccess', 'cleanWorkDir': 'always', 'clusterStats': None, 'restart': False, 'batchSystem': 'mesos', 'disableAutoDeployment': False, 'environment': {}, 'statePollingWait': 1, 'maxLocalJobs': 2, 'manualMemArgs': False, 'parasolCommand': 'parasol', 'parasolMaxBatches': 10000, 'scale': 1, 'linkImports': False, 'moveExports': False, 'mesosMasterAddress': '172.31.17.218:5050', 'kubernetesHostPath': None, 'provisioner': 'aws', 'nodeTypes': ['t2.2xlarge'], 'nodeOptions': None, 'minNodes': [0], 'maxNodes': [4], 'targetTime': 1800, 'betaInertia': 0.1, 'scaleInterval': 60, 'preemptableCompensation': 0.0, 'nodeStorage': 50, 'metrics': False, 'maxPreemptableServiceJobs': 9223372036854775807, 'maxServiceJobs': 9223372036854775807, 'deadlockWait': 3600, 'defaultMemory': 2147483648, 'defaultCores': 1, 'defaultDisk': 2147483648, 'readGlobalFileMutableByDefault': False, 'defaultPreemptable': False, 'maxCores': 9223372036854775807, 'maxMemory': 9223372036854775807, 'maxDisk': 9223372036854775807, 'retryCount': 5, 'enableUnlimitedPreemptableRetries': False, 'maxJobDuration': 9223372036854775807, 'rescueJobsFrequency': 3600, 'disableCaching': True, 'disableChaining': True, 'disableJobStoreChecksumVerification': False, 'maxLogFileSize': 64000, 'writeLogs': None, 'writeLogsGzip': None, 'writeLogsFromAllJobs': False, 'sseKey': None, 'servicePollingInterval': 60, 'useAsync': True, 'forceDockerAppliance': False, 'runCwlInternalJobsOnWorkers': False, 'debugWorker': False, 'disableWorkerOutputCapture': False, 'badWorker': 0.0, 'badWorkerFailInterval': 0.01, 'cwl': False}
    DEBUG:toil.deferred:Running for file /var/lib/toil/node-56faec38-4a89-42a6-81e8-697f900ddc0b-e7fcf2eb18e05686d238ad5f5e8ea6a5/deferred/funcwu25s7ef
    DEBUG:toil.jobStores.aws.jobStore:Loaded job 19e5628b-7713-46da-8c93-9ee72a0ce9a4
    DEBUG:toil.worker:Parsed job wrapper
    DEBUG:toil.jobStores.aws.jobStore:Updating job 19e5628b-7713-46da-8c93-9ee72a0ce9a4
    DEBUG:toil.worker:Got a command to run: _toil 1608ea10-3ce3-4159-9d76-364ca577f340 //v3nv/lib/python3.7/site-packages cactus.progressive.cactus_progressive False
    DEBUG:toil.job:Loading user module ModuleDescriptor(dirPath='//v3nv/lib/python3.7/site-packages', name='cactus.progressive.cactus_progressive', fromVirtualEnv=False).
    ERROR:toil.resource:Failed to import user module ModuleDescriptor(dirPath='/tmp/tmpa62_zv04/35de369a1fb88fb032a016ffb3481123', name='cactus.progressive.cactus_progressive', fromVirtualEnv=False) from sys.path (['/usr/local/bin', '/usr/lib/python37.zip', '/usr/lib/python3.7', '/usr/lib/python3.7/lib-dynload', '/usr/local/lib/python3.7/dist-packages', '/usr/lib/python3/dist-packages', '/tmp/tmpa62_zv04/35de369a1fb88fb032a016ffb3481123']).
    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/dist-packages/toil/worker.py", line 350, in workerScript
        job = Job._loadJob(jobGraph.command, jobStore)
      File "/usr/local/lib/python3.7/dist-packages/toil/job.py", line 914, in _loadJob
        userModule = cls._loadUserModule(userModule)
      File "/usr/local/lib/python3.7/dist-packages/toil/job.py", line 891, in _loadUserModule
        return userModule.load()
      File "/usr/local/lib/python3.7/dist-packages/toil/resource.py", line 609, in load
        return importlib.import_module(module.name)
      File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
      File "<frozen importlib._bootstrap>", line 983, in _find_and_load
      File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
      File "<frozen importlib._bootstrap_external>", line 728, in exec_module
      File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
      File "/tmp/tmpa62_zv04/35de369a1fb88fb032a016ffb3481123/cactus/progressive/cactus_progressive.py", line 27, in <module>
        from cactus.shared.common import getOptionalAttrib
      File "/tmp/tmpa62_zv04/35de369a1fb88fb032a016ffb3481123/cactus/shared/common.py", line 38, in <module>
        from sonLib.bioio import popenCatch
    ModuleNotFoundError: No module named 'sonLib'
    ERROR:toil.worker:Exiting the worker because of a failed job on host ip-172-31-25-89.us-west-2.compute.internal
    DEBUG:toil.jobStores.aws.jobStore:Loaded job 19e5628b-7713-46da-8c93-9ee72a0ce9a4
    WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4 with ID 19e5628b-7713-46da-8c93-9ee72a0ce9a4 to 2
<=========
Added job: 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4 to active jobs
Built the jobs list, currently have 1 jobs to update and 0 jobs issued
Updating status of job 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4 with ID 19e5628b-7713-46da-8c93-9ee72a0ce9a4: with result status: 0
Queueing the job command: _toil_worker RunCactusPreprocessorThenProgressiveDown aws:us-west-2:glennhickey-jobstore3 19e5628b-7713-46da-8c93-9ee72a0ce9a4 with job id: 4 ...
... queued
Issued job 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4 with job batch system ID: 4 and cores: 1, disk: 2.0 G, and memory: 3.3 G
Got offer 1a42bc92-8952-40e5-9ffb-a4fe8c9e7a0e-O4 for a non-preemptable agent with 31145.00 MiB memory, 8.00 core(s) and 42121.00 MiB of disk.
Preparing to launch Mesos task 4 with 1.00 cores, 3400.00 MiB memory, and 2048.00 MiB disk using offer 1a42bc92-8952-40e5-9ffb-a4fe8c9e7a0e-O4 ...
Launched Mesos task 4.
Job 4 is in state 'TASK_RUNNING' due to reason '{}'.
Job 4 is in state 'TASK_FINISHED' due to reason '{}'.
Job 4 ended with status 0, took 1.557612419128418 seconds.
Job ended: 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4
Job 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4 continues to exist (i.e. has more to do)
Loaded job 19e5628b-7713-46da-8c93-9ee72a0ce9a4
Reading FileInfo(fileID='ffd2f094-59e6-4f4d-844c-79d28d4e21eb', ownerID='19e5628b-7713-46da-8c93-9ee72a0ce9a4', encrypted=False, version='RbZPR4AeifqzFS46NhPEOEhLHzWjOhIu', previousVersion='RbZPR4AeifqzFS46NhPEOEhLHzWjOhIu', content=None, checksum='', _numContentChunks=0) bound to <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7fcb00706250> into stream.
The job seems to have left a log file, indicating failure: 'RunCactusPreprocessorThenProgressiveDown' 19e5628b-7713-46da-8c93-9ee72a0ce9a4
Log from job 19e5628b-7713-46da-8c93-9ee72a0ce9a4 follows:
=========>
DailyDreaming commented 4 years ago

@glennhickey I used virtualenv -p python3.7 --system-site-packages v3nv && . v3nv/bin/activate instead of virtualenv -p python3 --system-site-packages v3nv && . v3nv/bin/activate, so that might be the problem?

For me it ran, but it seems to be running indefinitely. I expected it to finish hours ago, but it's been going since last night.

DailyDreaming commented 4 years ago

Current output is something like:

All nodes in cluster: [Instance:i-079b1a8428c700227]
All workers found in cluster: [Instance:i-079b1a8428c700227]
non-preemptable workers found in cluster: [Instance:i-079b1a8428c700227]
Running bin packing for node shapes [Shape(wallTime=3600, memory=34252364185.6, cores=8, disk=53687091200, preemptable=False)] and 1 job(s).
Nodes of type t2.2xlarge to run queued jobs = 1
Estimating 1 nodes of shape Shape(wallTime=3600, memory=34252364185.6, cores=8, disk=53687091200, preemptable=False)
Currently 1 nodes of type t2.2xlarge in cluster
All nodes in cluster: [Instance:i-0e14e4864cd031f99, Instance:i-079b1a8428c700227, Instance:i-04a8d7efd6a04c99d]
All workers found in cluster: [Instance:i-079b1a8428c700227, Instance:i-04a8d7efd6a04c99d]
non-preemptable workers found in cluster: [Instance:i-079b1a8428c700227]
Consolidating state between mesos and provisioner
Worker node at 172.31.27.15 is not reporting executor information
Cluster contains 1 instances
Cluster contains 1 instances of type t2.2xlarge (0 ignored and draining jobs until they can be safely terminated)
Cluster (minus ignored nodes) already at desired size of 1. Nothing to do.
All nodes in cluster: [Instance:i-0e14e4864cd031f99, Instance:i-079b1a8428c700227, Instance:i-04a8d7efd6a04c99d]
All workers found in cluster: [Instance:i-079b1a8428c700227, Instance:i-04a8d7efd6a04c99d]
non-preemptable workers found in cluster: [Instance:i-079b1a8428c700227, Instance:i-04a8d7efd6a04c99d]
Consolidating state between mesos and provisioner
Worker node at 172.31.27.15 is not reporting executor information
Worker node at 172.31.19.225 is not reporting executor information
There are 0 nodes being ignored by the batch system, checking if they can be terminated
glennhickey commented 4 years ago

I also tried -p python3.7 -- see above comment: https://github.com/DataBiosphere/toil/issues/3034#issuecomment-611573974

glennhickey commented 4 years ago

My error only happens once a worker starts running jobs, which seems to be further along than you've gotten?

DailyDreaming commented 4 years ago

@glennhickey That is quite possible. Either mine also has a sonLib error but is unable to report back and is stopping earlier, or it didn't have a sonLib error and something further down the line kept it from reporting back. I'm not sure why my particular error is occurring. I'll post once I dig into the worker logs.

adamnovak commented 4 years ago

Hm. I would suspect the resource system of not working:

DEBUG:toil.job:Loading user module ModuleDescriptor(dirPath='//v3nv/lib/python3.7/site-packages', name='cactus.progressive.cactus_progressive', fromVirtualEnv=False).

ERROR:toil.resource:Failed to import user module ModuleDescriptor(dirPath='/tmp/tmpa62_zv04/35de369a1fb88fb032a016ffb3481123', name='cactus.progressive.cactus_progressive', fromVirtualEnv=False) from sys.path (['/usr/local/bin', '/usr/lib/python37.zip', '/usr/lib/python3.7', '/usr/lib/python3.7/lib-dynload', '/usr/local/lib/python3.7/dist-packages', '/usr/lib/python3/dist-packages', '/tmp/tmpa62_zv04/35de369a1fb88fb032a016ffb3481123']).

But it looks like this is the only recent change to it, just removing a warning:

https://github.com/DataBiosphere/toil/commit/2d4e0cdc188a6c0d4c21a6a968bf80ec6b0473bc#diff-48df337a3ff448f830d34e00cd4cfcaf

Maybe it has something to do with the workdir refactoring I did? But the resource system just uses mkdtemp to find a place to put things:

https://github.com/DataBiosphere/toil/blob/727b7e1c30938c575717b46f0bc9f88758e3b6e9/src/toil/resource.py#L113-L115
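
(For context, a schematic of the failure mode in the logs above; this is an illustration only, not Toil's actual code, and fakecactus is a made-up stand-in package. The shipped user package is unpacked into a mkdtemp directory and imported from there, so anything it imports, like sonLib, must already be resolvable from the worker's pre-existing sys.path.)

import importlib
import os
import sys
import tempfile

# Stage a stand-in user package into a fresh temp dir, roughly the way
# auto-deployment unpacks the shipped module on a worker (illustration only).
staging_dir = tempfile.mkdtemp()
pkg_dir = os.path.join(staging_dir, 'fakecactus')
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, '__init__.py'), 'w') as f:
    # The stand-in package imports something that only exists in the leader's
    # virtualenv and was never shipped (here: sonLib, assumed not installed).
    f.write('from sonLib.bioio import popenCatch\n')

sys.path.append(staging_dir)
try:
    importlib.import_module('fakecactus')
except ModuleNotFoundError as e:
    print(e)  # mirrors the worker log: No module named 'sonLib'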

adamnovak commented 4 years ago

@DailyDreaming When Cactus gets stuck for you, are you just seeing https://github.com/ComparativeGenomicsToolkit/cactus/issues/122#issuecomment-610129025 maybe?

adamnovak commented 4 years ago

@glennhickey I think your problem is that extra leading slash. The Resources it is downloading don't think they come from a virtualenv (fromVirtualEnv=False), because when we decide whether a file comes from a virtualenv we just look for a path prefix:

https://github.com/DataBiosphere/toil/blob/727b7e1c30938c575717b46f0bc9f88758e3b6e9/src/toil/resource.py#L430

But for some reason (maybe because the virtualenv is directly under /? Maybe because we do some brittle-looking string manipulation? Maybe because we have a relative import somewhere?) we're getting files that think they come from //v3nv and don't match the prefix check.

Try adding a cd after you connect to the cluster, so you put your virtualenv in /root/v3nv instead.
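
(A toy illustration of the prefix-mismatch hypothesis above, not the actual check in resource.py: a startswith-style test is defeated by a doubled leading slash even though both strings name the same location on Linux.)

# Toy illustration of the hypothesis, not Toil's actual resource.py code.
venv_site_packages = '/v3nv/lib/python3.7/site-packages'   # what the prefix test expects
module_dir = '//v3nv/lib/python3.7/site-packages'          # what the worker was handed

print(module_dir.startswith(venv_site_packages))                 # False: prefix test fails
print(module_dir.lstrip('/') == venv_site_packages.lstrip('/'))  # True: same path otherwise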

DailyDreaming commented 4 years ago

@adamnovak

Mesos might actually be broken? I launched a new cluster and ran:

from toil.common import Toil
from toil.job import Job

def trivial(job):
    return True

def main():
    options = Job.Runner.getDefaultOptions('here')
    options.logLevel = 'debug'
    options.clean = 'always'
    options.retryCount = 1
    options.batchSystem = 'mesos'
    options.mesosMasterAddress = 'localhost:5050'
    options.realTimeLogging = True
    # options.caching = False

    with Toil(options) as workflow:
        workflow.start(Job.wrapJobFn(trivial))

if __name__ == '__main__':
    main()

I'm trying to sanity-check myself, but the workflow seems to hang forever. If I can, I'll go in with a debugger.

glennhickey commented 4 years ago

@adamnovak Changing the path did not help. I get the same error message, but with a single leading slash (using the exact same steps as https://github.com/DataBiosphere/toil/issues/3034#issuecomment-611573974, but in /root/work instead of /):

Job 'RunCactusPreprocessorThenProgressiveDown' 43c15340-2253-4e4a-8921-ef7cf442fc3a continues to exist (i.e. has more to do)
Loaded job 43c15340-2253-4e4a-8921-ef7cf442fc3a
Reading FileInfo(fileID='94a896c6-aa7b-4154-ba64-0a62ce63afb7', ownerID='43c15340-2253-4e4a-8921-ef7cf442fc3a', encrypted=False, version='XbQgvNUfyUSbHLBP_8F5aNAPWRunjlGj', previousVersion='XbQgvNUfyUSbHLBP_8F5aNAPWRunjlGj', content=None, checksum='', _numContentChunks=0) bound to <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7f74410ba910> into stream.
The job seems to have left a log file, indicating failure: 'RunCactusPreprocessorThenProgressiveDown' 43c15340-2253-4e4a-8921-ef7cf442fc3a
Log from job 43c15340-2253-4e4a-8921-ef7cf442fc3a follows:
=========>
    INFO:toil.worker:---TOIL WORKER OUTPUT LOG---
    INFO:toil:Running Toil version 4.1.0a1-727b7e1c30938c575717b46f0bc9f88758e3b6e9.
    DEBUG:toil:Configuration: {'workflowID': '0042506c-86dd-4dcc-8bae-5eb280ff9428', 'workflowAttemptNumber': 0, 'jobStore': 'aws:us-west-2:glennhickey-jobstore5', 'logLevel': 'DEBUG', 'workDir': None, 'noStdOutErr': False, 'stats': False, 'clean': 'onSuccess', 'cleanWorkDir': 'always', 'clusterStats': None, 'restart': False, 'batchSystem': 'mesos', 'disableAutoDeployment': False, 'environment': {}, 'statePollingWait': 1, 'maxLocalJobs': 2, 'manualMemArgs': False, 'parasolCommand': 'parasol', 'parasolMaxBatches': 10000, 'scale': 1, 'linkImports': False, 'moveExports': False, 'mesosMasterAddress': '172.31.24.53:5050', 'kubernetesHostPath': None, 'provisioner': 'aws', 'nodeTypes': ['t2.2xlarge'], 'nodeOptions': None, 'minNodes': [0], 'maxNodes': [4], 'targetTime': 1800, 'betaInertia': 0.1, 'scaleInterval': 60, 'preemptableCompensation': 0.0, 'nodeStorage': 50, 'metrics': False, 'maxPreemptableServiceJobs': 9223372036854775807, 'maxServiceJobs': 9223372036854775807, 'deadlockWait': 3600, 'defaultMemory': 2147483648, 'defaultCores': 1, 'defaultDisk': 2147483648, 'readGlobalFileMutableByDefault': False, 'defaultPreemptable': False, 'maxCores': 9223372036854775807, 'maxMemory': 9223372036854775807, 'maxDisk': 9223372036854775807, 'retryCount': 5, 'enableUnlimitedPreemptableRetries': False, 'maxJobDuration': 9223372036854775807, 'rescueJobsFrequency': 3600, 'disableCaching': True, 'disableChaining': True, 'disableJobStoreChecksumVerification': False, 'maxLogFileSize': 64000, 'writeLogs': None, 'writeLogsGzip': None, 'writeLogsFromAllJobs': False, 'sseKey': None, 'servicePollingInterval': 60, 'useAsync': True, 'forceDockerAppliance': False, 'runCwlInternalJobsOnWorkers': False, 'debugWorker': False, 'disableWorkerOutputCapture': False, 'badWorker': 0.0, 'badWorkerFailInterval': 0.01, 'cwl': False}
    DEBUG:toil.deferred:Running for file /var/lib/toil/node-0042506c-86dd-4dcc-8bae-5eb280ff9428-e7fcf2eb18e05686d238ad5f5e8ea6a5/deferred/func8_9shsgf
    DEBUG:toil.jobStores.aws.jobStore:Loaded job 43c15340-2253-4e4a-8921-ef7cf442fc3a
    DEBUG:toil.worker:Parsed job wrapper
    DEBUG:toil.worker:Got a command to run: _toil db6cbec9-88fa-472d-b001-f16155489a5e /root/work/v3nv/lib/python3.7/site-packages cactus.progressive.cactus_progressive False
    DEBUG:toil.job:Loading user module ModuleDescriptor(dirPath='/root/work/v3nv/lib/python3.7/site-packages', name='cactus.progressive.cactus_progressive', fromVirtualEnv=False).
    ERROR:toil.resource:Failed to import user module ModuleDescriptor(dirPath='/tmp/tmpvg9o9__t/d25453bc830b5c88aa29be6f11b0b58f', name='cactus.progressive.cactus_progressive', fromVirtualEnv=False) from sys.path (['/usr/local/bin', '/usr/lib/python37.zip', '/usr/lib/python3.7', '/usr/lib/python3.7/lib-dynload', '/usr/local/lib/python3.7/dist-packages', '/usr/lib/python3/dist-packages', '/tmp/tmpvg9o9__t/d25453bc830b5c88aa29be6f11b0b58f']).
    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/dist-packages/toil/worker.py", line 350, in workerScript
        job = Job._loadJob(jobGraph.command, jobStore)
      File "/usr/local/lib/python3.7/dist-packages/toil/job.py", line 914, in _loadJob
        userModule = cls._loadUserModule(userModule)
      File "/usr/local/lib/python3.7/dist-packages/toil/job.py", line 891, in _loadUserModule
        return userModule.load()
      File "/usr/local/lib/python3.7/dist-packages/toil/resource.py", line 609, in load
        return importlib.import_module(module.name)
      File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
      File "<frozen importlib._bootstrap>", line 983, in _find_and_load
      File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
      File "<frozen importlib._bootstrap_external>", line 728, in exec_module
      File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
      File "/tmp/tmpvg9o9__t/d25453bc830b5c88aa29be6f11b0b58f/cactus/progressive/cactus_progressive.py", line 27, in <module>
        from cactus.shared.common import getOptionalAttrib
      File "/tmp/tmpvg9o9__t/d25453bc830b5c88aa29be6f11b0b58f/cactus/shared/common.py", line 38, in <module>
        from sonLib.bioio import popenCatch
    ModuleNotFoundError: No module named 'sonLib'
    ERROR:toil.worker:Exiting the worker because of a failed job on host ip-172-31-23-150.us-west-2.compute.internal
    DEBUG:toil.jobStores.aws.jobStore:Loaded job 43c15340-2253-4e4a-8921-ef7cf442fc3a
    WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'RunCactusPreprocessorThenProgressiveDown' 43c15340-2253-4e4a-8921-ef7cf442fc3a with ID 43c15340-2253-4e4a-8921-ef7cf442fc3a to 5
adamnovak commented 4 years ago

@DailyDreaming Did you actually put any nodes in your cluster? Because that test workflow isn't running a scaler.

I'll see if I can shake the bugs out with current master.

DailyDreaming commented 4 years ago

@adamnovak Not explicitly. I based it on how sort.py was tested, and tested it on a launch-cluster instance. I'll see if I can spin up Docker mesos-worker images to act as nodes.

adamnovak commented 4 years ago

I can run a trivial workflow on Mesos successfully, with current Toil master:

# Don't steal my 8089. Get your own port.
# But with this setup you can get the Mesos web UI at http://localhost:5050, once you set your browser to use the SOCKS proxy on localhost:8080.
ssh -L8080:localhost:8089 courtyard.gi.ucsc.edu
export TOIL_APPLIANCE_SELF=quay.io/ucsc_cgl/toil:4.1.0a1-aa9812b27d175e60b1f8073ed745a5df9340fcf2-py3.7
toil launch-cluster anovak-test --leaderNodeType=t2.medium --keyPairName anovak@kolossus --zone us-west-2a
toil ssh-cluster --sshOption=-D8089 --zone us-west-2a anovak-test

# No virtualenv, just use system python3.7
cd
cat >workflow.py <<EOF
import argparse
import sys

from toil.common import Toil
from toil.job import Job
from toil.realtimeLogger import RealtimeLogger

def main():
    parser = argparse.ArgumentParser(description=main.__doc__)
    Job.Runner.addToilOptions(parser)
    options = parser.parse_args(sys.argv[1:])

    hello_job = Job.wrapJobFn(succeed)

    with Toil(options) as toil:
        toil.start(hello_job)

def succeed(job):
    RealtimeLogger.info('It me')
    return True

if __name__=="__main__":
    main()
EOF

# Run the workflow. It spends most of its time waiting around for the worker to start; it does take like 5 minutes
python3.7 workflow.py --nodeTypes t2.medium --minNodes 0 --maxNodes 4 --provisioner aws --batchSystem mesos --logDebug --realTimeLogging aws:us-west-2:anovak-toil-test
adamnovak commented 4 years ago

I can also replicate the trouble with Cactus, in that it can't find sonLib, presumably because Toil isn't shipping it. But hacking Toil on the leader here to just say fromVirtualEnv=True gets Cactus running for me, so something must be wrong with how we determine where a module comes from.
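
(The hack here was an edit to Toil's source on the leader; a rough sketch of the same idea from outside the code, assuming the module-level toil.resource.inVirtualEnv() function is what gates the fromVirtualEnv decision, might look like this.)

# Sketch only: force Toil's virtualenv detection to answer "yes" so the whole
# site-packages directory (including sonLib) gets shipped to the workers.
# Assumes toil.resource.inVirtualEnv() is what gates the fromVirtualEnv flag.
import toil.resource

toil.resource.inVirtualEnv = lambda: True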

adamnovak commented 4 years ago

The problem is that Toil doesn't know it is in a virtualenv:

(inenv) root@ip-172-31-28-10:~/cactus# python3.7 -i
Python 3.7.7 (default, Mar 10 2020, 17:25:08) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import toil
>>> import toil.resource
>>> toil.resource.inVirtualEnv()
False
>>> import sys
>>> sys.executable
'/root/cactus/inenv/bin/python3.7'
>>> sys.real_prefix
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'sys' has no attribute 'real_prefix'
>>> sys.prefix
'/root/cactus/inenv'
adamnovak commented 4 years ago

I think someone may have swapped out our virtualenv with something that makes a pyvenv instead of a "real" virtualenv, somehow. We should follow https://stackoverflow.com/a/58026969 and look at whether sys.base_prefix == sys.prefix, or at the VIRTUAL_ENV environment variable, instead of relying on sys.real_prefix.
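
(A minimal sketch of that detection, checking any of the usual signals; not a patch against resource.py itself.)

import os
import sys

def in_virtual_env():
    # Classic virtualenv (<20) sets sys.real_prefix; pyvenv and virtualenv >=20
    # make sys.prefix differ from sys.base_prefix; bin/activate exports VIRTUAL_ENV.
    return (hasattr(sys, 'real_prefix')
            or sys.prefix != getattr(sys, 'base_prefix', sys.prefix)
            or bool(os.environ.get('VIRTUAL_ENV')))

print(in_virtual_env())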

As a workaround, setting CONDA_DEFAULT_ENV=literallyanything in the environment might make Cactus work without hacking the Toil code.