ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Trouble running on Slurm #299

Open egoltsman opened 4 years ago

egoltsman commented 4 years ago

Hi, up until now I've been running cactus in singleMachine mode, but I've now decided to try the cluster option. I've set TOIL_SLURM_ARGS to the values I normally use for Slurm batch jobs, but the process doesn't even get to the actual job submission; it dies while trying to connect to the Toil server (for credentials??). It smells like a Toil configuration issue, but I can't be sure. I turned on debug logging so that you can hopefully see or guess what's wrong.

cactus jobStore ../samples_SOFTMASKED.input cactus-output/graph.hal --batchSystem Slurm  --binariesMode local --logDebug

cori12 2020-08-21 14:13:09,020 MainThread DEBUG toil.lib.bioio: Root logger is at level 'DEBUG', 'toil' logger at level 'DEBUG'.
cori12 2020-08-21 14:13:09,021 MainThread DEBUG toil.lib.bioio: Root logger is at level 'DEBUG', 'toil' logger at level 'DEBUG'.
cori12 2020-08-21 14:13:09,021 MainThread DEBUG toil.lib.threading: Total machine size: 64 cores
cori12 2020-08-21 14:13:09,021 MainThread DEBUG toil.lib.threading: CPU quota: -1
cori12 2020-08-21 14:13:09,193 MainThread DEBUG rdflib: RDFLib Version: 4.2.2
cori12 2020-08-21 14:13:09,347 MainThread DEBUG toil.jobStores.fileJobStore: Path to job store directory is '/global/projectb/scratch/eugeneg/ASSEMBLY/B.distachyon_PanGenome/TEST_pacBio_assemblies/cactus/on_Slurm/jobStore'.
cori12 2020-08-21 14:13:09,350 MainThread DEBUG toil.jobStores.abstractJobStore: The workflow ID is: 'eac13180-c42f-40e7-8a25-e40479e05586'
No branch length for Ref: setting to 1
No branch length for Bd1_1: setting to 1
No branch length for Bd21_3: setting to 1
No branch length for Bd30_1: setting to 1
No branch length for : setting to 1
No branch length for : setting to 1
No branch length for : setting to 1
cori12 2020-08-21 14:13:09,353 MainThread INFO cactus.progressive.projectWrapper: Using config from path /global/homes/e/eugeneg/.conda/envs/cactus/lib/python3.8/site-packages/cactus/cactus_progressive_config.xml.
cori12 2020-08-21 14:13:09,369 MainThread INFO toil.lib.bioio: xmlRoot = <multi_cactus inputSequences="/global/homes/e/eugeneg/ASSEMBLY_IN_PROGRESS/B.distachyon_PanGenome/TEST_pacBio_assemblies/Bdistachyon_556_v3.0.softmasked.fa.chr1 /global/homes/e/eugeneg/ASSEMBLY_IN_PROGRESS/B.distachyon_PanGenome/TEST_pacBio_assemblies/BdistachyonBd1_1_549_v1.0.softmasked.fa.chr1 /global/homes/e/eugeneg/ASSEMBLY_IN_PROGRESS/B.distachyon_PanGenome/TEST_pacBio_assemblies/BdistachyonBd21_3_537_v1.0.softmasked.fa.chr1 /global/homes/e/eugeneg/ASSEMBLY_IN_PROGRESS/B.distachyon_PanGenome/TEST_pacBio_assemblies/BdistachyonBd30_1_515_v1.0.softmasked.fa.chr1" inputSequenceNames="Ref Bd1_1 Bd21_3 Bd30_1">
        <tree>(Ref:1.0,(Bd1_1:1.0,(Bd21_3:1.0,(Bd30_1:1.0)Anc3:1.0)Anc2:1.0)Anc1:1.0)Anc0;</tree>
        <cactus name="Anc0" experiment_path="/tmp/tmptzxlneja/progressiveAlignment/Anc0/Anc0_experiment.xml" />
        <cactus name="Anc1" experiment_path="/tmp/tmptzxlneja/progressiveAlignment/Anc1/Anc1_experiment.xml" />
        <cactus name="Anc2" experiment_path="/tmp/tmptzxlneja/progressiveAlignment/Anc2/Anc2_experiment.xml" />
        <cactus name="Anc3" experiment_path="/tmp/tmptzxlneja/progressiveAlignment/Anc3/Anc3_experiment.xml" />
</multi_cactus>
cori12 2020-08-21 14:13:09,481 MainThread DEBUG botocore.hooks: Changing event name from creating-client-class.iot-data to creating-client-class.iot-data-plane
cori12 2020-08-21 14:13:09,483 MainThread DEBUG botocore.hooks: Changing event name from before-call.apigateway to before-call.api-gateway
cori12 2020-08-21 14:13:09,484 MainThread DEBUG botocore.hooks: Changing event name from request-created.machinelearning.Predict to request-created.machine-learning.Predict
cori12 2020-08-21 14:13:09,485 MainThread DEBUG botocore.hooks: Changing event name from before-parameter-build.autoscaling.CreateLaunchConfiguration to before-parameter-build.auto-scaling.CreateLaunchConfiguration
cori12 2020-08-21 14:13:09,485 MainThread DEBUG botocore.hooks: Changing event name from before-parameter-build.route53 to before-parameter-build.route-53
cori12 2020-08-21 14:13:09,486 MainThread DEBUG botocore.hooks: Changing event name from request-created.cloudsearchdomain.Search to request-created.cloudsearch-domain.Search
cori12 2020-08-21 14:13:09,487 MainThread DEBUG botocore.hooks: Changing event name from docs.*.autoscaling.CreateLaunchConfiguration.complete-section to docs.*.auto-scaling.CreateLaunchConfiguration.complete-section
cori12 2020-08-21 14:13:09,489 MainThread DEBUG botocore.hooks: Changing event name from before-parameter-build.logs.CreateExportTask to before-parameter-build.cloudwatch-logs.CreateExportTask
cori12 2020-08-21 14:13:09,489 MainThread DEBUG botocore.hooks: Changing event name from docs.*.logs.CreateExportTask.complete-section to docs.*.cloudwatch-logs.CreateExportTask.complete-section
cori12 2020-08-21 14:13:09,489 MainThread DEBUG botocore.hooks: Changing event name from before-parameter-build.cloudsearchdomain.Search to before-parameter-build.cloudsearch-domain.Search
cori12 2020-08-21 14:13:09,489 MainThread DEBUG botocore.hooks: Changing event name from docs.*.cloudsearchdomain.Search.complete-section to docs.*.cloudsearch-domain.Search.complete-section
cori12 2020-08-21 14:13:09,526 MainThread DEBUG botocore.loaders: Loading JSON file: /global/homes/e/eugeneg/.conda/envs/cactus/lib/python3.8/site-packages/boto3/data/s3/2006-03-01/resources-1.json
cori12 2020-08-21 14:13:09,528 MainThread DEBUG botocore.credentials: Looking for credentials via: env
cori12 2020-08-21 14:13:09,528 MainThread DEBUG botocore.credentials: Looking for credentials via: assume-role
cori12 2020-08-21 14:13:09,528 MainThread DEBUG botocore.credentials: Looking for credentials via: assume-role-with-web-identity
cori12 2020-08-21 14:13:09,528 MainThread DEBUG botocore.credentials: Looking for credentials via: sso
cori12 2020-08-21 14:13:09,528 MainThread DEBUG botocore.credentials: Looking for credentials via: shared-credentials-file
cori12 2020-08-21 14:13:09,528 MainThread DEBUG botocore.credentials: Looking for credentials via: custom-process
cori12 2020-08-21 14:13:09,528 MainThread DEBUG botocore.credentials: Looking for credentials via: config-file
cori12 2020-08-21 14:13:09,528 MainThread DEBUG botocore.credentials: Looking for credentials via: ec2-credentials-file
cori12 2020-08-21 14:13:09,528 MainThread DEBUG botocore.credentials: Looking for credentials via: boto-config
cori12 2020-08-21 14:13:09,528 MainThread DEBUG botocore.credentials: Looking for credentials via: container-role
cori12 2020-08-21 14:13:09,529 MainThread DEBUG botocore.credentials: Looking for credentials via: iam-role
cori12 2020-08-21 14:13:09,529 MainThread DEBUG urllib3.connectionpool: Starting new HTTP connection (1): 169.254.169.254:80
cori12 2020-08-21 14:13:10,531 MainThread DEBUG botocore.utils: Caught retryable HTTP exception while making metadata service request to http://169.254.169.254/latest/api/token: Connect timeout on endpoint URL: "http://169.254.169.254/latest/api/token"
Traceback (most recent call last):
  File "/global/homes/e/eugeneg/.conda/envs/cactus/lib/python3.8/site-packages/toil/lib/memoize.py", line 35, in new_f
    return memory[args]
KeyError: (FileJobStore(/global/projectb/scratch/eugeneg/ASSEMBLY/B.distachyon_PanGenome/TEST_pacBio_assemblies/cactus/on_Slurm/jobStore),)

dylandebaun commented 3 years ago

Hi, I'm having the same issue. Did you ever figure out what was going on?

glennhickey commented 3 years ago

A couple of things:

If that doesn't help, all I can suggest is to please open a Toil issue.