broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Intermittent Workflow Failure: "Google credentials are invalid: connect timed out" #1886

Closed kcibul closed 7 years ago

kcibul commented 7 years ago

I ran 123 workflows as a single submission in FireCloud, which means they all used the same WDL and identical inputs apart from the BAM and sample name. They all started at around the same time.

Three of them failed with this error:

Google credentials are invalid: connect timed out

Here is the deep-link to the FireCloud run:

https://api.firecloud.org/api/workspaces/engle-macarthur-ccdd/genomes-reprocessing/submissions/c7af7e06-a435-44ec-8466-124ad8e1bcaf/workflows/a714b11b-0162-4585-afa5-abbd7433af51

Here is the full metadata for the failed workflow:

{ "workflowName": "BamToUnmappedBams", "submittedFiles": { "inputs": "{\"BamToUnmappedBams.input_bam\":\"gs://fc-4c1c7765-2de2-4214-ac41-dc10bbcbb55b/batch04/S64-2_Illumina.bam\"}", "workflow": "task RevertSam {\n File input_bam\n String revert_bam_name\n Int disk_size\n\n # TODO: why is SORT_ORDER=coordinate set below since we sort it again in the next step?\n # TODO: why did we need this line?\n # OUTPUT_MAP=${output_map} \\n command {\n java -Xmx3000m -jar /usr/gitc/picard.jar \\n RevertSam \\n INPUT=${input_bam} \\n OUTPUT=${revert_bam_name} \\n VALIDATION_STRINGENCY=LENIENT \\n ATTRIBUTE_TO_CLEAR=FT \\n ATTRIBUTE_TO_CLEAR=XS \\n SORT_ORDER=queryname \\n MAX_RECORDS_IN_RAM=1000000 \n }\n runtime {\n docker: \"broadinstitute/genomes-in-the-cloud:2.2.3-1469027018\"\n disks: \"local-disk \" + disk_size + \" HDD\"\n memory: \"3500 MB\"\n }\n output {\n File unmapped_bam = \"${revert_bam_name}\"\n }\n}\n\ntask SortSam {\n File input_bam\n String sorted_bam_name\n Int disk_size\n\n # TODO: why not use samtools sort as it is multi-threaded?\n command {\n java -Xmx3000m -jar /usr/gitc/picard.jar \\n SortSam \\n INPUT=${input_bam} \\n OUTPUT=${sorted_bam_name} \\n SORT_ORDER=queryname \\n MAX_RECORDS_IN_RAM=1000000\n }\n runtime {\n docker: \"broadinstitute/genomes-in-the-cloud:2.2.3-1469027018\"\n disks: \"local-disk \" + disk_size + \" HDD\"\n memory: \"3500 MB\"\n }\n output {\n File sorted_bam = \"${sorted_bam_name}\"\n }\n}\n\ntask ValidateSamFile {\n File input_bam\n String report_filename\n Int disk_size\n\n command {\n java -Xmx3000m -jar /usr/gitc/picard.jar \\n ValidateSamFile \\n INPUT=${input_bam} \\n OUTPUT=${report_filename} \\n MODE=VERBOSE \\n IS_BISULFITE_SEQUENCED=false \n }\n runtime {\n docker: \"broadinstitute/genomes-in-the-cloud:2.2.3-1469027018\"\n disks: \"local-disk \" + disk_size + \" HDD\"\n memory: \"3500 MB\"\n }\n output {\n File report = \"${report_filename}\"\n }\n}\n\nworkflow BamToUnmappedBams {\n File input_bam\n String dir_pattern = 
\"gs://./\"\n #String dir_pattern = \"/./\"\n Int revert_sam_disk_size = 400\n Int sort_sam_disk_size = 400\n Int validate_sam_file_disk_size = 200\n\n call RevertSam {\n input:\n input_bam = input_bam,\n revert_bam_name = sub(sub(input_bam, dir_pattern, \"\"), \".bam$\", \"\") + \".unmapped.bam\",\n disk_size = revert_sam_disk_size\n }\n\n# call SortSam {\n# input:\n# input_bam = RevertSam.unmapped_bam,\n# sorted_bam_name = sub(sub(RevertSam.unmapped_bam, dir_pattern, \"\"), \".bam$\", \"\") + \".sorted.bam\",\n# disk_size = sort_sam_disk_size\n# }\n\n call ValidateSamFile {\n input:\n input_bam = RevertSam.unmapped_bam,\n report_filename = sub(sub(RevertSam.unmapped_bam, dir_pattern, \"\"), \".unmapped.bam$\", \"\") + \".validation_report\",\n disk_size = validate_sam_file_disk_size\n }\n\n output {\n RevertSam.\n ValidateSamFile.\n }\n}", "options": "{\n \"default_runtime_attributes\": {\n \"zones\": \"us-central1-b us-central1-c us-central1-f\"\n },\n \"google_project\": \"engle-macarthur-ccdd\",\n \"auth_bucket\": \"gs://cromwell-auth-engle-macarthur-ccdd\",\n \"refresh_token\": \"cleared\",\n \"final_workflow_log_dir\": \"gs://fc-4c1c7765-2de2-4214-ac41-dc10bbcbb55b/c7af7e06-a435-44ec-8466-124ad8e1bcaf/workflow.logs\",\n \"account_name\": \"kcibul@broadinstitute.org\",\n \"jes_gcs_root\": \"gs://fc-4c1c7765-2de2-4214-ac41-dc10bbcbb55b/c7af7e06-a435-44ec-8466-124ad8e1bcaf\"\n}" }, "calls": {

}, "outputs": {

}, "id": "a714b11b-0162-4585-afa5-abbd7433af51", "inputs": { "BamToUnmappedBams.input_bam": "gs://fc-4c1c7765-2de2-4214-ac41-dc10bbcbb55b/batch04/S64-2_Illumina.bam" }, "submission": "2017-01-19T18:17:12.188Z", "status": "Failed", "failures": [{ "message": "Google credentials are invalid: connect timed out" }], "workflowLog": "gs://fc-4c1c7765-2de2-4214-ac41-dc10bbcbb55b/c7af7e06-a435-44ec-8466-124ad8e1bcaf/workflow.logs/workflow.a714b11b-0162-4585-afa5-abbd7433af51.log", "end": "2017-01-19T18:17:39.673Z", "start": "2017-01-19T18:17:19.606Z" }

kcibul commented 7 years ago

@ruchim -- I'd like to get this looked at soon; let's chat about the contents of the milestone.

geoffjentry commented 7 years ago

@kcibul This appears to be a dupe of #1436 - I recommend keeping one as the official issue and closing the other.

katevoss commented 7 years ago

@geoffjentry is this still a problem? @kcibul

geoffjentry commented 7 years ago

@Horneth AFAIK this still exists.

jmthibault79 commented 7 years ago

FYI @katevoss we are seeing this in FireCloud (Alpha environment, "special snowflake Cromwell 26 hotfix 2" aka 70741da6)

Not often: on the order of 1 out of 10,000.

helgridly commented 7 years ago

Saw it again on C26 snowflake:

"failures": [{
    "causedBy": [{
      "causedBy": [],
      "message": "Read timed out"
    }],
    "message": "Google credentials are invalid: Read timed out"
  }]
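For anyone triaging these reports, the nested `causedBy` structure above can be flattened to see every message in the chain. The following is only a minimal Python sketch, not part of Cromwell or FireCloud: the function names `flatten_failure_messages` and `looks_like_transient_timeout` are hypothetical, and the "timed out" substring check is an assumption based solely on the two messages quoted in this thread.

```python
from typing import Any, Dict, List


def flatten_failure_messages(failures: List[Dict[str, Any]]) -> List[str]:
    """Return every "message" in a failures/causedBy tree, outermost first."""
    messages: List[str] = []
    for failure in failures:
        message = failure.get("message")
        if message is not None:
            messages.append(message)
        # "causedBy" may be missing (as in the first report) or an empty list.
        messages.extend(flatten_failure_messages(failure.get("causedBy", [])))
    return messages


def looks_like_transient_timeout(failures: List[Dict[str, Any]]) -> bool:
    """Heuristic (an assumption): any message mentioning "timed out"."""
    return any("timed out" in m.lower() for m in flatten_failure_messages(failures))


# The "failures" value quoted above.
failures = [{
    "causedBy": [{"causedBy": [], "message": "Read timed out"}],
    "message": "Google credentials are invalid: Read timed out",
}]

print(flatten_failure_messages(failures))
# ['Google credentials are invalid: Read timed out', 'Read timed out']
print(looks_like_transient_timeout(failures))
# True
```

The same helper also handles the flat shape from the original report, where each failure carries only a `message` field.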

cjllanwarne commented 7 years ago

cf: https://github.com/broadinstitute/cromwell/blob/develop/supportedBackends/jes/src/main/scala/cromwell/backend/impl/jes/JesInitializationActor.scala#L51
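The thread does not record how (or whether) this was fixed. Since the reported failures are intermittent connect/read timeouts, one generic mitigation is to retry the credential check a few times with exponential backoff before failing the workflow. The sketch below is in Python purely for illustration (Cromwell itself is written in Scala), and `retry_on_timeout`, its parameters, and the `flaky_validation` stand-in are all hypothetical names, not Cromwell or Google client APIs.

```python
import random
import socket
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_timeout(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying only on timeouts, with exponential backoff.

    Any other exception propagates immediately; the last timeout is re-raised
    once max_attempts is exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (socket.timeout, TimeoutError):
            if attempt == max_attempts:
                raise
            # Delays grow as base_delay * 1, 2, 4, ... plus random jitter so
            # that many workflows started together do not retry in lockstep.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
    raise AssertionError("unreachable")


# Hypothetical stand-in for a credential check that times out twice.
attempts = {"count": 0}


def flaky_validation() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise socket.timeout("connect timed out")
    return "credentials ok"


print(retry_on_timeout(flaky_validation, sleep=lambda seconds: None))
# credentials ok
```

The jitter matters for the scenario in the original report, where 123 workflows started at around the same time; without it, retries from all of them would hit the authentication endpoint at the same instants.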