broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition between one off use cases to massive scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
979 stars 353 forks source link

Import zip missing from workflow #4117

Open samanehsan opened 5 years ago

samanehsan commented 5 years ago

While scale testing a pipeline last week, we saw this failure message in Cromwell-as-a-Service on 3/999 workflows:

Failed to import workflow SmartSeq2SingleSample.wdl.:\nBad import SmartSeq2SingleSample.wdl: Failed to resolve 'SmartSeq2SingleSample.wdl' using resolver: 'http importer (no 'relative-to' origin)' (reason 1 of 1): Relative path"

All 999 workflows are the same and were started programmatically, but the three that have this import error are missing "imports" in the Cromwell metadata. We are providing a zipped directory containing all of the workflow dependencies instead of using http imports. It's unclear whether there was an issue submitting these workflows to Cromwell, or if the zip file was received by Cromwell but was not passed on to the workflow.

Here are the workflow IDs for the failed workflows:

cjllanwarne commented 5 years ago

A few more questions occurred to me -

samanehsan commented 5 years ago

Was the imports zip the same in each workflow? Yes, we were running the same workflow so the imports zip is the same. We get those import files by downloading them from github and adding them to a cache so that we don't have to download them for each workflow.

Were the workflows all the same? Yes, all of the workflows were the same

Do you have any logs from the sender to check that a zip was indeed sent? No, we just have logs that a workflow was submitted to Cromwell 😞

Were they submitted as a series of 999 separate submits or as a single batch submit POST? They were submitted as a series of 999 individual requests in the "On Hold" status, and a separate process sent requests to Cromwell to started one of those on-hold workflows every 10 seconds.

dshiga commented 5 years ago

Another example from this morning: d5ea4ae4-1644-4e71-98fb-368465c67a2c

kbergin commented 5 years ago

Of note, in the 'run this morning' we observed this 9 out of 2539 workflows submitted in the same way that Saman highlighted above. @cjllanwarne

dshiga commented 5 years ago

Update. I did extensive testing recently to determine whether we might be submitting an empty zip file accidentally, which would produce the same symptoms. That seems unlikely now.

I used Lira, the same service we use to submit to CaaS, I sent many post requests to a dummy service that I stood up purely to check the contents of the zip file. The zip was not truncated in any of the 25,000 post requests I sent. When submitting to CaaS recently, we saw this 8 times out of 2,500 submissions.

rexwangcc commented 5 years ago

Update, during the most recent scaling tests we ran, 37/4000 and 26/2544 workflows failed due to this issue respectively. It's on CaaS-prod. Some of the workflows:

kbergin commented 5 years ago

In a more recent scale test we observed this error on 37 out of 4000 workflows submitted. @geoffjentry @ruchim @danbills

ETA: Sorry, just saw that Rex already commented that (hadn't reloaded the page)

aednichols commented 5 years ago

I can see these workflows failing in the Cromwell logs, but I do not have the permissions to find out more from within the application (tried via Swagger)

[anichols@caas-cromwell-prod101 ~]$ docker logs caas_cromwell_1 | grep 7541e4e2-f74c-43f7-82af-8df891a27520
2018-11-05 21:05:08,784 cromwell-system-akka.dispatchers.api-dispatcher-378 INFO  - Unspecified type (Unspecified version) workflow 7541e4e2-f74c-43f7-82af-8df891a27520 submitted
2018-11-06 04:55:58,066 cromwell-system-akka.dispatchers.engine-dispatcher-98 INFO  - Status changed to 'Submitted' for 7541e4e2-f74c-43f7-82af-8df891a27520
2018-11-06 04:56:01,019 cromwell-system-akka.dispatchers.engine-dispatcher-123 INFO  - WorkflowManagerActor Starting workflow UUID(7541e4e2-f74c-43f7-82af-8df891a27520)
2018-11-06 04:56:01,019 cromwell-system-akka.dispatchers.engine-dispatcher-123 INFO  - WorkflowManagerActor Successfully started WorkflowActor-7541e4e2-f74c-43f7-82af-8df891a27520
2018-11-06 04:56:01,068 cromwell-system-akka.dispatchers.engine-dispatcher-49 ERROR - WorkflowManagerActor Workflow 7541e4e2-f74c-43f7-82af-8df891a27520 failed (during MaterializingWorkflowDescriptorState): cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anon$1: Workflow input processing failed:
2018-11-06 04:56:01,068 cromwell-system-akka.dispatchers.engine-dispatcher-49 INFO  - WorkflowManagerActor WorkflowActor-7541e4e2-f74c-43f7-82af-8df891a27520 is in a terminal state: WorkflowFailedState
rexwangcc commented 5 years ago

@aednichols I could give you access to the caas-collections!

aednichols commented 5 years ago

I don't really know anything concrete yet but I discovered there are about 25k temporary directories of the format /tmp/99104084175675660.zip2922990665093481334 still around on the CaaS server.

Those would be created by this code and notionally cleaned up by this code.

aednichols commented 5 years ago

If temp files are the problem it would be this code that's actually using them in the code path that fails

aednichols commented 5 years ago

Has this ever been reproduced without passing through on-hold status?

With the caveat that I do not know how on-hold really works, I'm suspicious that something might be happening to the temporary directory of unzipped imports between submission and execution time.

aednichols commented 5 years ago

I'd be very interested to see whether the imports zip makes it into the IMPORTS_ZIP column on WORKFLOW_STORE_ENTRY.

aednichols commented 5 years ago

We've merged a few improvements based on the investigation (due in Cromwell 37), but we cannot guarantee this specific problem is fixed.

The root cause is proving elusive and we find ourselves needing to pause for now due to schedule pressure.

samanehsan commented 5 years ago

Thanks for the update @aednichols!

tlangs commented 5 years ago

Hello! We seem to be running into a similar issue on cromwell 34-bda9485. After humming along without a problem for a while, we all of a sudden stopped being able to run workflows with zipped WDL imports. Looking at the metadata, we get:

"failures": [
{
"message": "Workflow input processing failed",
"causedBy": [
{
"causedBy": [],
"message": "pipelines%2Fdna_seq%2FUnmappedBamToAlignedBam.wdl: Name or service not known"
},
{
"causedBy": [],
"message": "java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)"
},...

Is this the same issue?

charlesdyan commented 5 years ago

I'd also like to note that we've now seen this at least 3 times in production in the past month and a bit. Once on Dec 21, once on Jan 10, and again today. So far the temporary fix has been simply restarting Cromwell.

aednichols commented 5 years ago

Referencing a (much earlier!) comment of mine:

I'd be very interested to see whether the imports zip makes it into the IMPORTS_ZIP column on WORKFLOW_STORE_ENTRY.

Is there a way you or we could check the DB for this?

tlangs commented 5 years ago

@hjfbynara Can you comment on this?

aednichols commented 5 years ago

Side note:

The reason you seem to be getting a DNS error Name or service not known is that when we take an import and try it against all of the available resolvers, we're using a function called firstSuccess that does exactly what you'd expect.

If all the resolvers fail, the error you get is from the last resolver that we tried, not the one it "should have used" (since we have no idea which one that would be ahead of time).

tlangs commented 5 years ago

That's exactly what I thought was happening.

aednichols commented 5 years ago

If you are running a bunch of workflows that all have zip imports, the result of

SELECT * FROM WORKFLOW_STORE_ENTRY WHERE IMPORTS_ZIP IS NOT NULL

should be the same as

SELECT * FROM WORKFLOW_STORE_ENTRY
aednichols commented 5 years ago

Note that you can set the workflowOnHold flag to keep the contents of WORKFLOW_STORE_ENTRY stable - otherwise, workflows have an annoying habit of running and finishing, at which point they disappear from the table

aednichols commented 5 years ago

See also https://github.com/broadinstitute/cromwell/issues/4406

aednichols commented 5 years ago

See also https://github.com/broadinstitute/cromwell/issues/1959

aednichols commented 5 years ago

Is this still happening on your current Cromwell version?