Open samanehsan opened 5 years ago
A few more questions occurred to me -
Was the imports zip the same in each workflow? Yes, we were running the same workflow so the imports zip is the same. We get those import files by downloading them from github and adding them to a cache so that we don't have to download them for each workflow.
Were the workflows all the same? Yes, all of the workflows were the same
Do you have any logs from the sender to check that a zip was indeed sent? No, we just have logs that a workflow was submitted to Cromwell 😞
Were they submitted as a series of 999 separate submits or as a single batch submit POST? They were submitted as a series of 999 individual requests in the "On Hold" status, and a separate process sent requests to Cromwell to start one of those on-hold workflows every 10 seconds.
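For reference, the submission pattern described above (release one held workflow at a time, pausing between each) can be sketched in Python. This is illustrative only: the `release` callable stands in for whatever performs the actual API call (e.g. a POST to Cromwell's `releaseHold` endpoint), and the function name and parameters are ours, not part of the submitter.

```python
import time

def release_held_workflows(workflow_ids, release, interval_seconds=10, sleep=time.sleep):
    """Release on-hold workflows one at a time, pausing between each.

    `release` is a callable performing the actual API call, e.g. a POST to
    Cromwell's /api/workflows/v1/{id}/releaseHold endpoint (illustrative).
    `sleep` is injectable so the pacing can be tested without waiting.
    """
    released = []
    for workflow_id in workflow_ids:
        release(workflow_id)
        released.append(workflow_id)
        sleep(interval_seconds)
    return released
```

Injecting `release` and `sleep` keeps the pacing logic separable from the HTTP client, which also makes it easy to log exactly which workflow IDs were released and when.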
Another example from this morning: d5ea4ae4-1644-4e71-98fb-368465c67a2c
Of note, in this morning's run we observed this on 9 out of 2539 workflows submitted in the same way that Saman highlighted above. @cjllanwarne
Update. I did extensive testing recently to determine whether we might be submitting an empty zip file accidentally, which would produce the same symptoms. That seems unlikely now.
I used Lira, the same service we use to submit to CaaS, to send many POST requests to a dummy service that I stood up purely to check the contents of the zip file. The zip was not truncated in any of the 25,000 POST requests I sent. When submitting to CaaS recently, we saw this 8 times out of 2,500 submissions.
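A minimal sketch of the kind of check the dummy receiver can run on each payload, assuming standard-library `zipfile` (the function name and `expected_names` parameter are ours, not Lira's): an empty or truncated zip would reproduce the symptom, so both cases are flagged.

```python
import io
import zipfile

def zip_payload_is_intact(payload: bytes, expected_names=None) -> bool:
    """Return True if `payload` is a readable, non-empty zip whose entries
    all pass a CRC check (ZipFile.testzip returns the first bad entry, or
    None if all entries are intact)."""
    try:
        with zipfile.ZipFile(io.BytesIO(payload)) as zf:
            names = zf.namelist()
            if not names:
                return False  # an "empty zip" would reproduce the symptom
            if zf.testzip() is not None:
                return False  # a truncated or corrupt entry
            if expected_names is not None and set(names) != set(expected_names):
                return False  # entries differ from what the sender intended
        return True
    except zipfile.BadZipFile:
        return False  # not parseable as a zip at all
```

Running this on every received payload, and logging the result alongside the workflow ID, would narrow down whether the corruption happens before or after the zip reaches Cromwell.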
Update, during the most recent scaling tests we ran, 37/4000 and 26/2544 workflows failed due to this issue respectively. It's on CaaS-prod. Some of the workflows:
In a more recent scale test we observed this error on 37 out of 4000 workflows submitted. @geoffjentry @ruchim @danbills
ETA: Sorry, just saw that Rex already commented on that (hadn't reloaded the page)
I can see these workflows failing in the Cromwell logs, but I do not have the permissions to find out more from within the application (tried via Swagger)
[anichols@caas-cromwell-prod101 ~]$ docker logs caas_cromwell_1 | grep 7541e4e2-f74c-43f7-82af-8df891a27520
2018-11-05 21:05:08,784 cromwell-system-akka.dispatchers.api-dispatcher-378 INFO - Unspecified type (Unspecified version) workflow 7541e4e2-f74c-43f7-82af-8df891a27520 submitted
2018-11-06 04:55:58,066 cromwell-system-akka.dispatchers.engine-dispatcher-98 INFO - Status changed to 'Submitted' for 7541e4e2-f74c-43f7-82af-8df891a27520
2018-11-06 04:56:01,019 cromwell-system-akka.dispatchers.engine-dispatcher-123 INFO - WorkflowManagerActor Starting workflow UUID(7541e4e2-f74c-43f7-82af-8df891a27520)
2018-11-06 04:56:01,019 cromwell-system-akka.dispatchers.engine-dispatcher-123 INFO - WorkflowManagerActor Successfully started WorkflowActor-7541e4e2-f74c-43f7-82af-8df891a27520
2018-11-06 04:56:01,068 cromwell-system-akka.dispatchers.engine-dispatcher-49 ERROR - WorkflowManagerActor Workflow 7541e4e2-f74c-43f7-82af-8df891a27520 failed (during MaterializingWorkflowDescriptorState): cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anon$1: Workflow input processing failed:
2018-11-06 04:56:01,068 cromwell-system-akka.dispatchers.engine-dispatcher-49 INFO - WorkflowManagerActor WorkflowActor-7541e4e2-f74c-43f7-82af-8df891a27520 is in a terminal state: WorkflowFailedState
@aednichols I could give you access to the caas-collections!
If temp files are the problem, it would be this code that's actually using them in the failing code path
Has this ever been reproduced without passing through on-hold status?
With the caveat that I do not know how on-hold really works, I'm suspicious that something might be happening to the temporary directory of unzipped imports between submission and execution time.
I'd be very interested to see whether the imports zip makes it into the IMPORTS_ZIP column on WORKFLOW_STORE_ENTRY.
We've merged a few improvements based on the investigation (due in Cromwell 37), but we cannot guarantee this specific problem is fixed.
The root cause is proving elusive and we find ourselves needing to pause for now due to schedule pressure.
Thanks for the update @aednichols!
Hello! We seem to be running into a similar issue on Cromwell 34-bda9485. After humming along without a problem for a while, we all of a sudden stopped being able to run workflows with zipped WDL imports. Looking at the metadata, we get:
"failures": [
  {
    "message": "Workflow input processing failed",
    "causedBy": [
      {
        "causedBy": [],
        "message": "pipelines%2Fdna_seq%2FUnmappedBamToAlignedBam.wdl: Name or service not known"
      },
      {
        "causedBy": [],
        "message": "java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)"
      },...
Is this the same issue?
I'd also like to note that we've now seen this at least 3 times in production in the past month and a bit. Once on Dec 21, once on Jan 10, and again today. So far the temporary fix has been simply restarting Cromwell.
Referencing a (much earlier!) comment of mine:
I'd be very interested to see whether the imports zip makes it into the IMPORTS_ZIP column on WORKFLOW_STORE_ENTRY.
Is there a way you or we could check the DB for this?
@hjfbynara Can you comment on this?
Side note:
The reason you seem to be getting a DNS error (Name or service not known) is that when we take an import and try it against all of the available resolvers, we're using a function called firstSuccess that does exactly what you'd expect. If all the resolvers fail, the error you get is from the last resolver we tried, not the one it "should have used" (since we have no idea which one that would be ahead of time).
That's exactly what I thought was happening.
If you are running a bunch of workflows that all have zip imports, the result of
SELECT * FROM WORKFLOW_STORE_ENTRY WHERE IMPORTS_ZIP IS NOT NULL
should be the same as
SELECT * FROM WORKFLOW_STORE_ENTRY
Note that you can set the workflowOnHold flag to keep the contents of WORKFLOW_STORE_ENTRY stable; otherwise, workflows have an annoying habit of running and finishing, at which point they disappear from the table.
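The two-query comparison above can be scripted. A sketch against an in-memory SQLite stand-in for the real database (the table schema here is simplified to the two relevant columns; the real WORKFLOW_STORE_ENTRY has more):

```python
import sqlite3

def missing_imports_count(conn):
    """Count workflow-store rows where the imports zip never arrived,
    i.e. the difference between the two SELECTs discussed above."""
    total = conn.execute(
        "SELECT COUNT(*) FROM WORKFLOW_STORE_ENTRY").fetchone()[0]
    with_zip = conn.execute(
        "SELECT COUNT(*) FROM WORKFLOW_STORE_ENTRY WHERE IMPORTS_ZIP IS NOT NULL"
    ).fetchone()[0]
    return total - with_zip

# Stand-in table: two rows with a zip blob, one where it is missing.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE WORKFLOW_STORE_ENTRY "
    "(WORKFLOW_EXECUTION_UUID TEXT, IMPORTS_ZIP BLOB)")
conn.executemany(
    "INSERT INTO WORKFLOW_STORE_ENTRY VALUES (?, ?)",
    [("ok-1", b"PK..."), ("ok-2", b"PK..."), ("bad-1", None)])
```

Any nonzero result while all submitted workflows are known to carry zip imports would confirm the zip is being lost before it reaches the workflow store.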
Is this still happening on your current Cromwell version?
While scale testing a pipeline last week, we saw this failure message in Cromwell-as-a-Service on 3/999 workflows:
"Failed to import workflow SmartSeq2SingleSample.wdl.:\nBad import SmartSeq2SingleSample.wdl: Failed to resolve 'SmartSeq2SingleSample.wdl' using resolver: 'http importer (no 'relative-to' origin)' (reason 1 of 1): Relative path"
All 999 workflows are the same and were started programmatically, but the three that have this import error are missing "imports" in the Cromwell metadata. We are providing a zipped directory containing all of the workflow dependencies instead of using http imports. It's unclear whether there was an issue submitting these workflows to Cromwell, or if the zip file was received by Cromwell but was not passed on to the workflow.
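One way to narrow down where the zip goes missing is to build the dependencies zip in memory and log its digest at submit time, so it can later be compared against what Cromwell stored. A sketch, assuming only the standard library (the function name is ours, and the server-side comparison step is hypothetical):

```python
import hashlib
import io
import zipfile

def build_dependencies_zip(wdl_sources):
    """Zip a {filename: wdl_text} mapping in memory and return
    (payload_bytes, sha256_hex) so the digest can be logged at submit
    time and compared later against what Cromwell received."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(wdl_sources):  # sorted for a stable entry order
            zf.writestr(name, wdl_sources[name])
    payload = buf.getvalue()
    return payload, hashlib.sha256(payload).hexdigest()
```

If the logged digest matches the blob in the IMPORTS_ZIP column for a failed workflow, the zip arrived intact and the problem is downstream of the workflow store; if the column is NULL or differs, the loss happened at or before submission.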
Here are the workflow IDs for the failed workflows: