Open Horneth opened 7 years ago
Contrary to what the stack trace suggests, this is not specifically related to `refreshMetadataSummaryEntries`; it's just a consequence of the Slick queue being overflowed.
See https://github.com/slick/slick/issues/1183 and https://github.com/slick/slick/issues/1683
This starts happening again on a 20k-wide scatter (with call cache read OFF). Metadata can be lost, since writes can fail without failing the workflow.
Is upping `queueSize` the answer here?
It will probably help a little, but it won't guarantee this can't happen. My concern is more about the fact that we can lose random metadata batches. Really, any DB query can fail. For some of them that's not so bad (e.g. summarizing metadata); others are fatal to the workflow, which is bad but at least fails the workflow; and some are silent, like failing to write metadata (silent in the sense that you'll see it in the logs, but your metadata will be incomplete with no real way to know what's missing).
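For reference, the Slick queue size can be bumped in the `database` section of the Cromwell config; these settings are passed through to Slick's `Database.forConfig`. A sketch, with illustrative values (not recommendations):

```hocon
database {
  db {
    # Slick AsyncExecutor settings: queueSize defaults to 1000.
    numThreads = 20
    # Illustrative value; a larger queue only delays the overflow,
    # it doesn't prevent it.
    queueSize = 2000
  }
}
```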
Well, that's what happens when we design something in a way where that's semi-intentional :) We should sit down and figure out how to rework all of this in a way that doesn't tie up the whole system (which is the reason we went down this path in the first place).
@geoffjentry what's the status of this?
We should leave this open. This is basically the same thing @danbills has been poking at for Firecloud but we weren't able to reproduce it. For their side of things we discovered that they weren't taking advantage of metadata batching, which they're going to change. It likely won't solve the issue but should make it robust enough that they don't see it anymore.
However, the underlying problem is still lurking.
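For anyone else hitting this from the FireCloud side: metadata batching is controlled under `services.MetadataService` in the Cromwell config. A sketch of the relevant settings (setting names per the Cromwell configuration docs; values are illustrative):

```hocon
services {
  MetadataService {
    config {
      # Number of metadata rows written to the DB per batch.
      db-batch-size = 200
      # How often the in-memory metadata write queue is flushed.
      db-flush-rate = 5 seconds
    }
  }
}
```

Batching reduces the number of individual Slick tasks submitted, which is why it makes the queue overflow less likely without actually fixing it.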
I believe the related google doc is Slick Heartburn. @geoffjentry have we chosen a plan of attack yet?
What is the recommendation for resolving this problem? I am getting the following:
```
{
  u'status': u'fail',
  u'message': u'Task slick.basic.BasicBackend$DatabaseDef$$anon$2@2dbcf781 rejected from slick.util.AsyncExecutor$$anon$2$$anon$1@6dbdf3be[Running, pool size = 20, active threads = 20, queued tasks = 1000, completed tasks = 550175]'
}
```
when calling the `query` endpoint.
It happens episodically. If I call `query` again, it often responds just fine.
I'm particularly curious about the message indicating `queued tasks = 1000`.
There is not much going on with this instance:
```
$ curl http://localhost:8000/engine/v1/stats
{"workflows":24,"jobs":115}
$ curl http://localhost:8000/engine/v1/version
{"cromwell":"33-215cca9-SNAP"}
```
How should I interpret having 1000 queued tasks?
Thanks!
I think I have resolved this.
My `query` call always includes a workflow name, but I had been issuing an unrestricted query and doing the filtering client-side.
When I change the `query` call to filter on `name`, it returns successfully on a consistent basis.
So I interpret this to mean that it was the `query` call itself that was generating a large number of queued tasks.
Just to be clear, the "tasks" referred to here are Slick tasks, not Cromwell / WDL tasks (that error message is produced by the Slick library). I'm speculating a bit, but it may be that the unrestricted query was tying up the database for so long that too many tasks backed up behind it and overflowed the Slick task queue of size 1000. More restrictive server-side filtering like you're doing now definitely seems like a good idea. đŸ™‚
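The rejection behavior can be reproduced with a plain `java.util.concurrent` thread pool and a bounded queue, which is roughly the shape of what Slick's AsyncExecutor wraps. This is a minimal sketch for illustration, not Slick's actual implementation; `QueueOverflowDemo` and `submitUntilRejected` are made-up names:

```scala
import java.util.concurrent.{ArrayBlockingQueue, CountDownLatch, RejectedExecutionException, ThreadPoolExecutor, TimeUnit}

object QueueOverflowDemo {
  // A fixed-size thread pool with a bounded task queue. When all threads are
  // busy and the queue is full, execute() throws RejectedExecutionException --
  // the "Task ... rejected from ..." message seen in the error above.
  def submitUntilRejected(poolSize: Int, queueSize: Int): Int = {
    val gate = new CountDownLatch(1)
    val executor = new ThreadPoolExecutor(
      poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
      new ArrayBlockingQueue[Runnable](queueSize))
    val blocked: Runnable = () => gate.await() // tasks block until released
    var accepted = 0
    try {
      // Fill the pool, then the queue, until a submission is refused.
      while (true) { executor.execute(blocked); accepted += 1 }
    } catch {
      case _: RejectedExecutionException => () // queue full: submission refused
    } finally {
      gate.countDown()
      executor.shutdown()
    }
    accepted // = poolSize running + queueSize queued
  }
}
```

With a pool of 20 threads and a queue of 1000, rejection kicks in at 1020 outstanding tasks, which matches the `pool size = 20, active threads = 20, queued tasks = 1000` numbers in the message.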
Since @mcovarr and @Horneth are currently looking at call caching, please make sure this is no longer an issue.
Tried to re-run the 10K JG workflow with CC on; the workflow failed almost immediately with multiple errors like