broadinstitute / cromwell

Scientific workflow engine designed for simplicity and scalability. Trivially transition from one-off use cases to massive-scale production environments.
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Workflow fails with slick error with large scale call cache #2182

Open · Horneth opened this issue 7 years ago

Horneth commented 7 years ago

Tried to re-run the 10K JG workflow with call caching on; the workflow failed almost immediately with multiple errors like:

[ERROR] [04/18/2017 21:11:44.685] [cromwell-system-akka.dispatchers.service-dispatcher-86] [akka://cromwell-system/user/cromwell-service/ServiceRegistryActor/MetadataService/metadata-summary-actor] Failed to summarize metadata
java.util.concurrent.RejectedExecutionException: Task slick.basic.BasicBackend$DatabaseDef$$anon$2@64919660 rejected from slick.util.AsyncExecutor$$anon$2$$anon$1@1dce40e4[Running, pool size = 20, active threads = 20, queued tasks = 1000, completed tasks = 800]
    at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
    at slick.util.AsyncExecutor$$anon$2$$anon$3.execute(AsyncExecutor.scala:120)
    at slick.basic.BasicBackend$DatabaseDef$class.runSynchronousDatabaseAction(BasicBackend.scala:233)
    at slick.jdbc.JdbcBackend$DatabaseDef.runSynchronousDatabaseAction(JdbcBackend.scala:38)
    at slick.basic.BasicBackend$DatabaseDef$class.runInContext(BasicBackend.scala:210)
    at slick.jdbc.JdbcBackend$DatabaseDef.runInContext(JdbcBackend.scala:38)
    at slick.basic.BasicBackend$DatabaseDef$class.run$1(BasicBackend.scala:153)
    at slick.basic.BasicBackend$DatabaseDef$class.runInContext(BasicBackend.scala:157)
    at slick.jdbc.JdbcBackend$DatabaseDef.runInContext(JdbcBackend.scala:38)
    at slick.basic.BasicBackend$DatabaseDef$class.runInContext(BasicBackend.scala:179)
    at slick.jdbc.JdbcBackend$DatabaseDef.runInContext(JdbcBackend.scala:38)
    at slick.basic.BasicBackend$DatabaseDef$class.runInternal(BasicBackend.scala:78)
    at slick.jdbc.JdbcBackend$DatabaseDef.runInternal(JdbcBackend.scala:38)
    at slick.basic.BasicBackend$DatabaseDef$class.run(BasicBackend.scala:75)
    at slick.jdbc.JdbcBackend$DatabaseDef.run(JdbcBackend.scala:38)
    at cromwell.database.slick.SlickDatabase.runTransaction(SlickDatabase.scala:103)
    at cromwell.database.slick.MetadataSlickDatabase$class.refreshMetadataSummaryEntries(MetadataSlickDatabase.scala:133)
    at cromwell.database.slick.SlickDatabase.refreshMetadataSummaryEntries(SlickDatabase.scala:54)
    at cromwell.services.metadata.impl.MetadataDatabaseAccess$class.refreshWorkflowMetadataSummaries(MetadataDatabaseAccess.scala:146)
    at cromwell.services.metadata.impl.MetadataSummaryRefreshActor.refreshWorkflowMetadataSummaries(MetadataSummaryRefreshActor.scala:35)
    at cromwell.services.metadata.impl.MetadataSummaryRefreshActor$$anonfun$1.applyOrElse(MetadataSummaryRefreshActor.scala:45)
    at cromwell.services.metadata.impl.MetadataSummaryRefreshActor$$anonfun$1.applyOrElse(MetadataSummaryRefreshActor.scala:43)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at akka.actor.FSM$class.processEvent(FSM.scala:663)
    at cromwell.services.metadata.impl.MetadataSummaryRefreshActor.akka$actor$LoggingFSM$$super$processEvent(MetadataSummaryRefreshActor.scala:35)
    at akka.actor.LoggingFSM$class.processEvent(FSM.scala:799)
    at cromwell.services.metadata.impl.MetadataSummaryRefreshActor.processEvent(MetadataSummaryRefreshActor.scala:35)
    at akka.actor.FSM$class.akka$actor$FSM$$processMsg(FSM.scala:657)
    at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:651)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:496)
    at cromwell.services.metadata.impl.MetadataSummaryRefreshActor.aroundReceive(MetadataSummaryRefreshActor.scala:35)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
Horneth commented 7 years ago

Despite what the stack trace suggests, this is not specifically related to refreshMetadataSummaryEntries; it's just a consequence of the Slick queue overflowing. See https://github.com/slick/slick/issues/1183 and https://github.com/slick/slick/issues/1683
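
For context, the rejection comes from Slick's AsyncExecutor: every db.run submits a task to a fixed-size thread pool backed by a bounded queue, and the "pool size = 20 ... queued tasks = 1000" in the error are exactly that pool and queue at capacity. The Scala sketch below is not Cromwell code, just a minimal illustration of the failure mode against an in-memory H2 database with a deliberately tiny queue; the object name and connection URL are made up.

import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success, Try}

import slick.jdbc.H2Profile.api._
import slick.util.AsyncExecutor

object SlickQueueOverflowSketch extends App {
  // A Database whose executor has 1 worker thread and room for only 2 queued actions.
  val db = Database.forURL(
    url      = "jdbc:h2:mem:demo;DB_CLOSE_DELAY=-1",
    driver   = "org.h2.Driver",
    executor = AsyncExecutor("demo", numThreads = 1, queueSize = 2)
  )

  // Submit more actions than (numThreads + queueSize) can hold. Depending on the Slick
  // version the rejection is thrown synchronously from db.run or surfaces as a failed
  // Future (that difference is what slick/slick#1183 is about), so both are handled.
  (1 to 10).foreach { i =>
    Try(db.run(sql"select 1".as[Int])) match {
      case Failure(e) => println(s"action $i rejected at submission: ${e.getClass.getSimpleName}")
      case Success(f) => f.onComplete {
        case Success(_) => println(s"action $i completed")
        case Failure(e) => println(s"action $i failed: ${e.getClass.getSimpleName}")
      }
    }
  }

  Thread.sleep(2000)
  db.close()
}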

Horneth commented 7 years ago

This started happening again on a 20k-wide scatter (with call cache read OFF). Metadata can be lost, since writes can fail without failing the workflow.

geoffjentry commented 7 years ago

Is upping queueSize the answer here?

Horneth commented 7 years ago

It will probably help a little, but it won't guarantee this doesn't happen again. My concern is more about the fact that we can lose random metadata batches. Really, any DB query can fail. For some of them that's not too bad (e.g. summarizing metadata); others are fatal to the workflow, which is bad but at least fails the workflow; and some are silent, like failing to write metadata (silent in the sense that you'll see it in the logs, but your metadata will be incomplete with no real way to know what's missing).
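
For reference, "upping queueSize" would mean something like the configuration below. This is a hedged sketch, not a recommendation: Cromwell hands its database.db block to Slick's Database.forConfig, which recognizes numThreads and queueSize (Slick's defaults are 20 and 1000, the limits visible in the error). The URL and credentials are placeholders, and the surrounding keys vary by Cromwell version.

database {
  db {
    driver = "com.mysql.jdbc.Driver"
    url = "jdbc:mysql://localhost/cromwell?rewriteBatchedStatements=true"
    user = "cromwell"
    password = "changeme"
    connectionTimeout = 5000
    numThreads = 20    # Slick AsyncExecutor worker threads ("pool size" in the error)
    queueSize = 5000   # pending-action queue; the default of 1000 is what overflowed
  }
}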

geoffjentry commented 7 years ago

Well, that's what happens when we design something in a way where that's semi-intentional :) We should sit down and figure out how to rework all of this so it doesn't tie up the whole system (i.e. the reason we went down this path in the first place).

katevoss commented 7 years ago

@geoffjentry what's the status of this?

geoffjentry commented 7 years ago

We should leave this open. This is basically the same thing @danbills has been poking at for Firecloud but we weren't able to reproduce it. For their side of things we discovered that they weren't taking advantage of metadata batching, which they're going to change. It likely won't solve the issue but should make it robust enough that they don't see it anymore.

However the underlying problem is still lurking.
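
As a rough sketch of the batching knob mentioned above: in Cromwell configurations of roughly this era, metadata write batching sits under the MetadataService block, along the lines shown below. The exact keys and values are illustrative and may differ by version; the point is that larger, less frequent batches mean fewer individual Slick actions competing for the same bounded queue.

services {
  MetadataService {
    config {
      db-batch-size = 200        # metadata rows written per database action
      db-flush-rate = 5 seconds  # how often buffered metadata events are flushed
    }
  }
}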

katevoss commented 7 years ago

I believe the related google doc is Slick Heartburn. @geoffjentry have we chosen a plan of attack yet?

mbookman commented 6 years ago

What is the recommendation for resolving this problem? I am getting the following:

{
  u'status': u'fail',
  u'message': u'Task slick.basic.BasicBackend$DatabaseDef$$anon$2@2dbcf781 rejected from slick.util.AsyncExecutor$$anon$2$$anon$1@6dbdf3be[Running, pool size = 20, active threads = 20, queued tasks = 1000, completed tasks = 550175]'
}

when calling the query endpoint. It happens episodically. If I call query again, it often responds just fine.

I'm particularly curious about the message indicating:

queued tasks = 1000

There is not much going on with this instance:

$ curl http://localhost:8000/engine/v1/stats
{"workflows":24,"jobs":115}

$ curl http://localhost:8000/engine/v1/version
{"cromwell":"33-215cca9-SNAP"}

How should I interpret having 1000 queued tasks?

Thanks!

mbookman commented 6 years ago

I think I have resolved this. My queries are always for a specific workflow name, but I had been issuing an unrestricted query and doing the filtering client-side.

When I change the query call to filter on name, it consistently returns successfully. So I interpret this to mean that the unrestricted query itself was generating a large number of queued tasks.
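
For reference, a server-side filtered query looks something like the call below (assuming the standard /api/workflows/v1/query endpoint; the host, port, and workflow name are placeholders). Filtering on the server avoids pulling every workflow row only to discard most of them client-side.

$ curl "http://localhost:8000/api/workflows/v1/query?name=MyWorkflow"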

mcovarr commented 6 years ago

Just to be clear, the "tasks" referred to here are Slick tasks, not Cromwell / WDL tasks (that error message is produced by the Slick library). I'm speculating a bit, but it may be that the unrestricted query was tying up the database for so long that too many tasks backed up behind it and overflowed the Slick task queue of size 1000. More restrictive server-side filtering like you're doing now definitely seems like a good idea. 🙂

geoffjentry commented 6 years ago

Since @mcovarr and @Horneth are currently looking at call caching, please make sure that this is no longer an issue.