broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition between one off use cases to massive scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
996 stars 361 forks source link

GcsFileSystemProvider is still using a blocking retry #1279

Closed geoffjentry closed 7 years ago

geoffjentry commented 8 years ago

When investigating the GCS ls issue from the green folks in 0.19 I saw that it was using one of our old style blocking retries. I then assumed that there's no way we'd still be doing that and told everyone that we weren't. Then I went and double checked what I said and sadly I was wrong.

GcsFileSystemProvider.withRetry blocks threads. With the new dispatcher bulkheading it should be better, but considering how many of these might be in flight at once in a joint genotyping sort of environment, this seems bad.

pgrosu commented 8 years ago

Hi Jeff,

But the Google Storage backend endpoints still operate on the idea of an exponential backoff, as both are REST-based:

https://cloud.google.com/storage/docs/xml-api/overview

https://cloud.google.com/storage/docs/json_api/

So there is no way around needing to wait. Maybe having an exponential backoff would help instead of the Thread.sleep(retryInterval.toMillis.toInt) of 500 milliseconds, as now you're synchronizing a subset of threads to wake up at the same time - becoming similar to a Thundering herd problem - which in this case is a subcluster of threads that would succeed instead of just one.

Hope it helps, ~p

geoffjentry commented 8 years ago

@pgrosu Most of our retries (all of them besides this, actually) use the actor system to wait until the next try instead of blocking a thread. It's the thread blocking that is problematic.

pgrosu commented 8 years ago

@geoffjentry Ah thanks, so basically Akka's ActorSystem with Retry.scala, right?

geoffjentry commented 8 years ago

Exactly. When you use Akka's scheduler it's not blocking a thread, which leads to much happiness.

pgrosu commented 8 years ago

And with some minimal parental supervision ;) Super duper, many thanks Jeff!

kshakir commented 8 years ago

If we switch to #1187, we will need our own retries outside of the FileSystemProvider anyway.

cjllanwarne commented 7 years ago

Closing as a duplicate of newer, more ambition, scalability tickets