Closed geoffjentry closed 7 years ago
Hi Jeff,
But the Google Storage backend endpoints still operate on the idea of an exponential backoff, as both are REST-based:
https://cloud.google.com/storage/docs/xml-api/overview
https://cloud.google.com/storage/docs/json_api/
So there is no way around needing to wait. Maybe having an exponential backoff would help instead of the Thread.sleep(retryInterval.toMillis.toInt)
of 500 milliseconds, as now you're synchronizing a subset of threads to wake up at the same time - becoming similar to a Thundering herd problem - which in this case is a subcluster of threads that would succeed instead of just one.
Hope it helps, ~p
@pgrosu Most of our retries (all of them besides this, actually) use the actor system to wait until the next try instead of blocking a thread. It's the thread blocking that is problematic.
@geoffjentry Ah thanks, so basically Akka's ActorSystem with Retry.scala, right?
Exactly. When you use Akka's scheduler it's not blocking a thread, which leads to much happiness.
And with some minimal parental supervision ;) Super duper, many thanks Jeff!
If we switch to #1187, we will need our own retries outside of the FileSystemProvider anyway.
Closing as a duplicate of newer, more ambition, scalability tickets
When investigating the GCS ls issue from the green folks in 0.19 I saw that it was using one of our old style blocking retries. I then assumed that there's no way we'd still be doing that and told everyone that we weren't. Then I went and double checked what I said and sadly I was wrong.
GcsFileSystemProvider.withRetry blocks threads. With the new dispatcher bulkheading it should be better, but considering how many of these might be in flight at once in a joint genotyping sort of environment, this seems bad.