bterlson closed this issue 4 years ago
Different "batch"-like scenarios:
1) Single HTTP request whose body contains multiple items to operate on. Each item can succeed/fail independently, in addition to the whole envelope/request failing. Example: Text Analytics.
2) HTTP multipart/mixed, multiple "nested" HTTP requests. A variation of 1). Example: Storage batch.
3) Streaming/controlling "transactional" batches. Either all messages succeed or all fail. Example: Event Hubs.
4) Streaming/improving efficiency. Lift the burden of gathering messages/requests from the client application into the client library. Example: a hypothetical buffered sender for Event Hubs.
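To make scenario 1) concrete, here is a minimal sketch (Python; the envelope shape and field names are hypothetical, not any real service's wire format) of splitting per-item results out of a single batch response:

```python
# Hypothetical response envelope for a "many items in one request" batch
# (scenario 1). Field names are illustrative only.

def split_batch_results(envelope):
    """Partition per-item results into successes and failures."""
    succeeded, failed = [], []
    for item in envelope["items"]:
        if 200 <= item["status"] < 300:
            succeeded.append(item)
        else:
            failed.append(item)
    return succeeded, failed

envelope = {
    "items": [
        {"id": "1", "status": 200},
        {"id": "2", "status": 503},  # retriable server error
        {"id": "3", "status": 404},  # permanent failure
    ]
}
ok, bad = split_batch_results(envelope)
```

The envelope itself can also fail as a whole (e.g. a 401 on the outer request), which is a separate error path from the per-item statuses above.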
Questions to be answered:
Additional twists to the story to consider or ignore for now:
a) Dependencies between requests in the batch (OData supports this). I have not seen any Azure service actually make use of this.
b) Heterogeneous operations, i.e. multiple completely unrelated requests (e.g. a couple of deletes, an upload or two, and a get-blob-metadata/properties request).
Do retries complicate this at all? Do they need to trim out the "successful" parts of the batch? If different batches fail in different ways, what does the operation return?
Additional twists:
c) client library does not know the batch size prior to sending the first batch?
d) encoding overhead (of conversion to JSON or XML, for example)?
e) Timing issues? (You can't safely do the encoding until the point of transmission.)
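Twists c) through e) all show up in a client-side buffered sender along the lines of scenario 4). A rough sketch, assuming a hypothetical `BufferedSender` with a byte-based batch limit (none of these names are real Azure SDK APIs): checking encoded size after every append is where c) bites, and measuring the encoding up front is exactly the tension in d) and e).

```python
import json

class BufferedSender:
    """Hypothetical client-side buffer that gathers messages and flushes
    them as batches (scenario 4). The batch limit is enforced in encoded
    bytes (twist d), and we can't know the batch size up front (twist c),
    so the size is re-checked after each append. Assumes any single
    message fits within the limit."""

    def __init__(self, send, max_batch_bytes=1024):
        self._send = send            # callable taking the encoded payload
        self._max = max_batch_bytes
        self._pending = []

    def enqueue(self, message):
        self._pending.append(message)
        if len(self._encode(self._pending)) > self._max:
            overflow = self._pending.pop()  # last message pushed us over
            self.flush()
            self._pending.append(overflow)

    def flush(self):
        if self._pending:
            self._send(self._encode(self._pending))
            self._pending.clear()

    @staticmethod
    def _encode(messages):
        return json.dumps(messages).encode("utf-8")

# Usage: collect the "sent" payloads instead of hitting a real service.
sent = []
sender = BufferedSender(sent.append, max_batch_bytes=64)
for i in range(10):
    sender.enqueue({"n": i})
sender.flush()  # drain whatever is left
```

Note the timing issue e) in miniature: this sketch pays the encoding cost on every `enqueue` just to measure size; a real implementation would likely track size incrementally and encode once at transmission.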
Note that each of the various types MIGHT be a separate pattern - we don't have to have a common pattern for all of them.
Concerns around Storage: AggregateException doesn't allow me to tie the exceptions back to the original requests.
@tg-msft @alzimmermsft @rakshith91 to provide code samples on how to reliably delete 500 blobs in a batch.
We will re-visit on Wednesday.
For .NET it would look something like this:
```csharp
[Test]
public async Task Batch_Retries()
{
    // Create a bunch of blobs and fill a queue with their Uris
    using var scenario = Scenario();
    Queue<Uri> blobUris = new Queue<Uri>(await scenario.CreateBlobUrisAsync(1000));
    const int fullBatchSize = 256;

    // Create a batch
    BlobBatchClient client = scenario.GetBlobBatchClient();
    BlobBatch batch = client.CreateBatch();
    var batchTracking = new Dictionary<Response, Uri>();
    int batchSize = 0;

    // Keep iterating until all the blobs have been deleted
    // (batchSize > 0 covers retriable failures queued after the
    // last blob Uri has been dequeued)
    while (blobUris.Count > 0 || batchSize > 0)
    {
        // Fill up the batch
        while (blobUris.Count > 0 && batchSize < fullBatchSize)
        {
            Uri uri = blobUris.Dequeue();
            Response response = batch.DeleteBlob(uri);
            batchTracking[response] = uri;
            batchSize++;
        }

        // Send the batch (and ignore failures since we're processing
        // them ourselves)
        await client.SubmitBatchAsync(batch, throwOnFailure: false);

        // Create a new batch and add any retriable failures to it
        batch = client.CreateBatch();
        batchSize = 0;
        var tracking = new Dictionary<Response, Uri>();
        foreach ((Response previous, Uri uri) in batchTracking.Select(p => (p.Key, p.Value)))
        {
            // We're considering 503 retriable
            if (previous.Status == 503)
            {
                // Add it to the next batch
                Response response = batch.DeleteBlob(uri);
                tracking[response] = uri;
                batchSize++;
            }
        }
        batchTracking = tracking;
    }
}
```
The annoying bits are tracking the batch size (which we can fix), keeping enough info to replay failed requests (which we can fix), and grouping requests into appropriately sized batches (which we can probably fix).
1) We should add a `public int RequestCount { get; }` property to `BlobBatch` so users can at least see how many requests they've added. It doesn't solve the size problem in general, but it would help here and it's not bad info to give people.
2) We can add a `public BlobBatch GetRetryBatch()` method to `BlobBatch` that would use the pipeline's `ResponseClassifier` to determine which failures are retriable, returning a batch object that can be sent straight away.
3) We could potentially solve the size problem in a shared way. What if `BlobBatch` had a `BlobBatch GetUnsentBatch(bool includeRetriableFailures = true)` method that returned requests left unsent because the batch was too big, plus failures that could be retried, and `null` when exhausted? We could then add a parameter to `BlobBatchClient.SubmitBatchAsync` that keeps sending until exhausted (defaulting to `true`).
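The retry-filtering inside the proposed GetRetryBatch() could look roughly like this (Python sketch; the status set and all names here are illustrative stand-ins for what a pipeline's ResponseClassifier would decide, not real Azure SDK APIs):

```python
# Stand-in for a pipeline ResponseClassifier: which HTTP statuses are
# worth resending. The exact set is an assumption for this sketch.
RETRIABLE_STATUSES = {408, 429, 500, 502, 503, 504}

def is_retriable(status):
    return status in RETRIABLE_STATUSES

def get_retry_batch(sent_batch):
    """Return the requests worth resending, mirroring the proposed
    BlobBatch.GetRetryBatch(): keep only sub-requests whose response
    the classifier considers retriable."""
    return [request for request, status in sent_batch if is_retriable(status)]

# (request, response status) pairs from a previously submitted batch
sent = [("DELETE blob-1", 202), ("DELETE blob-2", 503), ("DELETE blob-3", 404)]
retry = get_retry_batch(sent)  # only blob-2's delete remains
```

The point of pushing this into the library is that the classifier, not user code, decides what counts as retriable, so users stop hand-rolling checks like `previous.Status == 503`.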
Here’s how we could simplify the code with 1 and 2:
```csharp
[Test]
public async Task Batch_Retries2()
{
    // Create a bunch of blobs and fill a queue with their Uris
    using var scenario = Scenario();
    Queue<Uri> blobUris = new Queue<Uri>(await scenario.CreateBlobUrisAsync(1000));
    const int fullBatchSize = 256;

    // Create a batch
    BlobBatchClient client = scenario.GetBlobBatchClient();
    BlobBatch batch = client.CreateBatch();

    // Keep iterating until all the blobs have been deleted
    // (RequestCount > 0 covers retriable failures queued after the
    // last blob Uri has been dequeued)
    while (blobUris.Count > 0 || batch.RequestCount > 0)
    {
        // Fill up the batch
        while (blobUris.Count > 0 && batch.RequestCount < fullBatchSize)
        {
            batch.DeleteBlob(blobUris.Dequeue());
        }

        // Send the batch (and ignore failures - true failures will be
        // completely ignored, retriable failures will be resent)
        await client.SubmitBatchAsync(batch, throwOnFailure: false);

        // Create a new batch and add any retriable failures to it
        batch = batch.GetRetryBatch();
    }
}
```
And here’s how we could simplify the code with 3:
```csharp
[Test]
public async Task Batch_Retries3()
{
    // Create a bunch of blobs and get their Uris
    using var scenario = Scenario();
    var blobUris = await scenario.CreateBlobUrisAsync(1000);

    // Create a batch full of deletes
    BlobBatchClient client = scenario.GetBlobBatchClient();
    BlobBatch batch = client.CreateBatch();
    foreach (var uri in blobUris)
    {
        batch.DeleteBlob(uri);
    }

    // Send the batch (and ignore failures since we're processing
    // them ourselves while continuing to send retriable failures
    // until exhausted)
    await client.SubmitBatchAsync(batch, throwOnFailure: false, untilExhausted: true);
}
```
I don't have the spare mental cycles to think of good names, but hopefully you get the idea.
From the architecture board:
Storage, right now, deals with homogeneous batches. We can't rule out heterogeneous batches.
Johan introduced a new term "smurf".
Heterogeneous batches are IN SCOPE for this discussion.
.NET (and Java)
There is no way to detect an "explodey" Response - we need an isExplodey() or similar.
We would love to have an "isRetryable" flag on the response for error conditions.
Code sample #3 gets around the batch-size issue (where the number of items to be sent can't be known in advance).
Question from Python: should we throw on any error in the batch? The point is that the delete_blob vs. delete_blobs APIs feel similar but act differently.
The OData 3.0 batching spec is not clear. However, it would be a version bump change if they changed the behavior.
Board decision:
We need some more info from @johanste before closing on whether convenience methods should throw or not on a partial failure.
We need this in before the API breaking change window closes (maybe next Thursday)
Link to arch board recording: https://msit.microsoftstream.com/video/95f73359-e914-408f-b3f1-9fb9ac5a364f
Need guidelines for sending a sequence of things where the first N things succeed before there's a failure.