bterlson commented 5 years ago

Need guidelines for sending a sequence of things where the first N things succeed before there's a failure.

johanste commented 4 years ago

Different "batch"-like scenarios:

1) Single HTTP request, request body contains multiple items that will be operated upon. Each item can succeed/fail independently in addition to the whole envelope/request failing. Example: text analytics. 2) HTTP multi-part-mixed/http, multiple "nested" HTTP requests. Variation of 1). Example, storage batch. 3) Streaming/controlling "transactional" batches. Either all message succeed or all fail. Example, event hubs. 4) Streaming/improving efficiency. Lift burden of gathering messages/requests from the client application into the client library. Example: hypothetical buffered sender for event hubs.

Questions to be answered:

How to report what parts succeeded/failed - single response with two lists; one with successes, one with failures. How would the caller relate requests with failures?
If one or more sub requests failed, should an exception be thrown?
For buffering, how would we report failures?

Additional twists to the story to consider or ignore for now:

a) dependencies between requests in the batch (OData supports this). I have not seen any services in Azure actually make use of this. b) heterogeneous operations - i.e. multiple completely unrelated requests (e.g. a couple of deletes, an upload or two and a get blob metadata properties).

richardpark-msft commented 4 years ago

Do retries complicate this at all? Do they need to trim out "successful" parts of the batch? If there are different failures between batches what does it return?

adrianhall commented 4 years ago

Additional twists: c) client library does not know the batch size prior to sending the first batch? d) encoding overhead (of conversion to JSON or XML, for example)?
e) Timing issues? (you can't safely do the encoding until point of transmission)

Note that each of the various types MIGHT be a separate pattern - we don't have to have a common pattern for all of them.

Concerns around Storage - AggregateException doesn't allow me to tie the exceptions to the original request.

@tg-msft @alzimmermsft @rakshith91 to provide code samples on how to reliably delete 500 blobs in a batch.

We will re-visit on Wednesday.

tg-msft commented 4 years ago

For .NET it would look something like this:

        [Test]
        public async Task Batch_Retries()
        {
            // Create a bunch of blobs and fill a queue with their Uris
            using var scenario = Scenario();
            Queue<Uri> blobUris = new Queue<Uri>(await scenario.CreateBlobUrisAsync(1000));
            const int fullBatchSize = 256;

            // Create a batch
            BlobBatchClient client = scenario.GetBlobBatchClient();
            BlobBatch batch = client.CreateBatch();
            var batchTracking = new Dictionary<Response, Uri>();
            int batchSize = 0;

            // Keep iterating until all the blobs have been deleted
            while (blobUris.Count > 0)
            {
                // Fill up the batch
                while (blobUris.Count > 0 && batchSize < fullBatchSize)
                {
                    Uri uri = blobUris.Dequeue();
                    Response response = batch.DeleteBlob(uri);
                    batchTracking[response] = uri;
                    batchSize++;
                }

                // Send the batch (and ignore failures since we're processing
                // them ourselves)
                await client.SubmitBatchAsync(batch, throwOnFailure: false);

                // Create a new batch and add any retriable failures to it
                batch = client.CreateBatch();
                batchSize = 0;
                var tracking = new Dictionary<Response, Uri>();
                foreach ((Response previous, Uri uri) in batchTracking.Select(p => (p.Key, p.Value)))
                {
                    // We're considering 503 retriable
                    if (previous.Status == 503)
                    {
                        // Add it to the next batch
                        Response response = batch.DeleteBlob(uri);
                        tracking[response] = uri;
                        batchSize++;
                    }
                }
                batchTracking = tracking;
            }
        }

The annoying bits are tracking the batch size (which we can fix), keeping enough info to replay failed tests (which we can fix), and grouping requests into appropriately sized batches (which we can probably fix).

We should add a public int RequestCount { get; } to BlobBatch so users can at least see how many requests they’ve added. It doesn’t solve the size problem in general, but it would help here and it’s not bad info to give people.
We can add a public BlobBatch GetRetryBatch() to BlobBatch that would use the pipeline’s ResponseClassifier to determine which failures should be retriable and get a batch object that can be sent straight away.
We could potentially solve the size problem in a shared way. What if BlobBatch has a BlobBatch GetUnsentBatch(bool includeRetriableFailures = true) method that would return unsent requests because the size was too big, failures that could be retried, and null when exhausted. We could add a parameter to BlobBatchClient.SubmitBatchAsync that would keep sending until exhausted (defaulting to true).

Here’s how we could simplify the code with 1 and 2:

        [Test]
        public async Task Batch_Retries2()
        {
            // Create a bunch of blobs and fill a queue with their Uris
            using var scenario = Scenario();
            Queue<Uri> blobUris = new Queue<Uri>(await scenario.CreateBlobUrisAsync(1000));
            const int fullBatchSize = 256;

            // Create a batch
            BlobBatchClient client = scenario.GetBlobBatchClient();
            BlobBatch batch = client.CreateBatch();

            // Keep iterating until all the blobs have been deleted
            while (blobUris.Count > 0)
            {
                // Fill up the batch
                while (blobUris.Count > 0 && batch.RequestCount < fullBatchSize)
                {
                    batch.DeleteBlob(blobUris.Dequeue());
                }

                // Send the batch (and ignore failures - true failures will be
                // completely ignored, retriable failures will be resent)
                await client.SubmitBatchAsync(batch, throwOnFailure: false);

                // Create a new batch and add any retriable failures to it
                batch = batch.GetRetryBatch();
            }
        }

And here’s how we could simplify the code with 3:

        [Test]
        public async Task Batch_Retries3()
        {
            // Create a bunch of blobs and get their Uris
            using var scenario = Scenario();
            var blobUris = await scenario.CreateBlobUrisAsync(1000);

            // Create a batch full of deletes
            BlobBatchClient client = scenario.GetBlobBatchClient();
            BlobBatch batch = client.CreateBatch();
            foreach (var uri in blobUris)
            {
                batch.DeleteBlob(uri);
            }

            // Send the batch (and ignore failures since we're processing
            // them ourselves while continuing to send retriable failures
            // until exhausted)
            await client.SubmitBatchAsync(batch, throwOnFailure: false, untilExhausted: true);
        }

I don't have the spare mental cycles to think of good names, but hopefully you get the idea.

rakshith91 commented 4 years ago

https://github.com/rakshith91/azure-sdk-for-python/blob/batch_samples/sdk/storage/azure-storage-blob/tests/test_blob_samples_batching.py#L82

An example for python

adrianhall commented 4 years ago

From the architecture board:

1 and #2 from @johanste comment above is in scope for this discussion

3 isn't a partial failure case - so it's not in scope

4 is OUT OF SCOPE for this discussion

Storage, right now, deals with homogeneous batches. We can't rule out heterogenous batches. Johan introduced a new term "smurf".
Heterogenous batches are IN SCOPE for this discussion.

.NET (and Java) There is no way to detect an explodey Response - need an isExplodey() or similar. We would love to have an "isRetryable" flag on the response for error conditions. The code sample #3 gets over the size of batch issue (where the # items to be sent can't be known)

Python Q on whether we should throw on any error in the batch. The point is that the delete_blob vs delete_blobs APIs feel similar, but act differently.

Action Items

@tg-msft will brief the arch board (send links) on the .NET response classifier. Potential arch board discussion.
@tg-msft to add RequestCount to the BlobBatch
@adrianhall what status codes get returned to understand whether feature is being used / what retryable usage is being used. @rakshith91 will update Python libs to match the appropriate throwing semantics. @johanste will look at prior art for batching - should we throw on convenience methods?

The OData 3.0 batching spec is not clear. However, it would be a version bump change if they changed the behavior.

Board decision:

Stay conservative - add minimal, and document the rebatching code.
Partial failures for submitBatch should not throw by default (the batch was successfully submitted), but look at prior art in specific languages.
If the service returned 202 but its actually a failure, throw an error

We need some more info from @johanste before closing on whether convenience methods should throw or not on a partial failure.

We need this in before the API breaking change window closes (maybe next Thursday)

adrianhall commented 4 years ago

Link to arch board recording: https://msit.microsoftstream.com/video/95f73359-e914-408f-b3f1-9fb9ac5a364f

Azure / azure-sdk

Change Request: Success but some failed response handling #580

1 and #2 from @johanste comment above is in scope for this discussion

3 isn't a partial failure case - so it's not in scope

4 is OUT OF SCOPE for this discussion

Action Items