Azure / azure-sdk

This is the Azure SDK parent repository and mostly contains documentation around guidelines and policies as well as the releases for the various languages supported by the Azure SDK.
http://azure.github.io/azure-sdk
MIT License

Change Request: Success but some failed response handling #580

Closed bterlson closed 4 years ago

bterlson commented 5 years ago

Need guidelines for sending a sequence of things where the first N things succeed before there's a failure.

johanste commented 4 years ago

Different "batch"-like scenarios:

1) Single HTTP request; the request body contains multiple items that will be operated upon. Each item can succeed/fail independently, in addition to the whole envelope/request failing. Example: Text Analytics.
2) HTTP multipart/mixed, with multiple "nested" HTTP requests. A variation of 1). Example: storage batch.
3) Streaming/controlling "transactional" batches. Either all messages succeed or all fail. Example: Event Hubs.
4) Streaming/improving efficiency. Lift the burden of gathering messages/requests from the client application into the client library. Example: a hypothetical buffered sender for Event Hubs.
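Scenario 1 boils down to a single envelope response that carries a per-item status for each item in the request body. A minimal, library-agnostic Python sketch of handling such a response (all names here are hypothetical, not an actual Azure SDK API):

```python
def split_batch_response(items, statuses):
    """Partition submitted items by their per-item status.

    `items` and `statuses` are parallel sequences: statuses[i] is the
    status code the service reported for items[i] inside a single
    envelope response that itself came back as HTTP 200.
    """
    succeeded, failed = [], []
    for item, status in zip(items, statuses):
        (succeeded if 200 <= status < 300 else failed).append((item, status))
    return succeeded, failed

# The envelope succeeded, but the second item failed with a 404.
ok, bad = split_batch_response(["a", "b", "c"], [200, 404, 202])
```

The caller then decides whether a non-empty `bad` list should raise or simply be reported, which is exactly the question this issue is about.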

Questions to be answered:

Additional twists to the story to consider or ignore for now:

a) Dependencies between requests in the batch (OData supports this). I have not seen any services in Azure actually make use of this.
b) Heterogeneous operations, i.e. multiple completely unrelated requests (e.g. a couple of deletes, an upload or two, and a get of blob metadata/properties).

richardpark-msft commented 4 years ago

Do retries complicate this at all? Do they need to trim out the "successful" parts of the batch? And if different failures occur across batches, what does the operation return?
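One plausible answer to the trimming question is yes: between attempts the client keeps only the items that failed with a retriable status, so successes are never resent. A hedged, service-agnostic sketch (send_batch and is_retriable are stand-ins for whatever the client library provides, not a real SDK API):

```python
def send_with_retries(items, send_batch, is_retriable, max_attempts=3):
    """Resend only the items that failed with a retriable status.

    send_batch(items) -> list of per-item status codes, parallel to items.
    Returns (succeeded, failed) lists of (item, status) pairs; items that
    exhaust max_attempts are reported with a status of None.
    """
    succeeded, dead = [], []
    pending = list(items)
    for _ in range(max_attempts):
        if not pending:
            break
        statuses = send_batch(pending)
        retry = []
        for item, status in zip(pending, statuses):
            if 200 <= status < 300:
                succeeded.append((item, status))
            elif is_retriable(status):
                retry.append(item)  # trimmed: successes are not resent
            else:
                dead.append((item, status))
        pending = retry
    # Anything still pending has exhausted its attempts.
    dead.extend((item, None) for item in pending)
    return succeeded, dead

# Usage with a fake transport: item 2 fails once with 503, then succeeds.
_calls = {"n": 0}
def _fake_send(items):
    _calls["n"] += 1
    return [202, 503, 202] if _calls["n"] == 1 else [202] * len(items)

ok, failed = send_with_retries([1, 2, 3], _fake_send, lambda s: s == 503)
```

Note that this design answers the second question too: different failures across attempts are all accumulated into the returned `failed` list, each tied to its original item.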

adrianhall commented 4 years ago

Additional twists:
c) The client library does not know the batch size prior to sending the first batch.
d) Encoding overhead (of conversion to JSON or XML, for example).
e) Timing issues (you can't safely do the encoding until the point of transmission).

Note that each of the various types MIGHT be a separate pattern - we don't have to have a common pattern for all of them.

Concerns around Storage - AggregateException doesn't allow me to tie the exceptions to the original request.

@tg-msft @alzimmermsft @rakshith91 to provide code samples on how to reliably delete 500 blobs in a batch.

We will revisit this on Wednesday.

tg-msft commented 4 years ago

For .NET it would look something like this:

        [Test]
        public async Task Batch_Retries()
        {
            // Create a bunch of blobs and fill a queue with their Uris
            using var scenario = Scenario();
            Queue<Uri> blobUris = new Queue<Uri>(await scenario.CreateBlobUrisAsync(1000));
            const int fullBatchSize = 256;

            // Create a batch
            BlobBatchClient client = scenario.GetBlobBatchClient();
            BlobBatch batch = client.CreateBatch();
            var batchTracking = new Dictionary<Response, Uri>();
            int batchSize = 0;

            // Keep iterating until all the blobs have been deleted
            while (blobUris.Count > 0)
            {
                // Fill up the batch
                while (blobUris.Count > 0 && batchSize < fullBatchSize)
                {
                    Uri uri = blobUris.Dequeue();
                    Response response = batch.DeleteBlob(uri);
                    batchTracking[response] = uri;
                    batchSize++;
                }

                // Send the batch (and ignore failures since we're processing
                // them ourselves)
                await client.SubmitBatchAsync(batch, throwOnFailure: false);

                // Create a new batch and add any retriable failures to it
                batch = client.CreateBatch();
                batchSize = 0;
                var tracking = new Dictionary<Response, Uri>();
                foreach ((Response previous, Uri uri) in batchTracking.Select(p => (p.Key, p.Value)))
                {
                    // We're considering 503 retriable
                    if (previous.Status == 503)
                    {
                        // Add it to the next batch
                        Response response = batch.DeleteBlob(uri);
                        tracking[response] = uri;
                        batchSize++;
                    }
                }
                batchTracking = tracking;
            }
        }

The annoying bits are tracking the batch size (which we can fix), keeping enough info to replay failed requests (which we can fix), and grouping requests into appropriately sized batches (which we can probably fix).

  1. We should add a public int RequestCount { get; } to BlobBatch so users can at least see how many requests they’ve added. It doesn’t solve the size problem in general, but it would help here and it’s not bad info to give people.

  2. We can add a public BlobBatch GetRetryBatch() to BlobBatch that would use the pipeline’s ResponseClassifier to determine which failures should be retriable and get a batch object that can be sent straight away.

  3. We could potentially solve the size problem in a shared way. What if BlobBatch had a public BlobBatch GetUnsentBatch(bool includeRetriableFailures = true) method that returned any requests left unsent because the batch grew too big, plus any failures that could be retried, and null when exhausted? We could also add a parameter to BlobBatchClient.SubmitBatchAsync that keeps sending until exhausted (defaulting to true).

Here’s how we could simplify the code with 1 and 2:

        [Test]
        public async Task Batch_Retries2()
        {
            // Create a bunch of blobs and fill a queue with their Uris
            using var scenario = Scenario();
            Queue<Uri> blobUris = new Queue<Uri>(await scenario.CreateBlobUrisAsync(1000));
            const int fullBatchSize = 256;

            // Create a batch
            BlobBatchClient client = scenario.GetBlobBatchClient();
            BlobBatch batch = client.CreateBatch();

            // Keep iterating until all the blobs have been deleted
            while (blobUris.Count > 0)
            {
                // Fill up the batch
                while (blobUris.Count > 0 && batch.RequestCount < fullBatchSize)
                {
                    batch.DeleteBlob(blobUris.Dequeue());
                }

                // Send the batch (and ignore failures - true failures will be
                // completely ignored, retriable failures will be resent)
                await client.SubmitBatchAsync(batch, throwOnFailure: false);

                // Create a new batch and add any retriable failures to it
                batch = batch.GetRetryBatch();
            }
        }

And here’s how we could simplify the code with 3:

        [Test]
        public async Task Batch_Retries3()
        {
            // Create a bunch of blobs and get their Uris
            using var scenario = Scenario();
            var blobUris = await scenario.CreateBlobUrisAsync(1000);

            // Create a batch full of deletes
            BlobBatchClient client = scenario.GetBlobBatchClient();
            BlobBatch batch = client.CreateBatch();
            foreach (var uri in blobUris)
            {
                batch.DeleteBlob(uri);
            }

            // Send the batch (and ignore failures since we're processing
            // them ourselves while continuing to send retriable failures
            // until exhausted)
            await client.SubmitBatchAsync(batch, throwOnFailure: false, untilExhausted: true);
        }

I don't have the spare mental cycles to think of good names, but hopefully you get the idea.

rakshith91 commented 4 years ago

https://github.com/rakshith91/azure-sdk-for-python/blob/batch_samples/sdk/storage/azure-storage-blob/tests/test_blob_samples_batching.py#L82

An example for Python.

adrianhall commented 4 years ago

From the architecture board:

#1 and #2 from @johanste's comment above are in scope for this discussion.

#3 isn't a partial-failure case, so it's not in scope.

#4 is OUT OF SCOPE for this discussion.

Storage, right now, deals with homogeneous batches. We can't rule out heterogeneous batches. Johan introduced a new term, "smurf".
Heterogeneous batches are IN SCOPE for this discussion.

.NET (and Java): There is no way to detect an "explodey" Response; we need an isExplodey() or similar. We would love to have an "isRetryable" flag on the response for error conditions. Code sample #3 gets around the batch-size issue (where the number of items to be sent can't be known up front).
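The requested "isRetryable" flag amounts to a response classifier. A minimal sketch using one common convention for transient statuses (408, 429, and most 5xx); the actual Azure SDK pipeline classifiers may use a different set:

```python
# Assumed transient-failure set; a real classifier may differ per service.
RETRIABLE_STATUSES = {408, 429, 500, 502, 503, 504}

def is_retriable(status_code):
    """Return True for transient failures worth resending in a new batch."""
    return status_code in RETRIABLE_STATUSES
```

Attaching this decision to the response object itself (rather than making callers hard-code status lists, as the earlier Batch_Retries sample does with 503) is the point of the feature request.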

Python: A question on whether we should throw on any error in the batch. The point is that the delete_blob and delete_blobs APIs feel similar but act differently.
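The divergence being called out is that the singular convenience method raises on failure while the plural batch method can report per-item results. A pure-Python illustration of the two behaviors with stub types (this is not the actual azure-storage-blob API, just a sketch of the semantics under discussion):

```python
class BlobError(Exception):
    """Stand-in for a service error carrying the failing status code."""
    def __init__(self, name, status):
        super().__init__(f"{name}: {status}")
        self.status = status

class StubContainer:
    """Stand-in for a container client; `store` maps blob name -> exists."""
    def __init__(self, store):
        self.store = store

    def delete_blob(self, name):
        # Singular form: raises on any failure.
        if not self.store.pop(name, False):
            raise BlobError(name, 404)

    def delete_blobs(self, *names):
        # Plural/batch form: returns a per-item status instead of raising.
        return [202 if self.store.pop(name, False) else 404
                for name in names]

container = StubContainer({"a": True, "b": True})
statuses = container.delete_blobs("a", "missing", "b")
try:
    container.delete_blob("missing")
    raised = False
except BlobError:
    raised = True
```

Whether the plural form should instead raise on any per-item failure is exactly the open question for the board.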

Action Items

The OData 3.0 batching spec is not clear. However, it would be a version bump change if they changed the behavior.

Board decision:

We need some more info from @johanste before closing on whether convenience methods should throw or not on a partial failure.

We need this in before the API breaking change window closes (maybe next Thursday)

adrianhall commented 4 years ago

Link to arch board recording: https://msit.microsoftstream.com/video/95f73359-e914-408f-b3f1-9fb9ac5a364f