Azure / azure-sdk-for-net

This repository is for active development of the Azure SDK for .NET. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/dotnet/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-net.
MIT License
5.26k stars 4.61k forks source link

ClientModel: ClientRetryPolicy should consider `retry-after` header #44222

Open annelo-msft opened 4 months ago

annelo-msft commented 4 months ago

Azure.Core's retry policy uses the default DelayStrategy implementation to check and wait to retry based on the response RetryAfter header. Since we don't have DelayStrategy in SCM, we did not implement this logic in ClientRetryPolicy. We should implement this, though, and may be able to address this issue as part of the addition of LRO abstractions to SCM in https://github.com/Azure/azure-sdk-for-net/issues/42114. We could have an internal type that implements exponential backoff and respects the RetryAfter header by default, and make it overridable if someone specifies a fixed polling interval. Subtypes of the retry policy or LRO implementations from TypeSpec could then override this behavior if needed for their service-specific implementations.

github-actions[bot] commented 4 months ago

Thank you for your feedback. Tagging and routing to the team member best able to assist.

Pilchie commented 1 month ago

Sorry to bump this old issue - I've been using Azure OpenAI, and I notice that it uses SCM's ClientPiplelinePolicy, but OpenAI's rate limiting does specify the retry after header. Any chance you know of an implementation of an SCM retry policy that does take this into account?

annelo-msft commented 1 month ago

Hi @Pilchie - a fix for this issue was released in SCM 1.1.0-beta.6. If you reference this version, do you still see the issue?

Pilchie commented 2 weeks ago

Hi @annelo-msft - thanks for looking at this, and sorry for the delay. Today I tried out pinning SCM 1.1.0-beta7 and seeing what happened. Unfortunately, with the default retry policy, this doesn't actually help. Looking at the code for GetNextDelay, I think the issue is that tryCount is 0, so nextDelayMilliseconds ends up being 1 day, and the retryAfter.TotalMilliseconds is smaller than nextDelayMilliseconds.

Pilchie commented 2 weeks ago

Nevermind - I am getting an x-ms-retry-after of 1 day, so that seems ok.

I think there might still be an issue with message.RetryCount being zero leading to a negative time to wait, which means there will be an immediate retry the first time, but that's not as big an issue to me.

annelo-msft commented 2 weeks ago

Hi @Pilchie -- x-ms-retry-after is an Azure-specific header, so not supported by default by SCM, because SCM provides general HTTP support, but not Azure-specific features. For Azure.AI.OpenAI, @trrwilson was adding capabilities into that client to support the Azure-specific features -- I don't know what the status of that is currently -- if that PR has been merged and what client version has it. @trrwilson could you share that info?

I think there might still be an issue with message.RetryCount being zero leading to a negative time to wait, which means there will be an immediate retry the first time, but that's not as big an issue to me.

This is interesting, and if there's a bug in SCM, I'd love to know about it so we can get it fixed! Would you be willing to share a quick repro case for this?

Pilchie commented 1 week ago

Interesting - I'm not sure how the delay got set to 1 day. Maybe both headers were sent. As for the repro, I was using the code below, with an Azure OpenAI endpoint set to the minimum rate limit:

using Azure.AI.OpenAI;
using Azure.Identity;
using OpenAI.Embeddings;
using System.ClientModel.Primitives;

const string endpointEnvironmentVariableName = "AZURE_OPENAI_ENDPOINT";
const string apiKeyEnvironmentVariableName = "AZURE_OPENAI_APIKEY";

var endpoint = Environment.GetEnvironmentVariable(endpointEnvironmentVariableName) ?? throw new Exception($"Set '{endpointEnvironmentVariableName}'.");
var apiKey = Environment.GetEnvironmentVariable(apiKeyEnvironmentVariableName) ?? throw new Exception($"Set '{apiKeyEnvironmentVariableName}'.");

var client = new AzureOpenAIClient(
    new Uri(endpoint),
    apiKey,
    new AzureOpenAIClientOptions
    {
        RetryPolicy = new ClientRetryPolicy(),
    });

var embeddingClient = client.GetEmbeddingClient("text-embedding-3-large");

const int count = 100;
var inputs = new List<string>(count);
for (int i = 0; i < count; i++)
{
    inputs.Add($"This is a string to embed. It is number {i}.");
}
var result = embeddingClient.GenerateEmbeddings(inputs);

Console.WriteLine(result.GetRawResponse().Status);

Note that I was unable to get a breakpoint to hit inside the retry method, or to step into it, so it's possible something else was happening - I could only inspect the state once it was in the 1-day Wait call.

Pilchie commented 1 week ago

Confirmed that RetryAfter was the header that was sent (in addition to x-ratelimit-reset-tokens and x-ratelimit-remaining-requests)