Azure / durabletask

Durable Task Framework allows users to write long running persistent workflows in C# using the async/await capabilities.
Apache License 2.0
1.52k stars 293 forks source link

DurableTaskStorageException: Value for one of the query parameters specified in the request URI is invalid #809

Open joerage opened 1 year ago

joerage commented 1 year ago

We are running in an issue with DTFx. Sometimes (and only in one region where our service is running - WestEurope), when enqueuing a queue message to start an orchestration, the queueing fails because of a bad parameter in the query string.

Sample query: https://mystorage.queue.core.windows.net:443/devcenterjobs-control-00/messages?messagettl=%E2%88%921

Error and stacktrace:

DurableTask.AzureStorage.Storage.DurableTaskStorageException: Value for one of the query parameters specified in the request URI is invalid. ---> Microsoft.WindowsAzure.Storage.StorageException: Value for one of the query parameters specified in the request URI is invalid. at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.ExecuteAsyncInternal[T](RESTCommand1 cmd, IRetryPolicy policy, OperationContext operationContext, CancellationToken token) at DurableTask.AzureStorage.Storage.AzureStorageClient.WrapFunctionWithReturnType(Func3 storageRequest, OperationContext context, CancellationToken cancellationToken) in //src/DurableTask.AzureStorage/Storage/AzureStorageClient.cs:line 158 at DurableTask.AzureStorage.TimeoutHandler.ExecuteWithTimeout[T](String operationName, String account, AzureStorageOrchestrationServiceSettings settings, Func3 operation, AzureStorageOrchestrationServiceStats stats, String clientRequestId) at DurableTask.AzureStorage.Storage.AzureStorageClient.MakeStorageRequest[T](Func3 storageRequest, String accountName, String operationName, String clientRequestId, Boolean force) in //src/DurableTask.AzureStorage/Storage/AzureStorageClient.cs:line 149

Looking at the code here, the value of the messagettl param should always be -1 (as we are not using NET462). And looking at storage client code, it should append it.

Any help to resolve this issue would be appreciated.

davidmrdavid commented 1 year ago

Thanks for reaching out @joerage. Just to be sure I understand this issue before jumping into it: you seem to emphasize the value of the message time-to-live, but nothing about the trace makes it immediately obvious to me why time-to-live may be at fault here. Can you please clarify this? Thanks!

joerage commented 1 year ago

Hi @davidmrdavid, I simply making this assumption because the call made to storage dataplane has a single query parameter: https://mystorage.queue.core.windows.net:443/devcenterjobs-control-00/messages?messagettl=%E2%88%921

davidmrdavid commented 1 year ago

@amdeel - do you have any advice on this? Perhaps we can add some more telemetry here to confirm that messagettl is in fact the parameter that is invalid, as the exception doesn't make this fully clear. From there, the next steps in investigation would be clearer.

jviau commented 1 year ago

When you decode that URL, it does decode to -1: %E2%88%921 == -1. Is this code actually consistent (regardless of correctness)?

If this is happening on only one region, is it an issue with Azure storage in that region? @joerage, can you try crafting a request with this that query param on it and sending it to different regions of Azure Storage? Asking because I imagine you have a few instances of Azure Storage readily available to test it out. If not, we can try that soon.

Recommended actions:

  1. Lets first verify if the DurableTask.AzureStorage code is inconsistent or not with this call stack.
    • Do we encode only sometimes?
  2. See how AzureStorage in different regions behaves with the encoded and not-encoded request.
  3. Should DurableTask.AzureStorage actually be leaving the messagettl query param off entirely when it is -1?
joerage commented 1 year ago

We found some more information since my last post.

  1. We saw the same encoding as you did.
  2. Every time the request to storage is sent without encoding, it is a valid request.
  3. Every time the request to storage gets its param encoded, it fails with 400.
  4. After all, it does not only happen in a single region. We have seen in some 8 regions. Although much more frequently in that single WestEurope region.
  5. In 2 different cases, multiple HTTP requests were received by multiple services in multiple regions. These requests were all correlated to a single incoming request (from ARM; then RPaaS fanned them out to all our regions). And they all failed the same way!!
  6. Only a few different customers are hitting this issue, but those customers see repeated such failures.

To your questions

  1. Encoding only happens sometimes, but no idea why.
  2. If param is encoding, they all fail, in all regions, otherwise it is good.
  3. Omitting this param will use the default TTL of 7 days, instead of infinite.

Edit: We figured out the issue. The problem is in the StorageAccount SDK, which uses long?.ToString() without specifying an invariant culture. Our service is using a ASP.NET Core middleware to set current culture from the incoming request ([app.UseRequestLocalization()](https://learn.microsoft.com/en-us/dotnet/api/microsoft.aspnetcore.builder.applicationbuilderextensions.userequestlocalization?view=aspnetcore-6.0)). We don't support all incoming cultures, just a subset. Still, in our case, one specific culture is causing the issue. But all the following culture will cause -1 to become encoded:

Culture: Estonian Culture: Estonian (Estonia) Culture: Basque Culture: Basque (Spain) Culture: Finnish Culture: Finnish (Finland) Culture: Faroese Culture: Faroese (Denmark) Culture: Faroese (Faroe Islands) Culture: Swiss German Culture: Swiss German (Switzerland) Culture: Swiss German (France) Culture: Swiss German (Liechtenstein) Culture: Colognian Culture: Colognian (Germany) Culture: Lithuanian Culture: Lithuanian (Lithuania) Culture: Norwegian Bokmål Culture: Norwegian Bokmål (Norway) Culture: Norwegian Bokmål (Svalbard & Jan Mayen) Culture: Norwegian Nynorsk Culture: Norwegian Nynorsk (Norway) Culture: Romansh Culture: Romansh (Switzerland) Culture: Northern Sami Culture: Northern Sami (Finland) Culture: Northern Sami (Norway) Culture: Northern Sami (Sweden) Culture: Slovenian Culture: Slovenian (Slovenia) Culture: Swedish Culture: Swedish (Åland Islands) Culture: Swedish (Finland) Culture: Swedish (Sweden)

Our solution is to set the culture to InvariantCulture right before starting the orchestration (and setting it back right after).

CultureInfo.CurrentCulture = CultureInfo.InvariantCulture;

To repro, simply set the culture before starting the orchestration e.g:

CultureInfo.CurrentCulture = new CultureInfo("sv");
CultureInfo.CurrentUICulture = new CultureInfo("sv");
var instance = await client.CreateOrchestrationInstanceAsync(name, version, input);

Here is a small program showing the issue (thanks @chrissmiller!)

using System;
using System.Collections.Generic;   
using System.Globalization;
public class Program
{
    public static void Main()
    {
        TimeSpan? timeToLive = TimeSpan.FromSeconds(-1);
        long timeToLiveInSeconds = (long)timeToLive.Value.TotalSeconds;
        // Display value using some standard format specifiers.
        var fmts = new List<string>(){
            "D"
        };
        var cultures = CultureInfo.GetCultures(CultureTypes.AllCultures);
        foreach (CultureInfo culture in cultures){
            foreach (string fmt in fmts){
                test(timeToLiveInSeconds, fmt, culture);    
            }
        }
    }
    public static void test(long val, string format, CultureInfo culture){
        string a = val.ToString(format, culture);
        if (Uri.EscapeUriString(a) == "%E2%88%921"){
            //Console.WriteLine(format);
            Console.WriteLine($"Culture: {culture.DisplayName}");
            Console.WriteLine($"String value: {a}");
            Console.WriteLine($"URI Escaped Value: {Uri.EscapeUriString(a)}");
        }
    }
}
chrissmiller commented 1 year ago

Not certain where the fix is, but the issue doesn't appear to repro on the newer Storage SDKs (e.g., Azure.Storage.Queues)

davidmrdavid commented 1 year ago

Wow, amazing find @chrissmiller! @amdeel, @jviau: do we think we could implement a similar or equivalent trick to ignore the user-specified CultureInfo when encoding our requests to Azure Storage?

cgillum commented 1 year ago

do we think we could implement a similar or equivalent trick to ignore the user-specified CultureInfo when encoding our requests to Azure Storage?

We're in the process of migrating to modern versions of the Azure Storage SDK. Assuming the underlying issue is fixed there, it might make sense to just rely on the SDK upgrade rather than pushing a short-lived fix to work around issues in the deprecated SDKs.

jviau commented 1 year ago

I believe it is addressed in the newer SDKs.

As for working around this in existing - two things to check:

  1. Is it fixed in newer versions of the deprecated SDK?
  2. If not, we could fix it by pushing a known working culture onto the execution context before calling any storage APIs (and then restoring the previous culture after the call is complete)

@cgillum, I think the SDK upgrade will warrant a major version rev. We may want to consider fixing this in the existing version, just in case the migration from DurableTask.AzureStorage v1 to v2 is not frictionless and some customers are blocked. They may still benefit from this fix in v1. Not saying that is the case, but something to consider.