Azure / durabletask

Durable Task Framework allows users to write long running persistent workflows in C# using the async/await capabilities.
Apache License 2.0
1.47k stars 287 forks source link

Delay decoding a message's contents in DTFx.AzureStorage #1110

Open davidmrdavid opened 1 month ago

davidmrdavid commented 1 month ago

We recently encountered a case where the Azure Functions scale controller was failing to make a scaling decision due to the following error:

System.FormatException: The input is not a valid Base-64 string as it contains a non-base 64 character, more than two padding characters, or an illegal character among the padding characters.
   at System.Convert.FromBase64CharPtr(Char* inputPtr, Int32 inputLength)
   at System.Convert.FromBase64String(String s)
   at Microsoft.WindowsAzure.Storage.Queue.CloudQueueMessage.get_AsString()
   at DurableTask.AzureStorage.Storage.QueueMessage..ctor(CloudQueueMessage cloudQueueMessage) in /_/src/DurableTask.AzureStorage/Storage/QueueMessage.cs:line 26
   at DurableTask.AzureStorage.Storage.Queue.PeekMessageAsync() in /_/src/DurableTask.AzureStorage/Storage/Queue.cs:line 193
   at DurableTask.AzureStorage.Monitoring.DisconnectedPerformanceMonitor.GetQueueLatencyAsync(Queue queue) in /_/src/DurableTask.AzureStorage/Monitoring/DisconnectedPerformanceMonitor.cs:line 243

As can be seen in the stack trace, this error is thrown when trying to Peek a given queue, which we use to observe the message age amongst other metadata.

The error occurs because, after receiving an azure queue message (of type CloudQueueMessage), we wrap it in our own abstraction called QueueMessage. When constructing this object, we call the AsString method of CloudQueueMessage, which obtains the message contents, and store them in a class property named Message. This method can throw if the message cannot be decoded, yielding the error above.

This is a kind of poison message scenario that can affect the Scale Controller. In the Scale Controller's case, we don't really need to decode the message's contents, as all we need is message and queue metadata such as the average message Age, the number of messages in the queue, and so on.

As a result - this PR moves the call to AsString from the CloudQueueMessage constructor to instead occur in the implementation of it's Message property. This will prevent the ScaleController from error'ing out, as it never needs to access the message's actual contents.

As an aside - I meant to write a test for this, but the CloudQueueMessage type is sealed, so it cannot be mock'ed, and newer versions of the Azure Storage SDK are more robust against undecodable errors. So I decided against it.