Azure / azure-webjobs-sdk

Azure WebJobs SDK

Continuous EventHub webjob fails with StorageException (TimeoutException) and holds up subsequent processing. #983

Open Zoidborg7 opened 7 years ago

Zoidborg7 commented 7 years ago

I have a similar error to issue #953, but I do not believe it is related to CPU load (our CPU load is always low). I believe it has to do with delays or overloads in the Storage Blob connection used for the Dashboard logging information.

I have a few webjobs that consume event hub messages in batches. At times there are delays in messages being consumed, and I find these TimeoutException errors in the diagnostic log, several every hour.

After some analysis over the past week, I believe it is caused by the mechanisms that upload webjob dashboard data. I disabled all dashboard interaction by setting JobHostConfiguration.DashboardConnectionString to null. Since I did this for all my webjobs, I have had no further delay gaps in the processing of the event hub messages.
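
For reference, the change amounts to something like the following at host startup (a minimal sketch against the 2.x JobHostConfiguration API; the Event Hub extension and binding registration are elided):

        using Microsoft.Azure.WebJobs;

        class Program
        {
            static void Main()
            {
                var config = new JobHostConfiguration();

                // Setting the dashboard connection string to null disables all
                // dashboard logging, so the host never touches the dashboard blobs.
                config.DashboardConnectionString = null;

                // ... register the Event Hub extension and other bindings as before ...

                var host = new JobHost(config);
                host.RunAndBlock();
            }
        }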

I am using the 2.0.0-beta2 versions of the WebJobs SDK and Service Bus package. All non-WebJobs-SDK NuGet packages are the latest production release as of 1/18/2017.

This is the declaration for the EventHubTrigger-based WebJob function:

        public static async Task ProcessGeoTraqrEventHubMessagesAsync (
            [EventHubTrigger("GeoTraqrEventHub")] EventData[] geoTraqrEventDatas,
            CancellationToken cancellationToken);

The following CSV log file excerpt shows the exception detail for when the TimeoutException occurs:

date,level,applicationName,instanceId,eventTickCount,eventId,pid,tid,message,activityId
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,Executed: 'GeoTraqrProcessor.ProcessGeoTraqrEventHubMessagesAsync' (Failed),
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,Microsoft.WindowsAzure.Storage.StorageException: The client could not finish the operation within specified timeout. ---> System.TimeoutException: The client could not finish the operation within specified timeout.,
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,   --- End of inner exception stack trace ---,
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,   at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndExecuteAsync[T](IAsyncResult result),
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,   at Microsoft.WindowsAzure.Storage.Blob.CloudBlobContainer.EndCreate(IAsyncResult asyncResult),
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,   at Microsoft.WindowsAzure.Storage.Blob.CloudBlobContainer.EndCreateIfNotExists(IAsyncResult asyncResult),
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,   at Microsoft.WindowsAzure.Storage.Core.Util.AsyncExtensions.<>c__DisplayClass1`1.<CreateCallback>b__0(IAsyncResult ar),
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,--- End of stack trace from previous location where exception was thrown ---,
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task),
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task),
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,   at Microsoft.Azure.WebJobs.Host.Protocols.PersistentQueueWriter`1.<EnqueueAsync>d__0.MoveNext(),
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,--- End of stack trace from previous location where exception was thrown ---,
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task),
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task),
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,   at Microsoft.Azure.WebJobs.Host.Loggers.CompositeFunctionInstanceLogger.<LogFunctionStartedAsync>d__0.MoveNext(),
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,--- End of stack trace from previous location where exception was thrown ---,
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task),
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task),
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,   at System.Runtime.CompilerServices.TaskAwaiter.ValidateEnd(Task task),
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.<ExecuteWithLoggingAsync>d__1a.MoveNext(),
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,--- End of stack trace from previous location where exception was thrown ---,
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task),
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task),
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.<TryExecuteAsync>d__2.MoveNext(),
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,Request Information,
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,RequestID:,
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,RequestDate:,
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,StatusMessage:,
2017-01-19T14:03:20,Information,my-application-name,6d3b73,636204314003508575,0,12268,39,,

The stack trace indicates that this is entirely SDK-contained code, most likely executing after the triggered webjob function has completed. I know that several blobs are updated and/or created in the dashboard every time a function executes. With the traffic we get in our event hubs, these jobs execute a function several times per second at peak, and usually at least about once per second.

So now, without the Dashboard storage connection, everything works great, at least as far as the event hub message processing jobs are concerned. Shutting off access to dashboard storage entirely isn't an ideal solution, however. The primary problem will be with the continuous webjob I have that runs on a timer trigger and is also marked as a Singleton: I know that the blob-based lock for the singleton uses the dashboard storage connection, so that won't work if I scale horizontally. For now I will have to use the database locking system and custom-code the single-instance enforcement for the timer-triggered job function, along the lines of the sketch below. As I mentioned, not an ideal solution.
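
For the SQL-based exclusivity, what I have in mind is roughly the following sketch, using SQL Server's sp_getapplock to hold an exclusive, session-scoped application lock while the timer job runs (the lock name, timeout, and helper shape are placeholders, not code we run today):

        using System.Data;
        using System.Data.SqlClient;

        static class JobLock
        {
            // Try to take an exclusive, session-scoped application lock; only one
            // instance across scaled-out hosts will hold it at a time.
            public static bool TryAcquire(SqlConnection openConnection, string lockName)
            {
                using (var cmd = new SqlCommand("sp_getapplock", openConnection))
                {
                    cmd.CommandType = CommandType.StoredProcedure;
                    cmd.Parameters.AddWithValue("@Resource", lockName);
                    cmd.Parameters.AddWithValue("@LockMode", "Exclusive");
                    cmd.Parameters.AddWithValue("@LockOwner", "Session");
                    cmd.Parameters.AddWithValue("@LockTimeout", 0); // fail fast if another instance holds it

                    var result = new SqlParameter("@Result", SqlDbType.Int) { Direction = ParameterDirection.ReturnValue };
                    cmd.Parameters.Add(result);

                    cmd.ExecuteNonQuery();
                    return (int)result.Value >= 0; // >= 0 means the lock was granted
                }
            }
        }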

brettsam commented 7 years ago

It does look to be coming from the dashboard logging code -- simply checking whether the container exists is timing out. We have a similar issue filed with the QueueListener here: https://github.com/Azure/azure-webjobs-sdk/issues/944. For starters, I don't think that should be an operation that brings down the host. It seems like we could log a warning/error and continue on.

We've seen these kinds of failures before when you have high CPU and async callbacks (mostly timers in our experience) aren't able to execute in time due to thread contention. We've theorized that if you are making too many network calls you could see similar behavior.

Could you give me an overview of what your webjobs are doing (do you make 1+ network calls for each Event Hub message, for example)? What kind of plan are you running on? How many EventHub messages are you processing (ballpark)?

Zoidborg7 commented 7 years ago

This is probably an issue we have had for a while that was not exposed until we started checking the "health" of our devices by looking at the last time their status was updated in our database. When we started doing this we found these gaps where a message processing webjob would be held up, and those hold-ups correlated to the instances of the TimeoutException.

There are a handful of webjobs that each process a different event hub. The devices from which the messages originate send different types of messages to the different event hubs so they can be processed independently. The number of messages depends on the frequency of the message types sent from the devices and the number of devices online. Based on a rough count from the log produced by the message handler (which takes the messages from the devices then distributes them to the event hubs), there were about 12-15 messages across all types per second during the time of the log excerpt included above. The message handler, by the way, has been operating smoothly and continuously without any problems getting messages into the event hubs.

Each of the message processing webjobs uses a pattern similar to the declaration excerpted above: they take messages in batches and spin up threads to process each of the messages, roughly as sketched below. Each thread gets its own Azure SQL database connection to process the message and store the changes to the data model that the message indicates. Data access goes through Entity Framework, and there is a single SaveChanges() call on the data context for each processing thread.
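
In simplified form the batch pattern looks roughly like this (a sketch only; ProcessSingleMessageAsync and MyDbContext are stand-ins for the real per-message handler and EF data context, the concurrency is shown with tasks rather than raw threads, and the usings are approximate for the 2.x-era packages):

        using System.Linq;
        using System.Threading;
        using System.Threading.Tasks;
        using Microsoft.Azure.WebJobs;
        using Microsoft.ServiceBus.Messaging; // EventData in the 2.x-era Service Bus package

        public static class GeoTraqrProcessor
        {
            public static async Task ProcessGeoTraqrEventHubMessagesAsync(
                [EventHubTrigger("GeoTraqrEventHub")] EventData[] geoTraqrEventDatas,
                CancellationToken cancellationToken)
            {
                // Handle every message in the batch concurrently, each with its own
                // EF data context and a single SaveChanges() at the end.
                var tasks = geoTraqrEventDatas.Select(async eventData =>
                {
                    using (var db = new MyDbContext()) // stand-in for the real EF context
                    {
                        // stand-in for the real per-message processing logic
                        await ProcessSingleMessageAsync(db, eventData, cancellationToken);
                        db.SaveChanges(); // one SaveChanges per message, as described above
                    }
                });

                await Task.WhenAll(tasks);
            }
        }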

The message handler and the webjobs all run on the same app service. The message handler is a websocket handler in the web application, and the webjobs are associated with that web application. The App Service plan hosting the app service uses Standard: 1 Large pricing tier, and the CPU usage is consistently 10-12%. In the past we were using a lesser tier, and our CPU loads were too high and the processes were bogging down. The symptoms of this bogging down, however, were that event hub messages were not being consumed fast enough. There were no exceptions evident when this was happening.

The Azure SQL database has had consistently low DTU and memory use: DTU is consistently between 10 and 12%, and memory use is between 28 and 32%. The database is on the Standard: 1 Large tier.

Basically, we have focused on the code and database design, and then on picking the appropriate pricing tiers to handle our projected needs for the near term with room for the unexpected. Responsiveness is the top priority for this application. We also moved from message queues to the event hub and switched our webjob functions from single-message processing to batch processing months ago, which has increased our message-handling capability significantly.

When we had single-message processing, we did notice that the event hubs would bog down even though CPU and DTU percentages were low. We determined that the most likely cause was the underlying storage blob use by the individual webjob function calls, so we moved to batch event-hub-triggered functions, which solved that bog-down problem. So it has been in the back of our minds for a while that there could be a problem associated with the storage account.

I was at one time using storage blob locks to guarantee exclusive access to any particular device's data record(s) across multiple message processing threads that could run concurrently. That proved to be completely unreliable, so I moved to a SQL-based exclusivity system modeled in part on the blob lock architecture. This has been very successful. I mention this just to point out again a possible weakness in blob storage access: the problem we had with the blob locks was that they would not be reliably released, even though the lock release calls were apparently executing without error.

I hope this provides enough information; I'm happy to provide anything I can that will help. :-)

Zoidborg7 commented 7 years ago

As I mentioned before, we eliminated the storage blob access by blanking the dashboard storage connection string. Since we did that, there have been zero bog-downs in any of the message processors. We went from tens or even hundreds of alerts, caused by delays in the event hub processors leaving database records un-updated, to absolutely zero such alerts. Good news for the application, but it is of course preferable to be able to use the subsystems that some features depend on. Fortunately for now, bypassing the dashboard doesn't hurt anything.

karachapman commented 7 years ago

I just verified that I am having the same issue that Zoidborg7 had. I have a continuous webjob (triggered by a Queue message) that works fine when deployed to our development environment, but was almost always failing when deployed to our load-testing environment. I stumbled upon this post just now and figured I would try disabling the dashboard as a last-ditch effort to get the webjob to run consistently. Of course, that has resolved the issue. Prior to doing this, I kept getting this exception for the deployed webjob running in my load-testing environment:

Unhandled Exception: Microsoft.WindowsAzure.Storage.StorageException: The client could not finish the operation within specified timeout. ---> System.TimeoutException: The client could not finish the operation within specified timeout.

One other thing to note: the webjob always worked while running locally using my load-testing data. When deployed, it would get a status of "Never Finished" over and over again, sometimes working on the 5th or 6th try (retrying roughly every 10 minutes). I tried all sorts of things to fix the issue (such as removing logging with the TextWriter and using console logs instead, increasing various timeouts, etc.), but in the end it was the Azure WebJobs Dashboard that was causing the problem. Needless to say, not being able to monitor a webjob that we put into production is not a viable solution. Is there any chance of this issue being resolved anytime soon? I am currently using Microsoft Azure WebJobs version 1.1.2. Please let me know if I can provide any more details to help resolve the issue.

Zoidborg7 commented 7 years ago

@karachapman -- I definitely had executions that lasted much longer than the actual execution time of the function code, and now that you mention it, I do believe I would get the occasional Never Finished result, though usually I would get the Failed result when I received the error message.

brettsam commented 7 years ago

Sorry for the delay on this; let me see if I can get some more details. First, I'd recommend you both move up to the 2.0.0 release, though I'm not sure this will have been fixed there.

I'm wondering if the timeout you're hitting is client- or server-side. My guess is that it's client-side. Did either of you try setting ServicePointManager.DefaultConnectionLimit = Int32.MaxValue; as the first line in your Main method?

@karachapman -- you mention that you have a QueueTrigger function running that exhibits this behavior. What does that function do? How much logging does it do? Also, what kind of load do you throw at it in order to hit the issue? I'd like to try to build myself a repro. If you're willing to share the code with me (via email on my profile if you'd like) that would help, but I understand if that's not possible.

karachapman commented 7 years ago

Thanks for the response. The 2.0.0 release has since fixed the issue!

ArnimSchinz commented 7 years ago

We have the exact same problem. Every now and then the webjob stops importing and stalls for five minutes, and in some cases we get the exception posted above. In the webjob dashboard it also looks like it took the webjob five minutes to finish some events, but that is not possible, since we have log statements at the end of the processor that show the real end of the execution (just a few seconds after the start). Setting the dashboard connection string to null solved it for us; we've tried it with multiple WebJob deployments and nothing else worked. We are currently using the 2.0.0 release and still have to use the workaround. We've also tried the 2.1 beta, but the problem still exists there. Thank you very much for the workaround; I hope the problem will be fixed in the next release.

bz0108 commented 7 years ago

@ArnimSchinz

How about running a test?

Write the following code at the beginning of your app, or at least before calling JobHost.RunAndBlock (if you call it):

        // Raise the thread-pool minimums so async continuations and timer callbacks
        // are not starved, and raise the outbound connection limit for storage calls.
        System.Threading.ThreadPool.SetMinThreads(200, 200);
        System.Net.ServicePointManager.DefaultConnectionLimit = 1920;

Then test it.
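
For placement, something like this at the very top of Main, before the JobHost is constructed (a sketch assuming the usual JobHostConfiguration/JobHost setup; the values above are suggestions and can be tuned):

        static void Main()
        {
            // Set these before any storage or Event Hub clients are created.
            System.Threading.ThreadPool.SetMinThreads(200, 200);
            System.Net.ServicePointManager.DefaultConnectionLimit = 1920;

            var config = new JobHostConfiguration();
            // ... binding/extension registration as in the existing job ...
            var host = new JobHost(config);
            host.RunAndBlock();
        }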

pisethdanh commented 7 years ago

We are encountering this issue as well. We are currently using Microsoft.Azure.WebJobs v2.0.0. Following the suggestion to increase the thread pool's minimum thread count and the default connection limit did not work for us. Would changing our WebJob to an Azure Function App help us in any way? Or will there be a fix for this coming in the near future?

jorgst commented 7 years ago

We have the same issue (v2.1.0-beta1)

OzBob commented 6 years ago

In #976, @paulbatum recommends "that you stick with storage sdk 7.x while on version 2.x of webjobs. The upcoming version 3.x of webjobs will target storage sdk 8.x" (my question: is v9 on the roadmap?).

I have the same issue on Storage 7.2.1 and WebJobs 1.1.2, so I will attempt to upgrade to WebJobs 2. The changelog at https://github.com/Azure/azure-storage-net/blob/master/changelog.txt lists no breaking changes from 7.x to 9.x, but some are reported as issues, so I'll follow @paulbatum's recommendation and not upgrade the Storage SDK.

FYI https://github.com/Azure/azure-webjobs-sdk/releases?after=v2.1.0-beta1-10950 shows some WebJob breaking changes from 1.x to 2:

  1. BlobTrigger poison queue changes in multi storage account scenarios (breaking) #804
  2. Made TraceFilter.Events a method that returns a snapshot, rather than it being a live collection as it was before (breaking)
  3. SendGrid binding moved to latest SendGrid v3 API (Breaking)
  4. Make IValueProvider.GetValue async (breaking). This is only an issue for extension authors

lovettchris commented 3 years ago

I was having the same problem with Azure Functions (which are built on WebJobs) until I switched to a more expensive VM that runs my job with more than one core. That way the health monitoring is not blocked by the work my job does. I would argue that since the WebJobs architecture requires health monitoring, the platform should pay for (and automatically configure) that extra core to get health information back from my job, instead of me having to worry about it.