Azure / azure-webjobs-sdk


WebJobs dashboard maxing out Azure App Service CPU & RAM #1050

Open · Bio2hazard opened this issue 7 years ago

Bio2hazard commented 7 years ago

Copied from https://github.com/projectkudu/kudu/issues/2351 by recommendation from @davidebbo

Summary: We have a very active web job whose dashboard we hadn't visited for a long time. Upon visiting the dashboard, the Kudu w3wp.exe strangled our app service for 7 days. The solution was to delete all dashboard-related blobs, hard-kill the SCM w3wp.exe process, and restart the web app. I estimate that we had roughly 1 million function executions between dashboard visits.

Proposed solution: Automatically purge dashboard blob storage files & logs older than X days, so that the data the dashboard has to analyze on a visit remains manageable. This would also address the annoying tendency of web jobs to rack up ever-growing storage costs.


Original issue submitted to Kudu: Unfortunately this will be pretty vague, as I had to deal with solving the issue and thus could not investigate exactly what caused it to begin with, but I will try to give as much information as I can and I hope the cause can be found.

We have a small app service plan that 7 days ago went from its usual 10% CPU and 70% RAM to 30% CPU and 90+% RAM. Response time for a website hosted on the app service went from a couple of milliseconds to 100-120 seconds.

Using the "Metrics per Instance" dashboard, we were able to trace the issue to our web jobs project.

Kudu was unresponsive; only after restarting the web app were we able to access the Kudu interface. In the Process tab, we noticed that our individual web jobs were barely consuming CPU time or memory, but the w3wp.exe that hosts them was at a staggering 6,000+ seconds of CPU time, a 1.3 GB working set, and 1.8 GB of private memory.

We scaled the Azure app service up to a medium instance, and the w3wp.exe process immediately jumped to over 3 GB of working set and private memory, maxing out the medium instance.

I downloaded a minidump and profiled the w3wp.exe process, but was unable to determine where the memory was going. However, I was able to trace the CPU time to Newtonsoft's JSON.NET serialization.

Next, we stopped our web app completely. The "Metrics per Instance" view still showed high CPU and memory usage for the app service, but all sites listed under "Sites In Service Plan" showed minimal CPU and memory usage.

I think it started when we looked at the web job function dashboard, which had to index some 5,000 items. We run a lot of small jobs; one had over 1.1 million recorded function invocations.

To fix it, I deleted all web-job-related blob containers, killed the offending w3wp.exe process from the Kudu dashboard, and restarted the web app. So far so good.

Here are some charts showing a high app service load despite none of our sites consuming any resources.

[Screenshot: Process tab after turning off the Web App]

[Screenshot: Web App turned off]

[Screenshot: Sites on the App Service]

[Screenshot: App Service load not dipping]

Extra web job information: continuous jobs, with functions triggered by Azure Storage queue messages. Deployed through Visual Studio, written in .NET using the WebJobs SDK.


One more symptom that I neglected to mention is that the local read bytes were through the roof.

Here is a chart of the last 24 hours showing a staggering 81 GB of local read bytes on the web job. [Screenshot: Network Statistics chart]


I pulled the chart from the Azure Portal. When looking at the Web App, there is a dashboard called "Metrics per Instance (App Service plan)"; the last chart in the list is the Network Statistics chart I pasted.

Edit: It's looking more and more likely that this was caused by opening the web jobs Kudu dashboard. After purging everything earlier, I didn't visit the dashboard for a few hours. When I just visited it, CPU briefly spiked to 51% and local read bytes also jumped to 69 MB.

As you said, I think that if the dashboard for an active web job isn't visited frequently, it tries to work through too much data, bringing the whole app service down.

A solution could be an option to automatically clean up data older than X days.

mathewc commented 7 years ago

The fact that our existing Dashboard logging is heavy and doesn't currently support retention policies is an unfortunate known issue, and is tracked by https://github.com/Azure/azure-webjobs-sdk/issues/560.

We are actively investigating an App Insights based solution for logging - work is in progress, but no ETA on release. @brettsam

There are ways to disable logging or change the log level for specific functions that might allow you to mitigate this issue. E.g., if it's one main function causing most of the load, you can set the TraceLevel for that one lower. However, I'm assuming you want to retain Dashboard logging in full; you just want it to work better. It also seems you're aware of the heavy hammer of deleting the blob logs so they won't be indexed, which again is not a great solution, I realize.
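For reference, a minimal sketch of dialing logging down at the host level with the 2.x JobHostConfiguration API (property names from memory of the SDK docs, so verify against your SDK version; the per-function TraceLevel mentioned above is a separate mechanism not shown here):

```csharp
using System.Diagnostics;
using Microsoft.Azure.WebJobs;

class Program
{
    static void Main()
    {
        var config = new JobHostConfiguration();

        // Assumed 2.x API: raise the trace threshold so only warnings
        // and errors are emitted, cutting down the log volume produced.
        config.Tracing.ConsoleLevel = TraceLevel.Warning;

        new JobHost(config).RunAndBlock();
    }
}
```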

davidebbo commented 7 years ago

@mathewc what doesn't make sense is the massive amount of local read operations (80 GB of it!). @MikeStall said he doesn't think the dashboard would do that.

Still, it's worth trying to disable/reduce logging.

Bio2hazard commented 7 years ago

@davidebbo I think it has to do with serialization. When I traced the CPU usage I found that almost all time was spent in JSON.Net.

Edit: Actually, since the memory got maxed out, the local read operations are probably page file I/O.

Loading the trace in VS2017 actually gave me more information than 2015 did, down to the exact call stack:

[Screenshot: profiler call stack in VS2017]


How about a feature request for blob logs to use a folder structure similar to other Azure services, where the first folder indicates the year, the second the month, and so forth until an hourly granularity is reached?

That would make it trivial to write my own function that cleans up blobs older than X days.

It should not be very hard to implement, and since the indexing and data-reading code can use ListBlobs(useFlatBlobListing : true), it should be a non-breaking change?
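In the meantime, a hedged sketch of such a cleanup function using the classic WindowsAzure.Storage client (the container name, 30-day threshold, and connection-string lookup are placeholders; note the Dashboard indexer may be reading these blobs while you delete):

```csharp
using System;
using System.Linq;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class LogCleanup
{
    // Deletes log blobs older than maxAge. Uses a flat listing, so it
    // works regardless of whether a date-based folder structure exists.
    static void PurgeOldBlobs(string connectionString, string containerName, TimeSpan maxAge)
    {
        var client = CloudStorageAccount.Parse(connectionString).CreateCloudBlobClient();
        var container = client.GetContainerReference(containerName);
        var cutoff = DateTimeOffset.UtcNow - maxAge;

        foreach (var blob in container.ListBlobs(useFlatBlobListing: true)
                                      .OfType<CloudBlockBlob>())
        {
            if (blob.Properties.LastModified < cutoff)
                blob.DeleteIfExists();
        }
    }

    static void Main()
    {
        // Example: purge dashboard output logs older than 30 days.
        PurgeOldBlobs(Environment.GetEnvironmentVariable("AzureWebJobsDashboard"),
                      "azure-jobs-host-output", TimeSpan.FromDays(30));
    }
}
```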

brettsam commented 7 years ago

As an aside, if you're interested in App Insights, follow https://github.com/Azure/azure-webjobs-sdk-script/issues/727, where we'll be updating status. This applies to the WebJobs SDK as well, not just Script.

j055 commented 7 years ago

We need a step-by-step procedure for purging the logs, to avoid the high CPU and memory problem when visiting the dashboard page.

We have functions firing every minute so the indexing can be a real problem if left for a few days.

So far the best mitigation I have come up with is:

  1. Stop WebJob(s)
  2. Delete the blob containers:
     a. azure-jobs-host-archive
     b. azure-jobs-host-output
     c. azure-webjobs-dashboard
     d. azure-webjobs-hosts
  3. Wait 5-10 minutes, to avoid conflicts when the WebJob tries to recreate the containers, then start the services (a scripted sketch of step 2 follows below).

This is clearly not ideal but it’s still better than bringing the whole app service to a grinding halt. Please advise.
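For anyone scripting step 2 above, a minimal sketch with the classic WindowsAzure.Storage client (container names taken from the list above; stop the WebJob(s) first, per step 1):

```csharp
using System;
using Microsoft.WindowsAzure.Storage;

class PurgeDashboard
{
    static void Main()
    {
        var client = CloudStorageAccount
            .Parse(Environment.GetEnvironmentVariable("AzureWebJobsDashboard"))
            .CreateCloudBlobClient();

        var containers = new[]
        {
            "azure-jobs-host-archive", "azure-jobs-host-output",
            "azure-webjobs-dashboard", "azure-webjobs-hosts"
        };

        // Container deletion completes asynchronously server-side, and
        // Azure Storage rejects recreating a container mid-delete; that
        // is why step 3 waits 5-10 minutes before restarting the WebJobs.
        foreach (var name in containers)
        {
            client.GetContainerReference(name).DeleteIfExists();
        }
    }
}
```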

christopheranderson commented 7 years ago

So I see a couple items in here:

  1. Document the current behavior and best practices to avoid large amounts of logs slowing down your application
    • This essentially comes down to using a separate App Service Plan for high load scenarios.
  2. Structure the data so it is easier to purge.
    • The new Dashboard logic we use for Functions does use the monthly format to make it easy to remove old data, so if we were to do anything, it would be to release that in a proper way.
    • This is unlikely, though, because we're not confident that investing in our dashboard is the right strategy long term. In the near term, we'll have a release with App Insights support, as well as support for custom logging pipelines such as Serilog, which can easily output to something like Splunk.

The PR for the monitoring pipeline improvement is here: https://github.com/Azure/azure-webjobs-sdk/pull/1059

We're not sure if this will be 2.x or 3.x yet.

I'll keep this issue open until we have the doc written and the PR has a release version.

flagbug commented 7 years ago

So after integrating the newest Application Insights logging bits, it seems like the WebJobs SDK is still creating a crap-ton of blobs in the azure-jobs-host-output container, which are apparently super resource-intensive to process.

Is it expected that my webjobs are still creating entries in azure-jobs-host-output, even though I've integrated everything with Application Insights, or did I configure something wrong?

paulbatum commented 7 years ago

@flagbug Remove the AzureWebJobsDashboard app setting from your function app if you want to disable output to the dashboard.
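If you configure the host in code, the equivalent (assuming the 2.x JobHostConfiguration API) is to null out the dashboard connection string, which stops the host writing Dashboard logs to blob storage:

```csharp
using Microsoft.Azure.WebJobs;

class Program
{
    static void Main()
    {
        var config = new JobHostConfiguration();

        // No dashboard connection string means no Dashboard output,
        // so azure-jobs-host-output stops accumulating blobs.
        config.DashboardConnectionString = null;

        new JobHost(config).RunAndBlock();
    }
}
```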

FinVamp1 commented 6 years ago

Could we modify the "Indexing in Progress" message to include information about how this will use Kudu site resources and that it may take time to complete? CSS has been sharing this information with customers.

The Functions Invocation log shows this message ‘Indexing in progress. The information displayed might not be up to date.’

The WebJobs dashboard has an indexing component that runs in the SCM (Kudu) site alongside any executing WebJobs. It needs CPU and memory to perform this operation, and on a busy site with a lot of logs to process, indexing can take a long time and potentially may never catch up. This is perhaps related to the fact that the site is running on a single small instance.

The customer has a number of options to address this.

1) Use a different Web App/App Service plan and put the same AzureWebJobsDashboard setting in it; it will be able to run the dashboard indexing without competing for resources.
2) Truncate the dashboard logs and make a regular routine of visiting the dashboard. To do this, they can delete the blob containers beginning with "azure-jobs-" and "azure-webjobs-".
3) Scale up the instances until the indexing catches up.
4) Consider implementing the Application Insights support for Azure WebJobs, which provides a different way to view your WebJobs history and is not impacted by the issues detailed above (a wiring sketch follows below): https://github.com/Azure/azure-webjobs-sdk/wiki/Application-Insights-Integration
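For option 4, a minimal wiring sketch based on the linked wiki page (the AddApplicationInsights/AddConsole extension methods come from the Microsoft.Azure.WebJobs.Logging.ApplicationInsights and Microsoft.Extensions.Logging.Console packages; verify the exact signatures against the wiki):

```csharp
using System;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

class Program
{
    static void Main()
    {
        var config = new JobHostConfiguration();

        string key = Environment.GetEnvironmentVariable("APPINSIGHTS_INSTRUMENTATIONKEY");
        if (!string.IsNullOrEmpty(key))
        {
            // Route function logs to Application Insights (and the console)
            // instead of relying on the blob-backed Dashboard.
            config.LoggerFactory = new LoggerFactory()
                .AddApplicationInsights(key, null)
                .AddConsole();
        }

        new JobHost(config).RunAndBlock();
    }
}
```

Combined with removing the AzureWebJobsDashboard app setting, this avoids the dashboard indexing load described in this thread entirely.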