petmat opened this issue 5 years ago
@petmat Would you please let us know your function app name(s) and the approximate times you ran your cold start tests? You can use this template: https://github.com/Azure/azure-functions-host/wiki/Sharing-Your-Function-App-name-privately
Also, when you say you are running from zip, where is your zip content located? Is it in a storage blob or a local zip? We know of cold start issues when using external storage; more info here: https://docs.microsoft.com/en-us/azure/azure-functions/run-functions-from-deployment-package
"When running a function app on Windows, the external URL option yields worse cold-start performance. When deploying your function app to Windows, you should set WEBSITE_RUN_FROM_PACKAGE to 1 and publish with zip deployment"
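For reference, moving from an external package URL to a locally mounted package can be done by pushing the zip with zip deployment and then setting the app setting the docs describe. A minimal Azure CLI sketch (the app and resource group names are placeholders, and `app.zip` stands in for your actual package):

```shell
# Hypothetical example: <app-name> and <rg-name> are placeholders.

# Push the zip with zip deployment...
az functionapp deployment source config-zip \
  --name <app-name> --resource-group <rg-name> --src app.zip

# ...and tell the host to run from the locally mounted package
# instead of an external URL.
az functionapp config appsettings set \
  --name <app-name> --resource-group <rg-name> \
  --settings WEBSITE_RUN_FROM_PACKAGE=1
```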
@safihamid Sure thing! My cold start tests were run in function apps with names:
azure-func-med-v1, azure-func-med-v2, azure-func-simple-v1, azure-func-simple-v2
The benchmarks were run from 2019-01-29T23:02:34.625Z to 2019-02-15T08:30:19.995Z at 30-minute intervals.
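The measurement step behind each benchmark sample is just timing a single call after the idle gap. A minimal sketch of that timing step (the `timeCall` helper name is mine, not from the actual benchmark code, and the example operation is a stand-in for a real HTTP request):

```javascript
// Hypothetical sketch: time one async call and report elapsed wall-clock ms.
async function timeCall(fn) {
  const start = process.hrtime.bigint();
  const result = await fn();
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  return { result, elapsedMs };
}

// Example: timing a stand-in async operation instead of a real HTTP request.
timeCall(() => new Promise((resolve) => setTimeout(() => resolve("ok"), 50)))
  .then(({ result, elapsedMs }) => {
    console.log(result, `${elapsedMs.toFixed(1)} ms`);
  });
```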
Also, if you want to look into the actual production function app our customer is having performance issues with: its region is North Europe, the execution time was 2019-02-28T21:00:57.098, and the execution ID is 5791cf21-4594-4ba9-bf92-13ff5dc833ff.
Our ZIP file is located in a storage blob. I will run another benchmark with functions running from a deployment package and get back to you with the results.
OK, thanks! Let me know your results with the local zip (time range, plus the 50th and 99th percentiles) and I will take a look. We have already made some improvements for this scenario which should improve numbers by 10-15%, and the fix should be released everywhere by the end of March. Also, cold start for JavaScript V1 is currently better than V2, but we are working on it, and V2 cold start for JavaScript should improve significantly within the next couple of months.
I created a new function app called azure-func-med-v1-depzip. I have now accumulated results from the time range 2019-02-28 to 2019-03-05 and put them next to the previous ones:
The ones on the right are the results with the local zip / deployment package. It does not seem to have a huge effect. If anything, the cold starts seem to have a bit less variation; they are grouped a bit more tightly.
I tried to get the average and 99th-percentile numbers from Application Insights, but for some reason it does not show any requests, even though the instrumentation key is set in the app settings and Application Insights shows up in the function settings. Here are my own calculations: the average is 2031 ms and the 99th percentile is 6363 ms. Note that this is for the azure-func-med-v1-depzip function app, not the production chatbot app the previous screenshots were for; I have not yet made any changes there.
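For anyone wanting to reproduce these calculations from the raw latency samples, a small sketch using the nearest-rank percentile method (the sample values here are made up for illustration, not the actual benchmark data):

```javascript
// Mean of an array of latency samples (ms).
function average(samples) {
  return samples.reduce((sum, v) => sum + v, 0) / samples.length;
}

// Nearest-rank percentile: the smallest sample value such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}

// Made-up latency samples (ms), purely for illustration.
const latencies = [1800, 2100, 1950, 2300, 6400, 2000, 1900];
console.log(`avg: ${average(latencies).toFixed(0)} ms`);
console.log(`p99: ${percentile(latencies, 99)} ms`);
```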
I can see your P95+ numbers have improved. We are tracking cold start work in these two issues, which should bring the numbers down:
https://github.com/Azure/azure-functions-host/issues/4184 https://github.com/Azure/azure-functions-host/issues/4183
Thank you for the info! I'm looking forward to seeing improvements once these issues get resolved.
I'm working on a backend API for a chatbot, and we have been using Azure Functions on the consumption plan from the beginning. Lately we have become concerned about the performance of our backend API, since we have been getting reports of the bot occasionally being really slow and sometimes timing out altogether. We have diagnosed these issues as being caused by cold start latency.
I've been trying to run my own tests to find out what kind of issues we are having with cold starts. All our Functions are written in JavaScript with HttpTriggers, so I tested only that type of function. The findings deeply trouble me.
I started by looking at the Performance tab in Application Insights, and it revealed that while the average operation times are really good, the 99th percentile is less stellar.
Here's the average:
And here is the 99th percentile:
I blurred out the function names, but these are the most crucial ones: the first and third are called each time a new chat is started with the bot, and the second is called each time a new message is received from the user. The experience is really bad for the user in two ways. Waiting for the bot's response is a bad experience in itself, but what is even worse is that sometimes the wait is so long that the chat client times out completely while waiting for the answer.
Next I wanted to see how much of the cold start issue was caused by our code rather than the runtime. So I created two kinds of Function Apps. The first is the simplest hello world with no NPM packages as dependencies. The second is still a simple hello world, but I added all the NPM packages we use with our chatbot. Mind you, we are already using the Run-From-Package deployment method to combat the cold start issue. Even so, when zipped, the larger Function App was 4 MB in size.
Here are the results for the larger Function App:
While warm function instances are super fast, the cold starts range from a minimum of 2 seconds to an average of around 4 seconds, with some even longer than 10 seconds. The most concerning thing to me is the high variation in cold starts. A 2-second cold start would be reasonable and something we could live with, but more than 4 seconds starts to be painful, and more than 10 seconds is just impossible. Also, these are only hello world functions with no actual business logic; real functions take time to execute, which further increases how long the user sometimes has to wait for the bot to respond.
Lately I also came across this excellent blog post by Mikhail Shilkov: https://mikhail.io/serverless/coldstarts/big3/. It shows that cold starts in Azure Functions are quite bad compared to AWS Lambda. This was a bit of a shock, since I didn't expect the difference to be so huge. I ran my benchmarks on AWS to verify this myself and got the same kind of results: on AWS, the cold starts are much more consistent in length, ranging from 1 to 2 seconds.
I hope you don't take my feedback the wrong way. I've been using Azure Functions since it was announced in 2016 and have been a strong proponent of it in my community. Cold starts were not really a huge deal until I started developing the current project, where Azure Functions is used as a backend API. But this cold start issue is something I really hope will be solved soon, especially since v2 runtime cold starts are even worse than what I reported (for JavaScript functions).
I know I could run Azure Functions on a regular hosting plan instead of the consumption plan, but that kind of defeats the whole point of going serverless for me. Our chatbot backend is getting quite a lot of traffic, and the amount increases constantly. I would not want to take on the task of manually scaling the service up as needed. I'm also concerned about how it would affect the cost of running the backend API.
If there is any light you could shed on this issue, or any alternative remedies or workarounds you could suggest for the cold start issues we are having, I would greatly appreciate it! Thank you!