Azure / azure-functions-host

The host/runtime that powers Azure Functions
https://functions.azure.com
MIT License
1.92k stars · 440 forks

Cold start of Azure Function with large npm module trees is slow in Consumption plan #298

Closed arafato closed 6 years ago

arafato commented 8 years ago

Hi--,

I've noticed that the initial call of an Azure function (i.e. a couple of minutes after the last call) takes a significant amount of time (up to 15 secs). Is there any way to pre-warm a function other than initially calling it? Will the startup performance be improved in the near future? I'm using it on the Dynamic plan. Thanks!

Best regards --Oliver.

davidebbo commented 8 years ago

Currently, there is a short idle timeout on Function apps, so if they don't get requests for 5 minutes the next request causes a cold start which takes longer. We are looking into increasing this timeout, which will make the situation much less common.

FlavorScape commented 8 years ago

I've seen delays of up to 2 minutes, or even timeouts, with an Azure Function app and Node.js on a dynamic hosting plan. This is terrible! Running Functions app version 0.5.

arafato commented 8 years ago

Since upgrading to 0.5 delays have increased from 15 secs up to 50 secs on my side.

arafato commented 8 years ago

I'm currently mitigating the cold-start issue with an additional pre-warming Azure function that periodically calls the other Azure function like so.

let http = require('http');

module.exports = function (context, myTimer) {
    http.get('http://your.azure-function.net', (res) => {
        context.log(`Successfully called endpoint. Got response: ${res.statusCode}`);
        context.done();
    }).on('error', (e) => {
        context.log(`Error: ${e.message}`);
        context.done();
    });
};

This function is called every 4 minutes through a timer-triggered function with the cron expression `0 */4 * * * *`.
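For reference, the timer trigger's schedule lives in the function's function.json. A minimal sketch of such a binding (the binding name is illustrative; the six-field NCRONTAB expression matches the one above):

```json
{
  "bindings": [
    {
      "type": "timerTrigger",
      "name": "myTimer",
      "direction": "in",
      "schedule": "0 */4 * * * *"
    }
  ],
  "disabled": false
}
```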

vance commented 8 years ago

I tried this. I have a timer calling every 4 minutes. I still get delays of up to 5 MINUTES. Here's the log of it. Fine, fine, fine, TERRIBLE...

(screenshot: 2016-09-05, 12:12:35 AM)

vance commented 8 years ago

So, my classic plan app is running Function App v0.4, and it seems that the debugging messages may work better? I get errors showing that I don't get in the dynamic app (the dynamic app just hangs for long periods). Could faults in one function affect the performance of the whole app? Sometimes I even get "BAD GATEWAY".

davidebbo commented 8 years ago

This is strange. To make sure that this is indeed related to 0.5 and not some other factors, can you try going back to 0.4 (instructions here)?

peterjuras commented 8 years ago

For me personally, it is not only about the cold start/idle issue, but as I wrote in #529 I think it would be good to have some transparency of the scenarios that are targeted by Azure functions.

If I understand things correctly, the startup cost of a function is not only incurred on a cold start of a single container, but also affects new instances when the dynamic plan wants to scale out.

This means that if I have an HTTP-triggered API function, and one instance of that function is already running (let's say an execution takes 100ms), it could wait until the execution finishes and then serve the new request, leading to a maximum execution time of 200ms (which is kind of fine).

But instead, the dynamic plan could decide to scale out, and now we have the cold start penalty again. This means that instead of a maximum execution time of 200ms, this request will go above and beyond 5 seconds.

Could you explain how you are handling these cases?

davidebbo commented 8 years ago

@peterjuras let's keep this issue focused on the potential cold start regression which is causing cold starts of over 1 minute for some users. We can use a separate thread for a more general discussion of how the system works.

For users running into these extreme cold starts, various questions that would help investigating:

  1. Does 0.4 vs 0.5 make a difference?
  2. What functions are in the Function App? Is it just the one JS function, or are there others? In particular, do you have any F# functions (there may be an issue there)?
  3. Can you share the name of a Function App where you saw this, and the UTC time of one specific slow cold start?

vance commented 8 years ago

So I did a long-running test again where I ping every 3 minutes; the response times are all under 1000 ms unless one of the scheduled jobs failed to fire. Then, after the six-minute lapse, the response time goes to a whopping 150,000-280,000 ms. I will try reverting to 0.4 and see what happens.

davidebbo commented 8 years ago

@vance so you mean you are measuring http functions, but the app happens to also have a scheduled function? What's the function's schedule? Where does 'six minute lapse' come from?

vance commented 8 years ago

I mean I have a timer-triggered function doing an HTTP request to my other function. It is set to 3 minutes. However, sometimes it does not fire, and six minutes elapse. The next time it runs, the execution time is extreme. Sometimes it's so slow it creates a ripple effect where it goes from 280K to 140K to 30K ms or so, before it finally returns to under 1K ms for the next 10-20 calls.

davidebbo commented 8 years ago

That's very odd behavior. If you can share your Function App name (either directly or indirectly), and the UTC time when you saw this, we can hopefully analyse the situation.

vance commented 8 years ago

Sure, here's a dummy app in the same region. fscape-dynamic3. Any function endpoint shows the extreme time in the logs, I think.

The screenshot is dated

05 Sep 2016 19:12:35 GMT

so it should be around this time.

vance commented 8 years ago

So I reverted to v0.4: similar results. Also, I don't know if it's related, but sometimes the "pinged" services even time out, or even issue a CORS error, while other requests work. I tried just having a local script run a query every 10 minutes. See that result in the second image. No server changes happened at this time.

(first screenshot)

(second screenshot)

dennis-yemelyanov commented 7 years ago

I'm seeing similar symptoms. Although this might be an unrelated issue, in my case most of the time seems to be spent loading the azure-storage library. Any ideas if this is normal or how this can be fixed?

context.log('requiring azure');
const azure = require('azure-storage');
context.log('required azure'); 

2016-09-10T21:24:46.626 requiring azure
2016-09-10T21:27:04.641 required azure

vance commented 7 years ago

Actually, this might explain a lot of the delay. I think it may be multiplied if you hit a bunch of functions simultaneously when coming out of sleep. Maybe it pegs the I/O for the VM. I do know that this dependency seems rather large...

I have no idea how this virtualization is engineered, but I wonder if there's some way to snapshot these static resources as globals in a shared memory location that's immutable. Just a thought, don't know if that's even possible.

dennis-yemelyanov commented 7 years ago

Yeah, keeping it in memory might not be possible in some scenarios (at least not within the host process, which might restart etc.). I think the main question though is why does even fresh initialization take so long? Azure-storage indeed has tons of dependencies (I was shocked when I first saw it, way too many for a library that effectively is just a simple REST API wrapper). But when I run the same code locally (in a new node process) it executes instantly without any issues, which makes me think it could be an Azure runtime issue...

vance commented 7 years ago

Wonder if it's the network or storage used. Maybe when these instances come back from idle, the data must get loaded over the network since they are not bound to any particular hardware (or VM)? I did not see these long startup times when hitting Classic Service Plan app (with AlwaysOn turned off).

arafato commented 7 years ago

Good point @vance . Deleting an "average" node_modules folder on a dynamic plan takes well over 5 minutes in many cases. IOPS seems to be very low...

vance commented 7 years ago

Any way to have the container manager or whatever app pool manager cache the resources on the container disk, so they at least come back to life without going over the network? ... at least for the first instance.

dennis-yemelyanov commented 7 years ago

Looks like the issue is still happening. Running a simple script that has one line to require 'azure-storage' takes more than a minute. What's the plan to address this issue? Is there any timeline? We need to see if we can keep using Azure functions, because currently this looks like a blocking issue.

dennis-yemelyanov commented 7 years ago

Also just tested this with the Classic plan. The delay is 5 seconds, not as bad as the Dynamic plan, but still seems higher than normal.

kevinbayes commented 7 years ago

This is still happening on 0.6, and we seem to be seeing huge delays when a function has been inactive for a while.

Normal behaviour is around 200ms but when inactive for a while we are seeing 2+ minute response times.

We are using dynamic plan, 512mb, node http triggers.

davidebbo commented 7 years ago

Yes, we have an issue when dealing with large npm trees and 'require' directives. The reason is that the file system we're on is Azure Files, which is very slow.

We are looking at options to optimize this.

lemkepf commented 7 years ago

Has there been any progress on this? We are seeing 30+ second cold startup times on some very simple c# azure functions.

davidebbo commented 7 years ago

@lemkepf I just renamed the issue to better capture the core issue we are discussing here, which is a Node/npm specific thing.

Since you're using C# with a simple function, you are likely running into something else. To avoid merging the two, can you open a new issue with more details?

lemkepf commented 7 years ago

@davidebbo - Thanks. Issue entered: https://github.com/Azure/azure-webjobs-sdk-script/issues/838

cloudkite commented 7 years ago

Any updates on this? The warmup time for our app is normally around ~10 secs; the node_modules folder is about ~25 MB. Is there a recommended workaround in the meantime, like using an App Service plan instead of the Consumption plan?

(screenshot: 2016-11-29, 11:37:54 AM)

davidebbo commented 7 years ago

Sorry, no significant update yet, but we are looking at approaches that will make this faster.

Using a Classic app service plan should indeed help a bit as it uses a faster file system. Still, it will be slower than what you'd see on a local machine.

cloudkite commented 7 years ago

@davidebbo thanks, does the Classic app service plan still shutdown after a certain amount of idle time?

davidebbo commented 7 years ago

@cloudkite not if you use a dedicated plan (Basic/Standard/Premium) and turn on Always On.

Of course, at that point you're no longer using a 'serverless' offering, but for some scenarios, it can still be a good approach.

cloudkite commented 7 years ago

Great, thanks, will use that as a workaround until module loading speed is usable on the Consumption plan.

mjyeaney commented 7 years ago

Hi team - working on a demo project to explore "serverless" designs with my customers, and running into this same issue of slow response times.

Essentially, I have a few functions tied together via Service Bus topics that do some basic tasks such as sending SMS messages (via Twilio), inserting data into DocDB, and even one that responds to HTTP requests from the browser (running a DocDB query).

Since the browser endpoint has visible impact, the delay is quite noticeable - often upwards of 10 seconds, but will then be <100msec for a few requests before going back to ~10 seconds. The DocDB metrics show consistent query times of a few milliseconds, so I'm pretty sure this is just the Function runtime overhead causing the variability.

These functions are written in NodeJS, and both have fairly large package trees. I have one function written in PowerShell (just to test) that is running from a Timer that seems to be fine - it's mostly the NodeJS ones that seem impacted.

Glad to share any other details needed.

yvele commented 7 years ago

(@cloudkite how did you generate this nice diagram?)

cloudkite commented 7 years ago

@yvele it's a Mac app: https://daisydiskapp.com/

jarben commented 7 years ago

Any update on this? I'm on the latest (~1) runtime, using the Standard plan with Always On checked. (screenshot)

Still getting around 15s cold starts on each fn; subsequent runs are less than a sec. (This was 1-2 mins on the Consumption plan.)

node_modules is around 20MB (using aws-sdk, request, queryparser)

Didn't figure out the timeout on the Standard plan, but it seems like more than 5 mins.

yvele commented 7 years ago

@jarben I don't know if that can help with Azure Function startup, but I have a cleaning script that removes useless files in node_modules. I made a private gist of it for reference: https://gist.github.com/yvele/32bd59dbf88bd23f8ad2468f8a32391e (it's only a hacky script for now, but I'm going to make something more robust).

After cleaning my application went from 20mb to 10mb... My CI with deployment script is faster.. But I don't know if that can help with startup time.
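The kind of pruning such a script does can be sketched as follows. The patterns are illustrative (docs and test directories that npm packages ship but never need at runtime), and it is demonstrated on a throwaway directory so it is safe to run:

```shell
# Sketch of pruning non-runtime files from a dependency tree.
# Patterns are illustrative; verify nothing you need matches them
# before pointing this at a real node_modules.
mkdir -p prune_demo/pkg/test
touch prune_demo/pkg/index.js prune_demo/pkg/README.md prune_demo/pkg/test/spec.js
find prune_demo -type f -name '*.md' -delete
find prune_demo -type d -name test -prune -exec rm -rf {} +
ls prune_demo/pkg
```

Fewer files means fewer round trips to the slow file system on cold start, which is why this helps even when the byte savings are modest.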

jarben commented 7 years ago

Thanks @yvele for the suggestion! Every second is definitely worth saving, although even 5-7s is still way too much.

I think it's acceptable when cold starts are maximum of 1-2s but anything above is a show stopper for us.

Also, not sure why this is still an issue when Always On is activated as part of the Standard (S1) plan; as far as I understand it, it should be a dedicated VM as per this description:

(screenshot of the S1 plan description)

davidebbo commented 7 years ago

@jarben I changed the issue title to clarify that this is discussing the Consumption plan. When you use dedicated plans like 'Standard' with Always On, you should not see any slow downs, as the app stays warm all the time. If you do, let's discuss separately.

jarben commented 7 years ago

Sure @davidebbo , I'll make a sample demo and file another issue.

murraybauer commented 7 years ago

My function (consumption plan) has 5 lines of code to make a http request using the require('request') module. Cold start is always about 28 seconds!! Then 90ms or so for subsequent requests until it's unloaded again.

I'm surprised node.js support for Functions GA'ed.

Can understand why it's so slow if it's loading individual files over a remote network share. With AWS Lambda you zip up the function and node_modules folder as a single file for deployment.

yvele commented 7 years ago

The most problematic thing is that sometimes we have timeouts on cold starts (today I had one in Node.js with consumption plan in version ~1)

TurboKoder commented 7 years ago

Any updates on this issue? It is a bit confusing when it looks like serverless concepts are not for real-life scenarios :( I believe many of these cases are related to huge libraries for Azure-related services, so if it is not possible to connect these services with full-functionality bindings, then maybe a different fix would be possible. If many people are installing and then loading the same copies of Node.js libraries, why not try storing the most common libraries in one fast directory (like ultra-global libraries or straight npm) instead of thousands of separate copies in personal users' directories? I believe the real structure might be very different, but looking at Kudu it looks like you are creating something like an OS-level VM, when in my opinion it should be something much closer to a function-level VM structure.

davidebbo commented 7 years ago

We are actively looking at ways to make this faster, but it is still going to take a while longer to get this right. We're hoping to find some initial bits people can try in about a month.

The suggestion to have local copy of those packages on the local drive is something we considered, but in practice it would not work well. The problem is that all those packages are constantly updated, and we don't want to force everyone to use the same one. Also, it breaks down the minute you need some other package that is not in the local list. So we're looking for a more general solution that will limit access to the fast drive, without reducing generality.

1N50MN14 commented 7 years ago

Same issue here... 2-3 minutes cold start time (!) sometimes timeouts too.. ironically it's the Azure Storage SDK filling up node_modules..

1N50MN14 commented 7 years ago

For anyone interested, I managed to find a reasonable workaround that dramatically reduced the cold start time for some of my functions from 2-3 minutes (or even timeouts!) to a couple of seconds. Here's what I did:

  1. node_modules went into .gitignore
  2. Created a dependency file which requires the modules I need from node_modules
  3. Created a single server-side bundle file using browserify: `browserify --node deps.js > deps.bundle.js`
  4. Require the bundled dependency file instead - even better, an uglified version (uglifyjs or similar)

I've set up a simple watcher: whenever package.json changes, the bundle gets recreated so I don't have to think about it. That's it: no node_modules, no cry.

To keep the functions warm I'm intending to set up a cron function which calls all functions every 5 minutes.

@davidebbo Is there any way I could store and require cross-function / shared modules? i.e. something like require('../shared/azstorage.js')?

yvele commented 7 years ago

@1N50MN14 Nice! I'm gonna try that but with webpack instead of browserify ;)

You may have a look at https://github.com/asprouse/serverless-webpack-plugin

kevinbayes commented 7 years ago

@1N50MN14 that is exactly how we handle extensions to our code. We have a lib folder as a function that holds a bespoke SDK. Then we reference it from within our other function apps.

Example:
/Library/sdk.js
/Function1/index.js

In /Library/sdk.js

var https = require('https');
var RestClient = (function () {
    //Some code
    return RestClient;
}());

exports.RestClient = RestClient;

In /Function1/index.js

var sdk = require('../Library/sdk');

module.exports = function(context, req) {
    //Some code
    var rest = new sdk.RestClient(context, url);
};

cloudkite commented 7 years ago

For those considering bundling Node.js code with webpack, this might be helpful: https://github.com/palmerhq/backpack