Miserlou / Zappa

Serverless Python
https://blog.zappa.io/
MIT License

Unpredictable function execution/APIGW response times #418

Open smoll opened 7 years ago

smoll commented 7 years ago

This is more of a question/investigatory issue rather than a bug report per se, but I've noticed this behavior on 2 different Zappa-based APIs I have started working on.

Basically, I've noticed that when I call an endpoint after not using it for a while, I will get a very long response time, e.g.

$ time http https://abcd.execute-api.us-east-1.amazonaws.com/dev/
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 19
Content-Type: text/html; charset=utf-8
Date: Thu, 20 Oct 2016 07:18:17 GMT
Set-Cookie: zappa=AQ4; Path=/
Via: 1.1 2e4bb09f82d346fc09032cfb35e3c3e4.cloudfront.net (CloudFront)
X-Amz-Cf-Id: lEhkWktpPID6i-IflVRUPALDgCRF8agsL7CI2jnyvD1CcPVKSjG1OQ==
X-Cache: Miss from cloudfront
x-amzn-Remapped-Content-Length: 19
x-amzn-RequestId: 5f184bf1-9695-11e6-9e74-b70f8d856374

This is the index!

real    0m5.518s
user    0m0.275s
sys     0m0.255s

but if I call it again immediately after I get a much faster response

$ time http https://abcd.execute-api.us-east-1.amazonaws.com/dev/
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 19
Content-Type: text/html; charset=utf-8
Date: Thu, 20 Oct 2016 07:18:19 GMT
Set-Cookie: zappa=AQ4; Path=/
Via: 1.1 031c38bec1e4f8401157e1d767a53637.cloudfront.net (CloudFront)
X-Amz-Cf-Id: wTr7B0ZTGHTY7GRm-Uw2IY76pcPhc8Bnb7wMLiDhFR9wwd4XFuWd5w==
X-Cache: Miss from cloudfront
x-amzn-Remapped-Content-Length: 19
x-amzn-RequestId: 6077b516-9695-11e6-a92a-657b31009f49

This is the index!

real    0m0.771s
user    0m0.264s
sys     0m0.168s

The "API" implementation is a simple print statement so it's not worth showing, but here are my Zappa settings:

{
    "dev": {
        "project_name": "myapi",
        "app_function": "myapi.app",
        "manage_roles": false,
        "role_name": "ZappaLambdaExecution",
        "s3_bucket": "dev-myapi-bucket",
        "domain": "myapi.com",
        "lets_encrypt_key": "s3://mybucket/account.key",
        "lets_encrypt_expression": "rate(2 days)",
        "keep_warm": true,
        "keep_warm_expression": "rate(1 minute)",
        "timeout_seconds": 69
    }
}

^ I increased the frequency on the keep_warm_expression key to see if it helps, but it doesn't seem to.

I don't see anything unusual in CloudWatch either; all of the log lines look like this:

REPORT RequestId: dde080d0-9695-11e6-9593-ab204f2952ay  Duration: 0.62 ms   Billed Duration: 100 ms Memory Size: 512 MB Max Memory Used: 21 MB

any ideas where to look for more info?

smoll commented 7 years ago

I've been playing around with frameworkless Lambda + API Gateway and I'm noticing the exact same behavior when I have a standalone Lambda that I never pre-warm.

This makes me wonder whether my Zappa-managed keep_warm is doing anything at all. Has anyone here actually verified the keep_warm is correctly keeping their function warm? If so, I'll try to look into what might be different about my configuration. If not, I'll dig more into the Zappa source code around it.
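In the meantime, a crude probe can time sequential requests and label each one as a likely cold or warm start. This is just a sketch for gathering data: the URL is a placeholder, and the 2-second threshold is an assumption based on the response times shown above.

```python
# Sketch: label request latencies as probable cold or warm starts.
# COLD_THRESHOLD and the URL are assumptions; tune for your own API.
import time
import urllib.request

COLD_THRESHOLD = 2.0  # seconds; anything slower is probably a cold start

def classify(latency, threshold=COLD_THRESHOLD):
    """Label a single request latency (in seconds) as 'cold' or 'warm'."""
    return "cold" if latency >= threshold else "warm"

def probe(url, n=5, pause=1.0):
    """Time n sequential GETs against the endpoint and label each one."""
    results = []
    for _ in range(n):
        start = time.perf_counter()
        urllib.request.urlopen(url).read()
        elapsed = time.perf_counter() - start
        results.append((elapsed, classify(elapsed)))
        time.sleep(pause)
    return results
```

Running `probe` right after a quiet period should show a "cold" first entry followed by "warm" ones if the container is being reused.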

Please chime in if you have any relevant data to share!

Edit: I must have glossed over this before, but I do see lines like this in CloudWatch every minute, so the event is definitely firing. It just seems to have no effect on the startup time of the function:

[DEBUG] 2016-10-23T09:24:53.743Z    8e3aa106-9902-11e6-918a-c988a846fd49    Zappa Event: {u'account': u'123456789012', u'region': u'us-east-1', u'detail': {}, u'detail-type': u'Scheduled Event', u'source': u'aws.events', u'version': u'0', u'time': u'2016-10-23T09:24:38Z', u'id': u'a85fe08b-add6-4fd4-bcda-d32d14eadc76', u'resources': [u'arn:aws:events:us-east-1:123456789012:rule/myproject-zappa-keep-warm-handler.keep_warm_callback']}
jnlaia commented 7 years ago

I am having similar issues.

Below (the left axis is milliseconds), a keep-warm with a 3-minute rate seems to produce spikes every three hours or so, which is interesting/weird by itself. Also, when I actually make a manual request (around 10:45), I still get a high response time.

image

Does anyone have a clue on what is going on?

Thanks

jnlaia commented 7 years ago

Some more feedback on this. It actually seems that Zappa's keep-warm is not doing anything. However, if one warms up the webapp by making an actual request via API Gateway (for example, from another service), it is possible to keep the webapp warm. So it seems API Gateway is playing a role here.

wagmiwiz commented 7 years ago

I am seeing this too.

We have one API that has irregular but bursty traffic: not much happens, then suddenly many requests arrive at once. If you run the burst batches one after the other, the second batch is super fast, but in the first batch many requests take several seconds. I noticed it's due to several Lambdas spinning up ("Instancing..." in the logs), and that takes 2-5 seconds.

One wild guess is that keep_warm only keeps one Lambda warm, and a burst obviously requires concurrency... or it is indeed something to do with how Lambda routes the event depending on whether it comes from API Gateway or the scheduler. 🤔
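If keep_warm really does warm only a single container, one workaround people try is firing several overlapping requests so that Lambda has to provision multiple containers in parallel. A rough sketch, where the URL and concurrency are placeholders, and which only helps if the requests actually overlap in flight:

```python
# Sketch: fire N overlapping requests so Lambda provisions N containers.
# Only effective if each request is still in flight when the next arrives.
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def warm_one(url):
    """One warming request; returns the HTTP status code."""
    with urllib.request.urlopen(url) as resp:
        return resp.status

def warm_many(url, concurrency=10, request=warm_one):
    """Issue `concurrency` requests at once and collect their results."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(request, [url] * concurrency))
```

Scheduling something like `warm_many("https://abcd.execute-api.us-east-1.amazonaws.com/dev/", 10)` ahead of an expected burst would, in theory, pre-warm ten containers instead of one.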

danielwhatmuff commented 7 years ago

Might be worth using https://uptimerobot.com/ or https://updown.io to send a request via API Gateway as an additional keep warm. I do that and haven't seen this behaviour.

jleclanche commented 6 years ago

So, today I investigated a long-standing issue we had with one of our Zappa-powered Lambdas. The Lambda is configured with a 30-second timeout and is expected to have an average runtime of 500 ms.

image

This doesn't look right: we're getting a ton of 30-second max durations, and the graph on the right shows increased timeouts at peak time.

I had assumed the timeouts were related to some kind of database or external-API contention, but I found that they all occurred during cold starts. Some of the Lambda machines needed up to 40 seconds to complete a cold start; if they hit the timeout, they would get stuck permanently retrying the cold start and eventually give up and be rotated out. Increasing the default timeout from 30 seconds to 45 seconds actually fixed the issue. See the red line here:

image

Now, the real question is why a cold start takes so long. I profiled an invocation that took 22 seconds to complete. Of those 22 seconds, the following two lines of code (in zappa.ext.django_zappa.get_django_wsgi) took 21 seconds to run:

from django.core.wsgi import get_wsgi_application  # 6500ms
...
get_wsgi_application()  # 14500ms

This is a 128 MB Lambda (Python 3.6 runtime), so it has the weakest CPU. I expected cold starts to be bad, but this is pretty brutal. I strace'd those two calls locally and nothing really stands out. Most of the time is spent iterating over all the dependencies (and there really aren't that many: django, django-allauth, django-cors-headers, oauth-toolkit, djangorestframework, psycopg2, pyjwt and raven). Most of that time is spent opening Zappa itself and its dependencies; psycopg2 is pretty bad as well. But on my home CPU, all this takes 300 ms. I could understand it being one order of magnitude slower on the weakest Lambda, but two is really confusing. Unfortunately, running strace on Lambda isn't exactly easy, so I'm not completely sure what to do next to improve the situation.

Hope this helps someone.

wagmiwiz commented 6 years ago

This is really good information, kudos! I suspect it is probably something to do with this dependency loading, since I noticed our cold starts are slower for Lambdas where we have more dependencies/larger deploy files (a factor of 2 to 3 from an API that mostly has django, django rest framework and psycopg2 to one that also has numpy). For now we are using the brute force of a higher-CPU Lambda to bring it down to a few seconds so it doesn't affect our users as much, but it's obviously not a long-term solution.

wagmiwiz commented 6 years ago

P.S. How did you profile the Lambda to put a number on those two lines?

jleclanche commented 6 years ago

I just added some timing calls in Zappa and deployed that instead.
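For reference, the timing calls can be as simple as wrapping the suspect statements with time.perf_counter(). A sketch of the kind of instrumentation meant here (the helper name and log labels are illustrative, not actual Zappa code):

```python
# Sketch: minimal timing wrapper to measure how long individual calls take
# inside a deployed Lambda, logged so the numbers show up in CloudWatch.
import logging
import time

logger = logging.getLogger(__name__)

def timed(label, fn, *args, **kwargs):
    """Run fn(*args, **kwargs), log how long it took, and return its result."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    logger.warning("%s took %.0f ms", label, (time.perf_counter() - start) * 1000)
    return result

# Usage (roughly what get_django_wsgi would look like instrumented):
#   import importlib
#   module = timed("import django.core.wsgi", importlib.import_module, "django.core.wsgi")
#   app = timed("get_wsgi_application", module.get_wsgi_application)
```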

jakul commented 5 years ago

I've done some work recently to reduce application startup times on cold start, and came up with the following conclusions. They might be specific to my application, but I'm posting them in case they can guide someone else:

jleclanche commented 5 years ago

> use of pkg_resources library

Can you elaborate on that? I usually use __version__ = pkg_resources.require(pkgname)[0].version to set my libraries' version and I'm suddenly realizing I never really looked at how expensive that call is.
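For what it's worth, pkg_resources builds the entire working set of installed distributions at import time, so that one-liner pays a full scan on every cold start. A sketch of a cheaper alternative on Python 3.8+ using the stdlib importlib.metadata, which reads only the one package's metadata (the timing printout will vary by environment):

```python
# Sketch: read an installed package's version without importing
# pkg_resources. importlib.metadata (stdlib since Python 3.8) avoids
# scanning the whole working set on import.
import time
from importlib.metadata import version, PackageNotFoundError

def version_via_importlib(pkgname):
    """Return the installed version string for pkgname, or None if absent."""
    try:
        return version(pkgname)
    except PackageNotFoundError:
        return None

# Time a single lookup; "pip" is just an example distribution name.
start = time.perf_counter()
v = version_via_importlib("pip")
print(v, f"({(time.perf_counter() - start) * 1000:.1f} ms)")
```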

jleclanche commented 5 years ago

> I also noticed that the bigger the lambda the quicker it can actually read files from the file system (which is done as part of the module import) - on 128MB the get_data took 2.5 seconds, on 512MB it took 0.57s and on 1024MB it took 0.32s

This is because every time you increase a Lambda's RAM you also bump up its CPU allocation; anything CPU-bound will speed up roughly linearly with it.
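A quick way to see this effect is to run a fixed CPU-bound workload in the handler and redeploy at different memory settings; per the proportional CPU allocation, a 128 MB function should take roughly 8x longer than a 1024 MB one on the same loop. A minimal sketch:

```python
# Sketch: a fixed CPU-bound workload to compare across Lambda memory sizes.
# Drop it into the handler and log the result at 128 / 512 / 1024 MB.
import time

def cpu_benchmark(iterations=2_000_000):
    """Burn a fixed amount of CPU work; return elapsed wall-clock seconds."""
    start = time.perf_counter()
    total = 0
    for i in range(iterations):
        total += i * i  # arbitrary arithmetic, enough to keep the CPU busy
    return time.perf_counter() - start
```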