apex / up

Deploy infinitely scalable serverless apps, apis, and sites in seconds to AWS.
https://up.docs.apex.sh

"Error: write EPIPE" on lambda errors (it worsens with provisioned concurrency) #801

Open gagoar opened 4 years ago

gagoar commented 4 years ago

Prerequisites

Description

We have been using Up for our Lambda solution (an Express server) for a while now (~2 years), and we have been seeing errors like this (though not that frequently):

{"timestamp":1580983517551,"message":"ERROR\tUncaught Exception\t{\"errorType\":\"Error\",\"errorMessage\":\"write EPIPE\",\"code\":\"EPIPE\",\"errno\":\"EPIPE\",\"syscall\":\"write\",\"stack\":[\"Error: write EPIPE\",\"    at WriteWrap.afterWrite [as oncomplete] (net.js:789:14)\"]}","extractedFields":{"event":"ERROR\tUncaught Exception\t{\"errorType\":\"Error\",\"errorMessage\":\"write EPIPE\",\"code\":\"EPIPE\",\"errno\":\"EPIPE\",\"syscall\":\"write\",\"stack\":[\"Error: write EPIPE\",\"    at WriteWrap.afterWrite [as oncomplete]

To provide a bit more context, we also get an odd message showing up almost an hour after this error occurs (see the attached screenshot).

We decided to add provisioned concurrency and noticed that this error started happening more often.

Currently, this problem is returning 500s to some of our consumers.

Steps to Reproduce

Not sure I can describe this in a generic way, but we see it quite often, with or without provisioned concurrency.

Love Up?

We are currently using Up Pro.

tj commented 4 years ago

Hmm, the "waiting for ... to be in a listening state" timeout should only appear if your application isn't listening on PORT, even if it's only in some Lambdas. Do you connect to a database or similar before allowing connections?
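For reference, the pattern Up expects is roughly the sketch below. Names are illustrative and this is not code from this thread; the point is that anything slow awaited before .listen() can trip the listening-state timeout on a cold start.

  // Sketch (illustrative names): Up proxies requests to whatever is listening
  // on process.env.PORT, so .listen() must be reached quickly after startup.
  import express from "express";

  const app = express();
  const port = Number(process.env.PORT) || 3000; // PORT is injected by Up

  async function main(): Promise<void> {
    // e.g. a slow dependency; if this hangs, .listen() is never reached
    // await connectToDatabase();

    app.listen(port, () => {
      console.log(`listening on ${port}`);
    });
  }

  main();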

gagoar commented 4 years ago

No, we don't establish any database connection. This code only deals with connections (REST) when queried (a GraphQL query), so not at start time.

jnwng commented 4 years ago

Additional context: this Lambda is running in a VPC, and we have two scheduled benchmarking runs that occur every 10 and 30 minutes, respectively. Currently, that's the only source of traffic to the Lambda.

We're also running apollo-server-express in this Lambda.

tj commented 4 years ago

What do you have configured for the Lambda memory size? You probably already know this, but with lower memory limits such as 128 MB the CPU is also limited, so having a lot of require()s can impact cold starts.

15s is pretty long even in that case, though. I would start by throwing a few console.log()s in the app just to make sure it's reaching your HTTP server's .listen(PORT) call.
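A sketch of that checkpoint logging, assuming a plain Express app; the log messages and names are illustrative:

  // Sketch (illustrative): log checkpoints so CloudWatch shows how far startup
  // gets before Up's "waiting for ... to be in a listening state" timeout.
  import express from "express";

  console.log("boot: modules loaded", new Date().toISOString());

  const app = express();
  const port = Number(process.env.PORT) || 3000;

  console.log("boot: about to listen on", port);

  app.listen(port, () => {
    console.log("boot: listening on", port, new Date().toISOString());
  });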

jnwng commented 4 years ago

For the vast majority of invocations, we've been noticing roughly ~5s of initialization, so we had initially ruled that out. There's no harm in bumping it, though, even if only temporarily.

Re: getting to the listen(PORT) call, it is still quite strange that we see this intermittently, but given that our listen(PORT) call is guarded by a branch, I think it's entirely possible that we're just never getting to the listen:

  if (env.NODE_ENV !== NODE_ENV_TEST) {
    server = app.listen({ port: env.PORT }, (): void => {
      logger.info(`🚀 Server ready at http://localhost:${env.PORT}/playground`);
    });
  }

(where env.NODE_ENV is a filtered version of process.env validated by envalid)
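For context on that envalid step, the filtering typically looks something like the sketch below; the validator choices and defaults here are assumptions, not their actual config. If NODE_ENV cleans to the test value, the guard above skips app.listen() entirely.

  // Sketch (assumed shape, not their actual config): cleanEnv builds the
  // filtered `env` object the snippet above reads from.
  import { cleanEnv, str, port } from "envalid";

  const NODE_ENV_TEST = "test";

  const env = cleanEnv(process.env, {
    NODE_ENV: str({ choices: ["development", NODE_ENV_TEST, "production"] }),
    PORT: port({ default: 3000 }),
  });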

Will do both and report back.

tj commented 4 years ago

Interesting, yeah, it's really strange that it's intermittent. Regarding cold-start time, you could try a bundler; just eliminating the require()s alone seems to help quite a bit. Let me know if you find anything suspicious!
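One concrete way to try the bundler suggestion, as a sketch; esbuild is just an example bundler and the paths are assumptions:

  // build.ts — sketch: bundle the server into a single file so a cold start
  // doesn't have to walk node_modules for every require().
  import { build } from "esbuild";

  build({
    entryPoints: ["src/server.ts"], // assumed entry point
    bundle: true,
    platform: "node",
    target: "node12",
    outfile: "dist/server.js",
  }).catch(() => process.exit(1));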

gagoar commented 4 years ago

Hi @tj, as a follow-up: we upgraded the Lambda runtime to Node.js 12.x and the "Uncaught Exception" errors went away. We are still giving it until the end of the week to make sure we are in the clear, though.
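For anyone hitting the same thing: the runtime, and the memory size mentioned earlier, can be pinned in up.json. The snippet below is a sketch assuming the lambda settings described in the Up configuration docs; the app name is a placeholder.

  {
    "name": "my-app",
    "lambda": {
      "memory": 512,
      "runtime": "nodejs12.x"
    }
  }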

tj commented 4 years ago

Nice! Glad you found it, that sounds good. I'll close this in a few days if it seems like things are good to go.