elastic / apm-agent-nodejs

https://www.elastic.co/guide/en/apm/agent/nodejs/current/index.html
BSD 2-Clause "Simplified" License

Lambda layer cannot read agent intake request #2994

Open lageorgem opened 2 years ago

lageorgem commented 2 years ago

Describe the bug

I am experiencing a bug which is affecting our production environment: In a percentage of requests (about 2%), the lambda function that is sending APM data to the lambda layer receives the following error:

```
{"log.level":"error","@timestamp":"2022-10-19T08:28:24.326Z","log.origin":{"file.name":"apmproxy/receiver.go","file.line":119},"message":"Could not read agent intake request body: unexpected EOF","ecs.version":"1.6.0","log.origin.stack_trace":"github.com/elastic/apm-aws-lambda/apmproxy.(*Client).handleIntakeV2Events.func1\n\t/var/lib/jenkins/workspace/ibrary_apm-aws-lambda-mbp_v1.1.0/src/github.com/elastic/apm-aws-lambda/apmproxy/receiver.go:119\nnet/http.HandlerFunc.ServeHTTP\n\t/var/lib/jenkins/workspace/ibrary_apm-aws-lambda-mbp_v1.1.0/.gvm/versions/go1.19.linux.amd64/src/net/http/server.go:2109\nnet/http.(*ServeMux).ServeHTTP\n\t/var/lib/jenkins/workspace/ibrary_apm-aws-lambda-mbp_v1.1.0/.gvm/versions/go1.19.linux.amd64/src/net/http/server.go:2487\nnet/http.serverHandler.ServeHTTP\n\t/var/lib/jenkins/workspace/ibrary_apm-aws-lambda-mbp_v1.1.0/.gvm/versions/go1.19.linux.amd64/src/net/http/server.go:2947\nnet/http.(*conn).serve\n\t/var/lib/jenkins/workspace/ibrary_apm-aws-lambda-mbp_v1.1.0/.gvm/versions/go1.19.linux.amd64/src/net/http/server.go:1991"}
```

The same log entry, parsed into fields:

| Field | Value |
| --- | --- |
| @timestamp | 1666168104326 |
| ecs.version | 1.6.0 |
| log.level | error |
| log.origin.file.line | 119 |
| log.origin.file.name | apmproxy/receiver.go |
| message | Could not read agent intake request body: unexpected EOF |

To Reproduce

Steps to reproduce the behavior: This issue is not easily reproducible, as it happens in a small percentage of cases without a clear cause.

Expected behavior

The lambda layer receives the APM requests correctly

Environment (please complete the following information)

How are you starting the agent? (please tick one of the boxes)

  • [x] Calling agent.start() directly (e.g. require('elastic-apm-node').start(...))

Additional context

This issue has been reported before here; however, the source of the issue described in that bug is not relevant to my case.

Another interesting fact is that we have a suite of 6 lambda functions, all using the layer; however, only one of them encounters this issue.

Normally this should not be a problem, but while the lambda layer hangs, it forces the lambda function to remain active until the lambda timeout kills it. This proves problematic for us at peak-traffic times, when the extra open sockets overload our MongoDB connections. I have talked about this before here.

Agent config options:

  apm.start({
    environment: process.env.STAGE,
    captureBody: 'all',
    captureErrorLogStackTraces: 'always',
    spanStackTraceMinDuration: '0',
    usePathAsTransactionName: true,
    serviceName: 'TMv2 - Source',
    serviceVersion: pkg.version,
  });
lageorgem commented 2 years ago

Hello again,

After further investigation and playing around with lambda timeouts and agent timeouts, I've run into a very interesting issue: the APM agent stopped having these problems, while some aws-sqs internals started having problems with fetching some data from AWS. This leads me to believe that it may be a problem with too many sockets open in about 0.3% of cases.

While looking over the APM agent source code, I noticed that you're using elastic-apm-http-client, but the keepAlive configuration is not set. Could setting keep-alive allow the server to use fewer active sockets?
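For reference, here is a minimal, generic Node.js sketch (Node core only, not the elastic-apm-http-client internals) of what I mean by keep-alive: an http.Agent created with keepAlive: true reuses TCP sockets across requests instead of opening a new one each time. The host, port, and socket cap below are arbitrary illustration values.

```js
// Generic illustration of HTTP keep-alive in Node.js core (not the
// elastic-apm-http-client implementation). Sockets are reused across
// requests instead of being opened and closed per request.
const http = require('http');

const keepAliveAgent = new http.Agent({
  keepAlive: true, // keep idle sockets open for reuse
  maxSockets: 10,  // arbitrary cap on concurrent sockets per host
});

// Example request through the keep-alive agent (placeholder host/port).
http.get(
  { host: 'localhost', port: 8200, path: '/', agent: keepAliveAgent },
  (res) => {
    res.resume(); // drain the body so the socket returns to the pool
  }
);
```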

trentm commented 2 years ago

[Moving discussion from https://github.com/elastic/apm-aws-lambda/issues/333]

This leads me to believe that it may be a problem with too many sockets open in about 0.3% of cases.

Running out of sockets in the Lambda container sounds plausible.

While looking over the APM agent source code, I noticed that you're using elastic-apm-http-client, but the keepAlive configuration is not set. Could setting keep-alive allow the server to use fewer active sockets?

The keepAlive option to elastic-apm-http-client's Client object defaults to true, so I would expect keep-alive to be in use here. I could be missing something, though. Did you notice anything beyond that the keepAlive option is not passed in to Client creation?

trentm commented 2 years ago
  • [x] Calling agent.start() directly (e.g. require('elastic-apm-node').start(...))

...

Agent config options:

  apm.start({
    environment: process.env.STAGE,
    captureBody: 'all',
    captureErrorLogStackTraces: 'always',
    spanStackTraceMinDuration: '0',
    usePathAsTransactionName: true,
    serviceName: 'TMv2 - Source',
    serviceVersion: pkg.version,
  });

Another question for you. You checked this option above and showed those config options. Are you actually explicitly calling apm.start(...) in your Lambda code? The expected way to configure the Elastic APM Node.js agent for Lambda is to start the agent via NODE_OPTIONS = '-r elastic-apm-node/start'. If you are doing so differently, there might be some different behaviour that I wouldn't expect.

trentm commented 2 years ago

while some aws-sqs internals started having problems with fetching some data from AWS.

Are you able to quote what some of those error messages are?

Another interesting fact is that we have a suite of 6 lambda functions, all using the layer; however, only one of them encounters this issue.

What does the Lambda that encounters this issue basically do? Can you outline how it works, what services it uses? Are there any things about this Lambda that notably differ from the other 5?

trentm commented 2 years ago

This leads me to believe that it may be a problem with too many sockets open in about 0.3% of cases.

IIUC the limit will be ~1024 file descriptors (https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html). If hitting socket limits, I'd expect to see error messages about "too many open files".
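If it helps to test that hypothesis, a rough diagnostic sketch (Linux-only, and not something the agent itself does) is to count the entries in /proc/self/fd from inside the function:

```js
// Diagnostic sketch: count the file descriptors held by the runtime
// process. Lambda runs on Linux, so /proc/self/fd is available. If this
// number approaches the ~1024 limit, socket exhaustion becomes plausible.
const fs = require('fs');

function openFdCount() {
  return fs.readdirSync('/proc/self/fd').length;
}

console.log('open file descriptors:', openFdCount());
```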

lageorgem commented 2 years ago

Another question for you. You checked this option above and showed those config options. Are you actually explicitly calling apm.start(...) in your Lambda code? The expected way to configure the Elastic APM Node.js agent for Lambda is to start the agent via NODE_OPTIONS = '-r elastic-apm-node/start'. If you are doing so differently, there might be some different behaviour that I wouldn't expect.

I am using apm.start() directly in the code. If I am not mistaken, that option is required when including the agent lambda layer, but I am using the npm package directly. I have not had issues with that so far.

Are you able to quote what some of those error messages are?

The AWS SQS error, which appears randomly, right now at a rate of about 1%:

```
TypeError: Cannot read properties of undefined (reading '#text')
    at /var/task/node_modules/@aws-sdk/client-sqs/dist-cjs/protocols/Aws_query.js:2269:30
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async deserializeAws_querySendMessageCommandError (/var/task/node_modules/@aws-sdk/client-sqs/dist-cjs/protocols/Aws_query.js:827:15)
    at async /var/task/node_modules/@aws-sdk/middleware-serde/dist-cjs/deserializerMiddleware.js:7:24
    at async /var/task/node_modules/@aws-sdk/middleware-signing/dist-cjs/middleware.js:11:20
    at async StandardRetryStrategy.retry (/var/task/node_modules/@aws-sdk/middleware-retry/dist-cjs/StandardRetryStrategy.js:51:46)
    at async /var/task/node_modules/@aws-sdk/middleware-sdk-sqs/dist-cjs/send-message.js:6:18
    at async /var/task/node_modules/@aws-sdk/middleware-logger/dist-cjs/loggerMiddleware.js:6:22
```

Might this be related to the agent instrumentation breaking it during its failures? Most interestingly, we either get the APM layer errors or the SQS errors, but never both together.

We've also had this error pop up periodically: APM Server transport error (ECONNRESET): socket hang up

What does the Lambda that encounters this issue basically do? Can you outline how it works, what services it uses? Are there any things about this Lambda that notably differ from the other 5?

It's part of a 5-step data processing pipeline, with the first step right now doing mostly filtering. Thinking about it, this lambda doesn't make many API calls or Mongo queries right now. Its average duration is about 75ms, so it's not heavy at all.

The interesting thing is we had a bit of a hiccup across all the services, encountering the unexpected EOF and timeout errors for a short period of time, with the elastic cluster not reporting anything unusual in terms of throughput. This appeared to resolve on its own; however, the lambda in question did not recover and is still experiencing these errors. Several redeployments have been done, including trying out an older layer version, so I think it's safe to say both the lambda and its layer have been restarted several times.

trentm commented 2 years ago

I am using apm.start() directly in the code. If I am not mistaken, that option is required when including the agent lambda layer,

You should be able to:

  1. Start the agent early in the Lambda runtime process via NODE_OPTIONS = '-r elastic-apm-node/start', and then
  2. get access to the already started APM agent in your handler code via const apm = require('elastic-apm-node'). Importantly this second step does not re-attempt to .start() the agent.

If you have a Lambda using the 'elastic-apm-node' agent, but not using NODE_OPTIONS=... to start it, then that is a scenario we certainly aren't testing.
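For illustration, a minimal handler under that setup might look like the sketch below (assuming NODE_OPTIONS and the usual ELASTIC_APM_* environment variables are set on the function; the setLabel call is just an optional example of using the already-started agent):

```js
// handler.js — sketch assuming the agent is started by the environment,
// e.g. NODE_OPTIONS='-r elastic-apm-node/start' plus ELASTIC_APM_* vars
// for the Lambda extension. No apm.start() and no apm.lambda() wrapper.
const apm = require('elastic-apm-node'); // returns the already-started agent

exports.handler = async (event) => {
  // Optional: enrich the automatically created transaction.
  apm.setLabel('records', String(event.Records ? event.Records.length : 0));

  // ... your existing handler logic ...
  return { statusCode: 200 };
};
```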

Oh, are you manually wrapping your handler function with apm.lambda(...)? If so, that should no longer be necessary. We improved elastic-apm-node to know how to automatically wrap your handler function.

Anyway, I'm not saying this is related to your issue. It doesn't sound like it is.

```
TypeError: Cannot read properties of undefined (reading '#text')
    at /var/task/node_modules/@aws-sdk/client-sqs/dist-cjs/protocols/Aws_query.js:2269:30
...
```

Might this be related to the agent instrumentation breaking it during its failures?

This doesn't look like it would be due to something from the APM agent. That looks like it is an error parsing some response body. The APM agent doesn't currently instrument @aws-sdk/client-sqs.

lageorgem commented 2 years ago

Hello,

I proceeded as you instructed: I switched to using the APM agent layer, removed all .start() calls, and removed the .lambda() wrapper. The APM issues stopped; however, I received an ever-increasing number of aws-sdk errors. Also, it appears that memory usage increased from an average of 300MB to about 1300MB, and I had to raise the lambda memory to 2048MB for it to work properly.

The aws-sdk issue is indeed from parsing a response body; however, it does seem to be correlated with the presence of the lambda layer. Could it be some sort of network issue? A memory leak? It is certainly causing problems in our production environment, so we've had to disable APM temporarily.

lageorgem commented 2 years ago

Hello again!

We have identified the issue as being related to an SQS payload exceeding the allowed limit of 256KB. The initial error I reported was likely caused by a large increase in errors that needed to be reported to Elastic, together with the fact that our setup was using apm.start() and not the APM Agent layer.
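For anyone hitting the same limit, here is a hedged sketch of the kind of guard that would have caught this earlier (a hypothetical helper, not our production code): check the serialized body size before calling SendMessage, since SQS rejects payloads over 256 KiB.

```js
// Hypothetical guard (illustration only, not our production code): refuse
// to send SQS messages whose serialized body exceeds the 256 KiB limit.
const { SQSClient, SendMessageCommand } = require('@aws-sdk/client-sqs');

const MAX_SQS_BYTES = 256 * 1024; // message attributes count toward this too
const sqs = new SQSClient({});

async function sendIfSmallEnough(queueUrl, payload) {
  const body = JSON.stringify(payload);
  const size = Buffer.byteLength(body, 'utf8');
  if (size > MAX_SQS_BYTES) {
    // e.g. store the payload in S3 and enqueue a pointer, or split it up
    throw new Error(`SQS payload too large: ${size} bytes`);
  }
  return sqs.send(new SendMessageCommand({ QueueUrl: queueUrl, MessageBody: body }));
}
```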

After the internal issue was resolved and the APM setup was changed as @trentm suggested, everything works as expected.

Thanks for the help!

trentm commented 2 years ago

@lageorgem Thanks very much for getting back with the details. I'm glad that is resolved for you.

I hope the increase in memory usage wasn't all just due to the APM agent gathering and sending error data.

lageorgem commented 2 years ago

@trentm I did in fact validate this, and after disabling APM completely, the increase in memory remained the same. It seems that there was a very large object being loaded, which is the very same object that caused issues with SQS's limits.

trentm commented 2 years ago

@lageorgem Thanks again for getting back with the details!

lageorgem commented 1 year ago

Hello,

Reopening this as we're still experiencing some issues with this error. Right now I am on elastic-apm-extension v1.3 and elastic-apm-node v3.43.0, and after deploying a high-throughput lambda function with APM, I am occasionally receiving these errors:

{"log.level":"error","@timestamp":"2023-05-09T19:48:59.654Z","log.origin":{"file.name":"apmproxy/receiver.go","file.line":123},"message":"Could not read agent intake request body: read tcp 127.0.0.1:8200->127.0.0.1:58234: i/o timeout","ecs.version":"1.6.0","log.origin.stack_trace":"github.com/elastic/apm-aws-lambda/apmproxy.(*Client).handleIntakeV2Events.func1\n\tgithub.com/elastic/apm-aws-lambda/apmproxy/receiver.go:123\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2109\nnet/http.(*ServeMux).ServeHTTP\n\tnet/http/server.go:2487\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2947\nnet/http.(*conn).serve\n\tnet/http/server.go:1991"}

The APM server appears healthy and didn't report any downtime. While this occurs, it looks like the lambda function is kept alive, which causes an increase in concurrent executions, which in turn makes the application reach the MongoDB limit of 16k connections. Because the lambda functions are kept alive, more concurrent executions mean more connections. This is also affecting cost, since lambdas live until their timeout and cost accrues while the lambda is kept alive.
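For context, the usual Lambda-side mitigation for this kind of connection pressure (a general pattern sketched below with placeholder names, not our actual code) is to cache one MongoClient per container and cap its pool; that only bounds the growth per concurrent execution, it doesn't stop executions from piling up while the layer hangs.

```js
// General Lambda pattern (sketch with placeholder env var and db names):
// reuse a single MongoClient across warm invocations and cap its pool so
// each concurrent execution holds a bounded number of connections.
const { MongoClient } = require('mongodb');

let clientPromise; // survives across invocations of the same warm container

function getClient() {
  if (!clientPromise) {
    clientPromise = new MongoClient(process.env.MONGODB_URI, {
      maxPoolSize: 5, // arbitrary cap for illustration
    }).connect();
  }
  return clientPromise;
}

exports.handler = async () => {
  const client = await getClient();
  return client.db('app').collection('items').countDocuments();
};
```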

Also see: https://github.com/elastic/apm-aws-lambda/issues/130, which is the same issue

trentm commented 1 year ago

Hi @lageorgem,

(I'm a little worried this'll get lost with the above detail. It might be worthwhile opening a new issue.)

I think ideally the Lambda extension (where the "Could not read agent intake request body" log message is coming from) should "defend" against this "i/o timeout" by having a shorter timeout on intake requests from the APM agent. Perhaps that timeout could only be started after the Lambda invocation has ended. I wonder if there is some timing issue/bug/race in the APM agent that is making this intake request to the extension not complete before it stops processing.

While this occurs, it looks like the lambda function is kept alive

Do you know about how long the lambda function is kept alive? It might shed light on where this timeout is defined.

lageorgem commented 1 year ago

Hello,

Looks like the timeout is simply the lambda timeout, which kills the execution after 30 seconds. The ELASTIC_APM_API_REQUEST_TIME value is not set, so it's expected to be the 10s default, while ELASTIC_APM_SERVER_TIMEOUT is set to 15s.

Although on further inspection, it seems like that error is more of a one-off, and the more prevalent error is this one:

{"log.level":"error","@timestamp":"2023-05-10T11:28:26.942Z","log.origin":{"file.name":"apmproxy/apmserver.go","file.line":110},"message":"Error sending to APM server, skipping: failed to post to APM server: Post \"https://[MY_ID].apm.eastus.azure.elastic-cloud.com:443/intake/v2/events\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","ecs.version":"1.6.0","log.origin.stack_trace":"github.com/elastic/apm-aws-lambda/apmproxy.(*Client).FlushAPMData\n\tgithub.com/elastic/apm-aws-lambda/apmproxy/apmserver.go:110\ngithub.com/elastic/apm-aws-lambda/app.(*App).Run\n\tgithub.com/elastic/apm-aws-lambda/app/run.go:107\nmain.main\n\tgithub.com/elastic/apm-aws-lambda/main.go:71\nruntime.main\n\truntime/proc.go:250"}

Which is still weird, since it does suggest a timeout is being reached, and thus the lambda should die earlier than the 30 seconds set as the lambda timeout.

One additional factor has to do with the APM server itself. We are using Elastic Cloud, and it seems that an increase in traffic causes issues, even though the APM server is fairly idle. See the image below.

The APM server advertises that it's healthy, but saving events is timing out. Usually the fix for this is a quick restart from the Elastic Cloud console. I have discussed this with support, but they only suggested it may be caused by the available resources and that I should scale up. Scaling up did not help :D

Screenshot 2023-05-10 at 14 35 01