lageorgem opened 2 years ago
Hello again,
After further investigation and playing around with Lambda timeouts and agent timeouts, I've reached a very interesting issue: the APM agent stopped having these problems, while some aws-sqs internals started having problems fetching some data from AWS. This leads me to believe that it may be a problem with too many sockets open in about 0.3% of cases.
While looking over the APM agent source code, I did notice that you're using `elastic-apm-http-client`, but the `keepAlive` configuration is not set. Could it be possible that setting keep-alive could allow the server to use fewer active sockets?
[Moving discussion from https://github.com/elastic/apm-aws-lambda/issues/333]
> This leads me to believe that it may be a problem with too many sockets open in about 0.3% of cases.
Running out of sockets in the Lambda container sounds plausible.
> While looking over the APM agent source code, I did notice that you're using `elastic-apm-http-client`, but the `keepAlive` configuration is not set. Could it be possible that setting keep-alive could allow the server to use fewer active sockets?

The `keepAlive` option to elastic-apm-http-client's `Client` object defaults to true, so I would expect keep-alive to be in use here. I could be missing something, though. Did you notice anything beyond that the `keepAlive` option is not passed in to `Client` creation?
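For illustration, here's a minimal sketch of what keep-alive buys, using only Node core APIs (this is not the APM client's actual wiring; the local server, port, and request shape are made up for the demo):

```js
// Sketch: a keep-alive http.Agent holds idle sockets open so back-to-back
// requests reuse one connection instead of opening a new socket each time.
const http = require('http');

const server = http.createServer((req, res) => res.end('ok'));
server.listen(0, () => {
  const { port } = server.address();
  const agent = new http.Agent({ keepAlive: true });

  const get = () =>
    new Promise((resolve) =>
      http.get({ port, path: '/', agent }, (res) => {
        const localPort = res.socket.localPort; // identifies the client socket
        res.resume();
        res.on('end', () => resolve(localPort));
      })
    );

  // With keepAlive, both requests should come from the same local port,
  // i.e. the same underlying socket.
  get().then((first) =>
    get().then((second) => {
      console.log('socket reused:', first === second);
      agent.destroy();
      server.close();
    })
  );
});
```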
> - [x] Calling `agent.start()` directly (e.g. `require('elastic-apm-node').start(...)`) ...
>
> Agent config options:
>
> ```js
> apm.start({
>   environment: process.env.STAGE,
>   captureBody: 'all',
>   captureErrorLogStackTraces: 'always',
>   spanStackTraceMinDuration: '0',
>   usePathAsTransactionName: true,
>   serviceName: 'TMv2 - Source',
>   serviceVersion: pkg.version,
> });
> ```
Another question for you. You checked this option above and showed those config options. Are you actually explicitly calling `apm.start(...)` in your Lambda code? The expected way to configure the Elastic APM Node.js agent for Lambda is to start the agent via `NODE_OPTIONS = '-r elastic-apm-node/start'`. If you are doing so differently, there might be some different behaviour that I wouldn't expect.
> while some aws-sqs internals started having problems fetching some data from AWS.
Are you able to quote what some of those error messages are?
> Another interesting fact is that we have a suite of 6 lambda functions, all using the layer, however only one of them encounters this issue.
What does the Lambda that encounters this issue basically do? Can you outline how it works, what services it uses? Are there any things about this Lambda that notably differs from the other 5?
> This leads me to believe that it may be a problem with too many sockets open in about 0.3% of cases.
IIUC the limit will be ~1024 file descriptors (https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html). If hitting socket limits, I'd expect to see error messages about "too many open files".
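If socket exhaustion is the suspicion, one quick way to observe it from inside the function is a debugging sketch like the following. It only inspects Node's default global agents; the APM agent and the AWS SDK may use their own agents, so treat it as a rough signal rather than a complete count:

```js
// Debugging sketch: log how many sockets the default HTTP/HTTPS agents
// currently hold. A number that grows across warm invocations would point
// toward socket/file-descriptor exhaustion.
const http = require('http');
const https = require('https');

function countSockets(agent) {
  // agent.sockets maps "host:port" keys to arrays of in-use sockets.
  return Object.values(agent.sockets).reduce((n, list) => n + list.length, 0);
}

console.log('open sockets', {
  http: countSockets(http.globalAgent),
  https: countSockets(https.globalAgent),
});
```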
> Another question for you. You checked this option above and showed those config options. Are you actually explicitly calling `apm.start(...)` in your Lambda code? The expected way to configure the Elastic APM Node.js agent for Lambda is to start the agent via `NODE_OPTIONS = '-r elastic-apm-node/start'`. If you are doing so differently, there might be some different behaviour that I wouldn't expect.
I am using `apm.start()` directly in the code. If I am not mistaken, that option is required when you include the agent Lambda layer, but I am using the npm package directly. I have not had issues with that so far.
> Are you able to quote what some of those error messages are?
The AWS SQS error, which appears randomly, right now at a rate of about 1%:

```
Cannot read properties of undefined (reading '#text')
errorType | TypeError
stack.0 | TypeError: Cannot read properties of undefined (reading '#text')
stack.1 | at /var/task/node_modules/@aws-sdk/client-sqs/dist-cjs/protocols/Aws_query.js:2269:30
stack.2 | at runMicrotasks (<anonymous>)
stack.3 | at processTicksAndRejections (node:internal/process/task_queues:96:5)
stack.4 | at async deserializeAws_querySendMessageCommandError (/var/task/node_modules/@aws-sdk/client-sqs/dist-cjs/protocols/Aws_query.js:827:15)
stack.5 | at async /var/task/node_modules/@aws-sdk/middleware-serde/dist-cjs/deserializerMiddleware.js:7:24
stack.6 | at async /var/task/node_modules/@aws-sdk/middleware-signing/dist-cjs/middleware.js:11:20
stack.7 | at async StandardRetryStrategy.retry (/var/task/node_modules/@aws-sdk/middleware-retry/dist-cjs/StandardRetryStrategy.js:51:46)
stack.8 | at async /var/task/node_modules/@aws-sdk/middleware-sdk-sqs/dist-cjs/send-message.js:6:18
stack.9 | at async /var/task/node_modules/@aws-sdk/middleware-logger/dist-cjs/loggerMiddleware.js:6:22
```
Might this be related to the agent instrumentation breaking it during its failures? Most interestingly, we either get the APM layer errors or the SQS errors, but never both together.
We've also had this error pop up periodically:
```
APM Server transport error (ECONNRESET): socket hang up
```
> What does the Lambda that encounters this issue basically do? Can you outline how it works, what services it uses? Are there any things about this Lambda that notably differs from the other 5?
It's part of a 5-step data processing pipeline, with the first step right now doing mostly filtering. Thinking about it, this Lambda currently doesn't make many API calls or Mongo queries. Its average duration is about 75 ms, so it's not heavy at all.
The interesting thing is that we had a bit of a hiccup on all the services, encountering the `unexpected EOF` and `timeout` errors for a short period of time, with the Elastic cluster not reporting anything unusual in terms of throughput. That appeared to resolve on its own; however, the Lambda in question did not recover and is still experiencing these errors.
Several redeployments have been done, including trying out an older layer version, so I think it's safe to say both the Lambda and its layer have been restarted several times.
> I am using `apm.start()` directly in the code. If I am not mistaken, that option is required when you include the agent Lambda layer,
You should be able to set `NODE_OPTIONS = '-r elastic-apm-node/start'` and then `const apm = require('elastic-apm-node')`. Importantly, this second step does not re-attempt to `.start()` the agent.

If you have a Lambda using the 'elastic-apm-node' agent, but not using `NODE_OPTIONS=...` to start it, then that is a scenario we certainly aren't testing.
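For illustration, a minimal sketch of that setup (the handler name and body are hypothetical; only the `NODE_OPTIONS` value and the bare `require` come from the suggestion above):

```js
// handler.js — hypothetical Lambda handler.
// The agent is started by the Node runtime itself via the function's
// environment variable:
//   NODE_OPTIONS=-r elastic-apm-node/start
// so the code only requires the already-started agent and must NOT call
// .start() a second time.
const apm = require('elastic-apm-node');

exports.handler = async (event) => {
  // Optional: attach extra context to the current transaction.
  apm.setLabel('recordCount', String((event.Records || []).length));
  return { statusCode: 200 };
};
```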
Oh, are you manually wrapping your handler function with `apm.lambda(...)`? If so, that should no longer be necessary. We improved elastic-apm-node to know how to automatically wrap your handler function.
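For context, a sketch of the difference (handler bodies are hypothetical):

```js
// Before: manually wrapping the handler with apm.lambda(...) — per the
// comment above, this should no longer be necessary.
// const apm = require('elastic-apm-node');
// exports.handler = apm.lambda(async (event, context) => ({ statusCode: 200 }));

// Now: export a plain handler; when the agent is started via
// NODE_OPTIONS=-r elastic-apm-node/start it wraps the exported handler
// automatically.
exports.handler = async (event, context) => ({ statusCode: 200 });
```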
Anyway, I'm not saying this is related to your issue. It doesn't sound like it is.
> Cannot read properties of undefined (reading '#text') errorType | TypeError stack.0 | TypeError: Cannot read properties of undefined (reading '#text') stack.1 | at /var/task/node_modules/@aws-sdk/client-sqs/dist-cjs/protocols/Aws_query.js:2269:30 ...
> Might this be related to the agent instrumentation breaking it during its failures?
This doesn't look like it would be due to something from the APM agent. That looks like it is an error parsing some response body. The APM agent doesn't currently instrument `@aws-sdk/client-sqs`.
Hello,
I did proceed as you instructed: I switched to using the APM agent layer, removed all `.start()` calls, and removed the `.lambda()` wrapper. The APM issues stopped; however, I received an ever-increasing amount of aws-sdk errors. It also appears that memory usage increased from an average of 300 MB to about 1300 MB, and I had to raise the Lambda memory to 2048 MB in order for it to work properly.
The aws-sdk issue is indeed from parsing a response body; however, it does seem to be correlated with the Lambda layer being present. Could it be some sort of network issue? A memory leak? It is certainly an issue that's causing problems with our production environment, and thus we've had to disable APM temporarily.
Hello again!
We have identified the issue as being related to an SQS payload exceeding the allowed limit of 256 KB.
The initial error I reported was likely caused by a large increase in errors that needed to be reported to Elastic, together with the fact that our setup was using `apm.start()` and not the APM Agent layer.
After the internal issue has been resolved and the APM setup was changed as @trentm suggested, everything works as expected.
Thanks for the help!
@lageorgem Thanks very much for getting back with the details. I'm glad that is resolved for you.
I hope the increase in memory usage wasn't all just due to the APM agent gathering and sending error data.
@trentm I did in fact validate this, and after disabling APM completely, the increase in memory remained the same. It seems that there was a very large object being loaded, which is the very same object that caused issues with SQS's limits.
@lageorgem Thanks again for getting back with the details!
Hello,
Reopening this as we're still experiencing some issues with this error. Right now I am on elastic-apm-extension v1.3 and elastic-apm-node v3.43.0, and after I deployed a high-throughput Lambda function with APM, I am occasionally receiving these errors:
{"log.level":"error","@timestamp":"2023-05-09T19:48:59.654Z","log.origin":{"file.name":"apmproxy/receiver.go","file.line":123},"message":"Could not read agent intake request body: read tcp 127.0.0.1:8200->127.0.0.1:58234: i/o timeout","ecs.version":"1.6.0","log.origin.stack_trace":"github.com/elastic/apm-aws-lambda/apmproxy.(*Client).handleIntakeV2Events.func1\n\tgithub.com/elastic/apm-aws-lambda/apmproxy/receiver.go:123\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2109\nnet/http.(*ServeMux).ServeHTTP\n\tnet/http/server.go:2487\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2947\nnet/http.(*conn).serve\n\tnet/http/server.go:1991"}
The APM server appears healthy and didn't report any downtime. While this occurs, it looks like the Lambda function is kept alive, which causes an increase in concurrent executions, which in turn makes the application reach the MongoDB limit of 16k connections: because the Lambda functions are kept alive, more concurrent executions mean more connections. This also affects cost, since Lambdas live until their timeout and cost goes up while the Lambda is kept alive.
Also see: https://github.com/elastic/apm-aws-lambda/issues/130, which is the same issue
Hi @lageorgem,
(I'm a little worried this'll get lost with the above detail. It might be worthwhile opening a new issue.)
I think ideally the Lambda extension (where the "Could not read agent intake request body" log message is coming from) should "defend" against this "i/o timeout" by having a shorter timeout on intake requests from the APM agent. Perhaps that timeout could only be started after the Lambda invocation has ended. I wonder if there is some timing issue/bug/race in the APM agent that is making this intake request to the extension not complete before it stops processing.
> While this occurs, it looks like the Lambda function is kept alive
Do you know about how long the lambda function is kept alive? It might shed light on where this timeout is defined.
Hello,
Looks like the timeout is simply the Lambda timeout, which kills the execution after 30 seconds. The `ELASTIC_APM_API_REQUEST_TIME` value is not set, so it's expected to be the 10s default, while `ELASTIC_APM_SERVER_TIMEOUT` is set to 15s.
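For reference, a small sketch of the timeouts in play as described in this thread (whether the extension applies its own read deadline on intake requests is the open question above, not something this sketch asserts):

```js
// Timeouts involved in one invocation, per the values in this thread:
const API_REQUEST_TIME_S = 10;   // ELASTIC_APM_API_REQUEST_TIME (default): max age of one agent intake request
const APM_SERVER_TIMEOUT_S = 15; // ELASTIC_APM_SERVER_TIMEOUT: socket timeout for the agent's requests
                                 // (here, requests go to the local extension proxy)
const LAMBDA_TIMEOUT_S = 30;     // function timeout: hard kill of the whole execution

// If intake requests were bounded by the 10s/15s values, the agent should
// give up well before the 30s Lambda timeout, rather than the function
// being kept alive until the Lambda timeout kills it.
console.log({ API_REQUEST_TIME_S, APM_SERVER_TIMEOUT_S, LAMBDA_TIMEOUT_S });
```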
Although on further inspection, it seems that error is more of a one-off, and the more prevalent error is this one:
{"log.level":"error","@timestamp":"2023-05-10T11:28:26.942Z","log.origin":{"file.name":"apmproxy/apmserver.go","file.line":110},"message":"Error sending to APM server, skipping: failed to post to APM server: Post \"https://[MY_ID].apm.eastus.azure.elastic-cloud.com:443/intake/v2/events\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","ecs.version":"1.6.0","log.origin.stack_trace":"github.com/elastic/apm-aws-lambda/apmproxy.(*Client).FlushAPMData\n\tgithub.com/elastic/apm-aws-lambda/apmproxy/apmserver.go:110\ngithub.com/elastic/apm-aws-lambda/app.(*App).Run\n\tgithub.com/elastic/apm-aws-lambda/app/run.go:107\nmain.main\n\tgithub.com/elastic/apm-aws-lambda/main.go:71\nruntime.main\n\truntime/proc.go:250"}
Which is still weird, since it does suggest a timeout is being reached, and thus the Lambda should die earlier than the 30s set as the Lambda timeout.
One additional factor has to do with the APM server itself. We are using Elastic cloud, and it seems that an increase in traffic causes issues, even though the APM server is pretty idle. See image below.
The APM server seems to advertise it's healthy, but saving events is timing out. Usually the fix for this is a quick restart from the elastic cloud console. I have discussed this with support, but they only suggested it may be caused by the available resources, and that I should scale up. Scaling up did not help :D
Describe the bug
I am experiencing a bug that is affecting our production environment: in a percentage of requests (about 2%), the Lambda function that is sending APM data to the Lambda layer receives the following error:
To Reproduce
Steps to reproduce the behavior: This issue is not easily reproducible, as it happens in a small percentage of cases without a clear cause.
Expected behavior
The lambda layer receives the APM requests correctly
Environment (please complete the following information)
How are you starting the agent? (please tick one of the boxes)
- [x] Calling `agent.start()` directly (e.g. `require('elastic-apm-node').start(...)`)
- [ ] Requiring `elastic-apm-node/start` from within the source code
- [ ] Starting node with `-r elastic-apm-node/start`
Additional context
This issue has been reported before here; however, the source of the issue described in that bug is not relevant to my case.
Another interesting fact is that we have a suite of 6 lambda functions, all using the layer, however only one of them encounters this issue.
Normally this should not be a problem, but while the Lambda layer hangs, it forces the Lambda function to remain active until the Lambda timeout kills it, which proves problematic for us at peak-traffic times when the extra open sockets overload our MongoDB connections. I have talked about this before here.
Agent config options:
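(The config options, as quoted by trentm earlier in this thread:)

```js
apm.start({
  environment: process.env.STAGE,
  captureBody: 'all',
  captureErrorLogStackTraces: 'always',
  spanStackTraceMinDuration: '0',
  usePathAsTransactionName: true,
  serviceName: 'TMv2 - Source',
  serviceVersion: pkg.version,
});
```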