@Qard let's timebox some investigation on this and see if we get to an easy fix.
Has there been any progress on this issue? I can submit our workaround as a PR for this, but I'm not sure that it's the best long-term solution.
I was unable to reproduce it in my initial testing. I did do some restructuring recently to support async/await style though, which has not yet been released. That might impact it.
Unfortunately it is still an issue; from my poking around, the cause is:
Our workaround is to track this error state and ensure that the transport is open before sending any transactions: https://github.com/status200/apm-agent-nodejs/commit/629cd9fb1e8bbfd3b6089c42791bf58dcadfee5d
Though it seems like a larger restructure to track the lifecycle and state of the transport would be more appropriate as this could also be an issue for clients running long enough to encounter network hiccups.
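For anyone skimming, the guard described above boils down to something like the following sketch. This is not the code from the linked commit; the property checks and the helper name are illustrative only:

// Illustrative sketch only: refuse to write to a transport whose underlying
// stream has already ended, instead of triggering "write after end".
function safeSendTransaction (client, transaction) {
  const state = client._writableState
  // The socket may have died while the Lambda VM was frozen; if the stream has
  // ended or been destroyed, skip (or queue) the write and recreate the client.
  if (client.destroyed || (state && state.ended)) {
    return false
  }
  client.sendTransaction(transaction)
  return true
}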
I'm able to reproduce the ECONNRESET, but the client just reconnects and seems to send the new transaction just fine. I'm not getting the write after end error at all.
Are you able to provide a code example that can reproduce the write after end?
For what it's worth, I got it working by editing the code from the docs. I'm on Node 8.10 on Lambda.
What the docs show, which caused "APM Server transport error: Error: write after end":
exports.handler = apm.lambda(function handler (payload, context, callback) {
callback(null, `Hello, ${payload.name}!`)
})
What worked for me:
exports.handler = apm.lambda(async (payload, context, callback) => {
callback(null, `Hello, ${payload.name}!`)
})
Sorry for the delay, @HuddleHouse is correct, the example from the docs still produces the issue.
There are two methods I can see which stop this:
// Pass promise to lambda
exports.handler = apm.lambda(async (payload, context, callback) => {
callback(null, `Hello, ${payload.name}!`)
})
// callbackWaitsForEmptyEventLoop
exports.handler = apm.lambda(function handler (payload, context, callback) {
context.callbackWaitsForEmptyEventLoop = false;
callback(null, `Hello, ${payload.name}!`)
})
Both of these seem to depend on Lambda's handling of the event loop, and I wouldn't like to count on this behaviour staying consistent over time.
I still get this error when using the latest apm in a lambda. I start apm like this:
exports.handler = apm.lambda(async (event, context, callback) => {
...
callback(null, `Successfully processed ${items} events`);
});
2019-04-15T10:48:16.268Z 4f47d833-61f4-4af4-b824-cf46e5d0880d APM Server transport error: Error: write after end
at writeAfterEnd (/var/task/node_modules/readable-stream/lib/_stream_writable.js:257:12)
at Client.Writable.write (/var/task/node_modules/readable-stream/lib/_stream_writable.js:301:21)
at Client.sendTransaction (/var/task/node_modules/elastic-apm-http-client/index.js:276:15)
at Instrumentation.addEndedTransaction (/var/task/node_modules/elastic-apm-node/lib/instrumentation/index.js:185:22)
at Instrumentation.wrappedAddEndedTransaction [as addEndedTransaction] (/var/task/node_modules/elastic-apm-node/lib/instrumentation/async-hooks.js:53:34)
at Transaction.end (/var/task/node_modules/elastic-apm-node/lib/instrumentation/transaction.js:218:32)
at captureAndMakeCompleter (/var/task/node_modules/elastic-apm-node/lib/lambda.js:48:11)
at wrappedLambdaCallback (/var/task/node_modules/elastic-apm-node/lib/lambda.js:60:20)
at exports.handler.apm.lambda (/var/task/index.js:38:9)
at
Yes, it's still a known issue. It's proving to be complicated to fix. Basically, when Lambda freezes the VM it disconnects any sockets still in use and, when the process wakes up again, it's unaware the sockets have been closed until after it tries to write to them. This results in the write after end happening when the next message is sent to the apm-server.
I've been thinking that we either need a way to force-close the underlying TCP socket after each lambda function execution (though that would increase the function running-time overhead), or we should somehow ignore those errors when running in a lambda context (though that might hide some real errors).
We should probably also update the docs to suggest that you disable metrics when running in Lambda, as the metrics collection will keep trying to reopen the TCP socket - and metrics inside a lambda don't make much sense anyway.
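For reference, a minimal sketch of what the "disable metrics" configuration could look like in a Lambda handler (the serverUrl/secretToken environment variables are placeholders):

// Sketch: start the agent with metrics collection disabled, so the agent stops
// periodically reopening the TCP socket just to ship metrics from the Lambda.
const apm = require('elastic-apm-node').start({
  serverUrl: process.env.ELASTIC_APM_SERVER_URL,      // placeholder
  secretToken: process.env.ELASTIC_APM_SECRET_TOKEN,  // placeholder
  metricsInterval: '0s' // '0s' disables metrics collection
})

exports.handler = apm.lambda(async (event, context) => {
  return { statusCode: 200, body: 'ok' }
})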
Hi all, so... the current situation - for everybody reading this issue (please correct me) - is:
Do NOT upgrade to/use Elastic Stack 7.x, because elastic-apm-node 1.x is not supported with APM Server 7.x.
If using elastic-apm-node in AWS Lambda, use Elastic Stack < 6.5.x and elastic-apm-node < 2.x.
Martin
Hi @watson,
We should probably also update the docs to suggest that you disable metrics when running in lambda as that will keep trying to reopen the TCP socket
We've tried disabling metrics by passing the metricsInterval config option but we still experience the issue. Do you have other suggestions we may try?
We're running server v7.0.0 with node client ^2.11.5.
Thank you so much for all the work here!
@willbrazil Setting metricsInterval: 0 will only reduce the likelihood that this error appears, not completely eliminate it. As it stands currently, you might see this error when the VM is being resumed after having been suspended by AWS Lambda. Unfortunately, we can't detect when this happens. However, all data being sent afterward, while the VM is still running, should make it to the APM Server successfully.
If your lambda function is only called once every 20 minutes, this might mean nothing or only very little data get through. But if your lambda function is called several times a minute or more, 99+% of the collected data should get through as expected.
Thank you for such a quick response @watson! Some of our lambdas are handling ~60tps.
We've reproduced this in a more controlled environment where we sent a request every second, and unfortunately we still experience the issue. We'll keep digging here. I know this is a tricky one, especially because it's hard to repro locally.
Thank you again for all the info!
I did a bit of investigation, and found a couple of issues.
The stack trace in https://github.com/elastic/apm-agent-nodejs/issues/827#issuecomment-483208921 is due to the client being ended in the transport in response to the beforeExit event:
Lambda apparently relies on this for scheduling multiple (i.e. warm) Lambda invocations. After the first cold invocation the client is ended, and then subsequent invocations will trigger the stack trace after attempting to write to an ended stream.
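In code, the pattern being described is roughly the following (a simplified sketch, not the agent's actual transport code; the client construction values are illustrative):

const Client = require('elastic-apm-http-client') // the transport used by the agent

const client = new Client({
  serviceName: 'example-service', // illustrative values
  agentName: 'nodejs',
  agentVersion: '0.0.0',
  userAgent: 'example-agent/0.0.0'
})

// Simplified sketch: on 'beforeExit' the client is ended. Lambda keeps the
// process alive between warm invocations, so after the first (cold) invocation
// the client has ended, and the next invocation writes to an ended stream,
// producing "write after end".
process.once('beforeExit', () => {
  client.end()
})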
Commenting that out, I encountered a second issue: intermittent 502 errors, with the response body of {"message": "Internal server error"}. It wasn't obvious where this was coming from (logs weren't helpful), but I eventually narrowed it down to invoking the Lambda callback function inside of agent.flush:
With some additional logging I could see that the flush was occurring after the next beforeExit. According to https://docs.aws.amazon.com/lambda/latest/dg/running-lambda-code.html:
Background processes or callbacks initiated by your Lambda function that did not complete when the function ended resume if AWS Lambda chooses to reuse the execution context. You should make sure any background processes or callbacks in your code are complete before the code exits.
I suppose the Lambda runtime is implicitly calling the callback with an empty body. If that is indeed the case, the required Lambda Proxy payload (statusCode, body) will be missing, which likely causes the 502 error.
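To make the sequencing concrete, here is a simplified sketch of the pattern (not the agent's actual lambda wrapper; it assumes the agent was already started elsewhere):

const apm = require('elastic-apm-node') // assumes the agent was started elsewhere

exports.handler = function handler (event, context, callback) {
  const response = { statusCode: 200, body: JSON.stringify({ ok: true }) }
  // The real Lambda callback is only invoked once the flush completes. If that
  // flush doesn't run until after the next 'beforeExit', Lambda has already
  // finished the invocation without a proper proxy response (statusCode/body),
  // and API Gateway answers with a 502 {"message": "Internal server error"}.
  apm.flush(function flushed () {
    callback(null, response)
  })
}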
Hi all,
Is there any progress on fixing this issue? Right now we are stuck on Elastic stack 6.5 and APM agent 1.14.5 and we would really like to use the latest releases.
Thanks.
We ended up writing our own integration to get around this issue and will be sticking with it regardless of how this issue plays out.
I think the current design is a problem for lambda as:
Our solution:
This makes APM message processing passive and isolated from the main system, and it does not impose networking requirements.
The recently released New Relic offering for Lambda also takes this approach, which has given us some more confidence in this decision. https://docs.newrelic.com/docs/serverless-function-monitoring/aws-lambda-monitoring/get-started/enable-new-relic-monitoring-aws-lambda
We've actually discussed the cloudwatch logs approach for exactly the same reasons you mentioned. It just hasn't yet been prioritized.
Thanks @davebond, that seems like a sensible approach. Is this something that is on the roadmap for the AWS lambda integration for APM @Qard?
There's no specific plan yet, just some discussion. We haven't had many lambda users yet, so it hasn't been prioritized very high so far. Seems like interest has been rising in the last few weeks though.
Same issue on the latest stack version 7.4.0 with elastic-apm-node ^3.0.0.
It would be a nice addition. In the meantime, I'm using CloudWatch Lambda triggers similar to @davebond's solution.
Hi @icarusmiles and @davebond.
Can you share some more details on how you have done this? Is it possible to make apm-agent-nodejs log the requests to the console so that they end up in CloudWatch, or have you implemented a custom APM agent? And the lambda pushing messages to the APM server - is that custom as well?
Cheers
Hi @tallakh
We opted to write our own APM layer for Node, a simple implementation based on async hooks. There were reasons for this other than this particular issue, so I wouldn't recommend it for everyone. We store APM messages in our own format until they hit our processing service, which translates them into Elastic's format and sends them off to Elastic using the elastic-apm-http-client package.
From my poking around in this APM package earlier in the year, I think the process for implementing this with existing packages would look something like:
elastic-apm-http-client package
Heh, clever. That could actually work pretty well. :)
Thank you @davebond, that was really helpful!
@davebond Would you mind sharing the code you used for your lambda and log transport? It could be helpful for other people that want to use the agent on lambda to have that sort of option.
@Qard @alvarolobato Any update on this? I might take a stab at implementing something with @davebond's approach of logging to CW and using a lambda function to subscribe to the CW log stream and send to the APM server.
I have a project at work where I'm using EL(Lambda, not Logstash haha)K for log aggregation/monitoring for our services; it would be nice to have Elastic APM integrated into that monitoring solution.
If I can get that working, I'd be happy to work with you guys to get APM working for Lambda again, if that's an approach you want to move towards.
@jonminter Thanks for the feedback. We would be happy for any assistance/code contribution and will be happy to assist in turn. We are working on a more extensive solution, but it will take some time; we'd be happy to chat about your needs. cc @basepi @nehaduggal
@lreuven Ok awesome! I’ll try to put together a proof of concept of what I was thinking, from looking at the code my thoughts were to create a separate transport class following the interface of the http client that writes out logs with the serialized APM data, then write a lambda function that picks up those messages and uses the http client package to send them to the APM server.
I don't have the code in front of me right now, but I think I remember some config validation that requires the server URL to be supplied, which wouldn't be used for this. For a POC I'd probably just leave that as is and ignore it in the AWS Lambda transport handler.
I'd also be curious as to what your plans are long term, and whether you were considering going down a different path?
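As a rough illustration of that idea, the producer side could look something like the sketch below. The class name and the set of send* methods are hypothetical, loosely mirroring elastic-apm-http-client; this is not an existing transport:

// Hypothetical "CloudWatch Logs transport": instead of opening a socket to the
// APM Server, serialize each event as one JSON log line. console.log output in
// Lambda lands in CloudWatch Logs, where a subscriber can pick it up later.
class CloudWatchLogTransport {
  _emit (type, payload) {
    console.log(JSON.stringify({ __apm: true, type, payload }))
  }
  sendTransaction (transaction, cb) { this._emit('transaction', transaction); if (cb) cb() }
  sendSpan (span, cb) { this._emit('span', span); if (cb) cb() }
  sendError (error, cb) { this._emit('error', error); if (cb) cb() }
  flush (cb) { if (cb) cb() } // nothing buffered, nothing to flush
  end (cb) { if (cb) cb() }   // no socket to close
}

module.exports = CloudWatchLogTransport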
@jonminter Nothing to share yet. From my research the lambda -> cloudwatch -> lambda -> elasticsearch is likely the most promising route. We just need to figure out how to smooth onboarding in that flow, which is a hard problem.
@basepi Yes, I know what you mean. With the original approach all the user has to do is include the package and wrap their lambda handler and they're good to go.
You could bundle some CloudFormation templates and/or a Terraform script that performs the deployment of the Lambda function/CloudWatch Logs subscription in the user's AWS account. That's still more steps than before, but it should help make the process smoother.
Any thoughts on the wisdom of the approach New Relic took (CloudWatch log ingestion, basically) versus writing a transport class that writes the serialized version of the message to EventBridge or SQS before the lambda returns, with a worker or a lambda picking up the serialized form and batch-sending it?
I suspect the performance penalty of this versus CW, which is highly optimized, is non-trivial (awaiting even a publish to EB or SQS probably takes much longer than a console.log, which is likely optimized to exit quickly and be handled later), and the CW route is probably cheaper than paying the EB event and/or SQS message tax.
Anyone with more knowledge here care to confirm or disagree with my thoughts on the approach? We are considering writing a solution and would be happy to PR it/work on a collective effort if it makes sense.
Sorry for my absence here :)
@qard There's not much I could share; the only thing that isn't specific to our usage is calling sendTransaction/sendSpan/sendError on the elastic-apm-http-client package. @jonminter looks to be on the right track with it.
@24601 You're right, using anything other than the stdio stream to cloudwatch incurs the network tax, adding either extra latency to each lambda invoke or the possibility of dropped transactions.
New Relic's solution was a pain to implement in an enterprise environment. They provide a Python script which sets everything up, which sounds like an alright idea until:
I think publishing an app to SAR would be the cleanest approach:
[0] https://aws.amazon.com/serverless/serverlessrepo/ [1] https://github.com/terraform-providers/terraform-provider-aws/issues/3981
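For completeness, the consumer side of a lambda -> cloudwatch -> lambda -> apm-server flow could look roughly like the sketch below. It assumes the hypothetical JSON log-line format from the transport sketch earlier in this thread, and the elastic-apm-http-client options and awslogs decoding shown here should be treated as assumptions to verify against the package and AWS docs:

const zlib = require('zlib')
const Client = require('elastic-apm-http-client')

const client = new Client({
  serviceName: 'apm-log-forwarder',             // illustrative values
  agentName: 'cloudwatch-forwarder',
  agentVersion: '0.0.0',
  userAgent: 'cloudwatch-forwarder/0.0.0',
  serverUrl: process.env.ELASTIC_APM_SERVER_URL,
  secretToken: process.env.ELASTIC_APM_SECRET_TOKEN
})

// CloudWatch Logs subscription events arrive gzipped and base64-encoded.
exports.handler = async (event) => {
  const payload = JSON.parse(
    zlib.gunzipSync(Buffer.from(event.awslogs.data, 'base64')).toString('utf8')
  )
  for (const logEvent of payload.logEvents) {
    // Lambda's console.log prefixes each line with a timestamp and request id,
    // so locate the JSON portion before parsing.
    const start = logEvent.message.indexOf('{"__apm"')
    if (start === -1) continue
    let line
    try { line = JSON.parse(logEvent.message.slice(start)) } catch (e) { continue }
    if (line.type === 'transaction') client.sendTransaction(line.payload)
    else if (line.type === 'span') client.sendSpan(line.payload)
    else if (line.type === 'error') client.sendError(line.payload)
  }
  // Make sure everything is on the wire before this Lambda returns.
  await new Promise((resolve) => client.flush(resolve))
  return { scanned: payload.logEvents.length }
}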
@Qard @watson in case it helps with the prioritization: we are new users, so we have Elastic v7 and it seems we need to use the latest APM agent. Most of our infrastructure is on Lambda, so it seems we've reached a dead end here until APM supports Lambda again. Do you have any date for when this will happen?
@chemalopezp No timeline yet, but we currently have an active working group designing our future serverless support. Stay tuned!
Thank you @basepi, that's great news. I'll keep an eye out for updates here ;)
Similar to the lambda -> cloudwatch -> lambda -> elasticsearch approach, a lambda -> cloudwatch -> kinesis -> functionbeat -> elasticsearch flow could work nicely as well, if we could somehow configure our existing functionbeat to handle the APM messages.
@spectorar that would be great. Currently we are able to stream logs from lambda -> cloudwatch -> functionbeat lambda -> elasticsearch (even without Kinesis the functionbeat lambda seems to work just fine). I really like this approach; it could be used for anything streaming into a CloudWatch log (the functionbeat lambda simply subscribes to that log group and streams into Elasticsearch).
@basepi do you by chance have any news regarding AWS Lambda support for the APM agent? I can imagine the actual implementation might take much longer, but it would be great to know that it will eventually be supported in the near term. Thanks!
No ETA, but we have an active working group finalizing the design for AWS.
Hi @basepi! Any updates regarding APM support for AWS Lambda? Maybe some scope of work and an estimated date, or at least whether there has been a decision to support it to some extent? Thank you very much!
I think I can safely say we are planning to support it, but we don't have anything to share yet on roadmap timing.
Hi @basepi, I hope you don't mind me bugging you again! Is there by chance any update about APM support for AWS Lambda? Thanks!
No update at the moment. Sorry for the slow progress here, between summer vacations and some additional testing we still haven't settled on the design.
Quick update here -- while we have a pretty good idea of the design, I don't know when the actual development work will reach the top of the roadmap. As a result, I still can't give any clear idea of when we'll support AWS lambda.
@chemalopezp if you're still interested in this, would you mind opening an issue in https://github.com/elastic/apm/issues ? That repo is a good place for cross-agent feature requests and is used heavily for roadmapping and gauging community interest. Thanks!
I'm closing this in favour of elastic/apm#352.
See https://discuss.elastic.co/t/apm-server-transport-error-error-write-after-end/165484 for details