elastic / apm-agent-nodejs

https://www.elastic.co/guide/en/apm/agent/nodejs/current/index.html
BSD 2-Clause "Simplified" License

Lambda support appears to be broken in v2.x #827

Closed. watson closed this issue 4 years ago.

watson commented 5 years ago

See https://discuss.elastic.co/t/apm-server-transport-error-error-write-after-end/165484 for details

alvarolobato commented 5 years ago

@Qard let's timebox some investigation on this and see if we get to an easy fix.

alvarolobato commented 5 years ago

Related to https://github.com/elastic/apm-agent-nodejs/issues/798

davebond commented 5 years ago

Has there been any progress on this issue? I can submit our workaround as a PR, but I'm not sure that it's the best long-term solution.

Qard commented 5 years ago

I was unable to reproduce it in my initial testing. I did do some restructuring recently to support async/await style though, which has not yet been released. That might impact it.

davebond commented 5 years ago

Unfortunately it is still an issue. From my poking around, the cause is:

Our workaround is to track this error state and ensure that the transport is open before sending any transactions: https://github.com/status200/apm-agent-nodejs/commit/629cd9fb1e8bbfd3b6089c42791bf58dcadfee5d

Though it seems like a larger restructure to track the lifecycle and state of the transport would be more appropriate as this could also be an issue for clients running long enough to encounter network hiccups.
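For illustration, a generic sketch of that "track the transport state and make sure it is open before sending" idea (plain Node streams only; this is not the code from the linked commit, and the helper names are hypothetical):

const { PassThrough } = require('stream')

let stream = null
let streamEnded = true

function ensureOpen () {
  // hypothetical helper: recreate the writable if it has ended or errored,
  // so we never write into a dead stream
  if (!stream || streamEnded) {
    stream = new PassThrough()
    stream.on('close', () => { streamEnded = true })
    stream.on('error', () => { streamEnded = true })
    streamEnded = false
  }
  return stream
}

function send (payload) {
  // check the transport state before every send instead of assuming it is still open
  ensureOpen().write(JSON.stringify(payload) + '\n')
}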

Qard commented 5 years ago

I'm able to reproduce the ECONNRESET, but the client just reconnects and seems to send the new transaction just fine. I'm not getting the write after end error at all.

Are you able to provide a code example that can reproduce the write after end?

HuddleHouse commented 5 years ago

For what it's worth, I got it working by editing the code from the docs. I'm running Node 8.10 on Lambda.

What the docs show, which caused "APM Server transport error: Error: write after end":

exports.handler = apm.lambda(function handler (payload, context, callback) {
  callback(null, `Hello, ${payload.name}!`)
})

What worked for me:

exports.handler = apm.lambda(async (payload, context, callback) => {
  callback(null, `Hello, ${payload.name}!`)
})

davebond commented 5 years ago

Sorry for the delay. @HuddleHouse is correct: the example from the docs still produces the issue.

There are two methods I can see which stop this:

// Pass promise to lambda
exports.handler = apm.lambda(async (payload, context, callback) => {
  callback(null, `Hello, ${payload.name}!`)
})

// callbackWaitsForEmptyEventLoop
exports.handler = apm.lambda(function handler (payload, context, callback) {
  context.callbackWaitsForEmptyEventLoop = false;
  callback(null, `Hello, ${payload.name}!`)
})

Both of these seem to depend on Lambda's handling of the event loop, and I wouldn't like to count on that behaviour staying consistent over time.

tallakh commented 5 years ago

I still get this error when using the latest apm in a lambda. I start apm like this:

exports.handler = apm.lambda(async (event, context, callback) => {
    ...
    callback(null, `Successfully processed ${items} events`);
});

2019-04-15T10:48:16.268Z 4f47d833-61f4-4af4-b824-cf46e5d0880d APM Server transport error: Error: write after end
    at writeAfterEnd (/var/task/node_modules/readable-stream/lib/_stream_writable.js:257:12)
    at Client.Writable.write (/var/task/node_modules/readable-stream/lib/_stream_writable.js:301:21)
    at Client.sendTransaction (/var/task/node_modules/elastic-apm-http-client/index.js:276:15)
    at Instrumentation.addEndedTransaction (/var/task/node_modules/elastic-apm-node/lib/instrumentation/index.js:185:22)
    at Instrumentation.wrappedAddEndedTransaction [as addEndedTransaction] (/var/task/node_modules/elastic-apm-node/lib/instrumentation/async-hooks.js:53:34)
    at Transaction.end (/var/task/node_modules/elastic-apm-node/lib/instrumentation/transaction.js:218:32)
    at captureAndMakeCompleter (/var/task/node_modules/elastic-apm-node/lib/lambda.js:48:11)
    at wrappedLambdaCallback (/var/task/node_modules/elastic-apm-node/lib/lambda.js:60:20)
    at exports.handler.apm.lambda (/var/task/index.js:38:9)
    at <anonymous>
    at process._tickDomainCallback (internal/process/next_tick.js:228:7)

Qard commented 5 years ago

Yes, it's still a known issue. It's proving to be complicated to fix. Basically, when Lambda freezes the VM it disconnects any sockets still in-use and, when the process wakes up again, it's unaware the sockets have been closed until after it tries to write to them. This is resulting in the write after end happening when the next message is sent to the apm-server.
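For reference, the error itself is plain Node stream behaviour and can be reproduced without the agent at all; a minimal standalone sketch of the failure mode:

const { Writable } = require('stream')

// stand-in for the HTTP request stream the client writes events to
const sink = new Writable({ write (chunk, enc, cb) { cb() } })
sink.on('error', (err) => console.error('transport error:', err.message))

sink.end()          // roughly what happens to the socket while Lambda freezes the VM
sink.write('data')  // the next invocation's write -> Error: write after end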

watson commented 5 years ago

I've been thinking that we need a way to either force close the underlying TCP socket after each lambda function execution (though that would increase the function running-time overhead), or we should somehow ignore those errors when running in a lambda context (though that might hide some real errors).

We should probably also update the docs to suggest that you disable metrics when running in Lambda, as the metrics collector will keep trying to reopen the TCP socket - and metrics inside a Lambda don't make much sense anyway.
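For anyone trying this in the meantime, a minimal sketch of that configuration, assuming the usual agent start call (serviceName and serverUrl values are placeholders):

const apm = require('elastic-apm-node').start({
  serviceName: 'my-lambda',              // placeholder
  serverUrl: 'https://apm.example.com',  // placeholder
  // '0s' disables the periodic metrics reporter, so the agent doesn't keep
  // trying to reopen the TCP socket between invocations
  metricsInterval: '0s'
})

exports.handler = apm.lambda(async (event, context, callback) => {
  callback(null, `Hello, ${event.name}!`)
})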

martinduris commented 5 years ago

Hi all. To summarize the current situation for everybody reading this issue (please correct me):

Do NOT upgrade to or use Elastic Stack 7.x, because version 1.x of the Node.js agent is not supported with APM Server 7.x.

If using the Node.js agent in AWS Lambda, use an Elastic Stack version < 6.5.x and an agent version < 2.x.

Martin

willbrazil commented 5 years ago

Hi @watson,

We should probably also update the docs do suggest that you disable metrics when running in lambda as that will keep trying to reopen the TCP socket

We've tried disabling metrics by passing the metricsInterval config option but we still experience the issue. Do you have other suggestions we may try?

We're running server v7.0.0 with node client ^2.11.5.

Thank you so much for all the work here!

watson commented 5 years ago

@willbrazil Setting metricsInterval: 0 will only reduce the likelihood that this error appears, not completely eliminate it. As it stands currently, you might see this error when the VM is being resumed after having been suspended by AWS Lambda. Unfortunately, we can't detect when this happens. However, all data being sent afterward, while the VM is still running, should make it to the APM Server successfully.

If your lambda function is only called once every 20 minutes, this might mean that no data, or only very little, gets through. But if your lambda function is called several times a minute or more, 99+% of the collected data should get through as expected.

willbrazil commented 5 years ago

Thank you for such a quick response @watson! Some of our lambdas are handling ~60tps.

We've reproduced it in a more controlled environment where we sent a request every second, and unfortunately we still experience the issue. We'll keep digging here. I know this is a tricky one, especially because it's hard to repro locally.

Thank you again for all the info!

axw commented 5 years ago

I did a bit of investigation, and found a couple of issues.

The stack trace in https://github.com/elastic/apm-agent-nodejs/issues/827#issuecomment-483208921 is due to the client being ended in the transport in response to the beforeExit event:

https://github.com/elastic/apm-nodejs-http-client/blob/7480d2e781a1dc378d348a7835d16265d108df12/index.js#L42

Lambda apparently relies on this for scheduling multiple (i.e. warm) Lambda invocations. After the first cold invocation the client is ended, and then subsequent invocations will trigger the stack trace after attempting to write to an ended stream.
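A simplified sketch of that pattern (not the actual client source, just the shape of it):

const { Writable } = require('stream')

// stand-in for the client's underlying stream
const client = new Writable({ write (chunk, enc, cb) { cb() } })

// the hook linked above: end the stream when Node is about to exit, which on
// Lambda happens after every invocation once the event loop drains
process.once('beforeExit', () => {
  client.end()
})

// a later (warm) invocation then effectively does:
//   client.write(serializedTransaction)  // -> Error: write after end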

Commenting that out, I encountered a second issue: intermittent 502 errors, with the response body of {"message": "Internal server error"}. It wasn't obvious where this was coming from (logs weren't helpful), but I eventually narrowed it down to invoking the Lambda callback function inside of agent.flush:

https://github.com/elastic/apm-agent-nodejs/blob/63a48c28deeb98184b96dd7221af3936163e9a72/lib/lambda.js#L51-L54

With some additional logging I could see that the flush was occurring after the next beforeExit. According to https://docs.aws.amazon.com/lambda/latest/dg/running-lambda-code.html:

Background processes or callbacks initiated by your Lambda function that did not complete when the function ended resume if AWS Lambda chooses to reuse the execution context. You should make sure any background processes or callbacks in your code are complete before the code exits.

I suppose that the Lambda runtime is implicitly calling the callback with an empty body. If that is indeed the case, the required Lambda Proxy payload (statusCode, body) will be missing, and likely cause the 502 error.
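A simplified sketch of the wrapper behaviour described above (the real code is in the linked lib/lambda.js lines; this only shows the shape of it):

function wrapLambdaCallback (agent, originalCallback) {
  return function wrappedCallback (err, result) {
    // the transaction is ended and flushed before the real callback runs; if
    // the flush callback fires only after the runtime has already finished the
    // invocation (past the next beforeExit), the proxy integration never sees
    // { statusCode, body } and answers with a 502
    agent.flush(function flushed () {
      originalCallback(err, result)
    })
  }
}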

tallakh commented 5 years ago

Hi all,

Is there any progress on fixing this issue? Right now we are stuck on Elastic stack 6.5 and APM agent 1.14.5 and we would really like to use the latest releases.

Thanks.

davebond commented 5 years ago

We ended up writing our own integration to get around this issue and will be sticking with it regardless of how this issue plays out.

I think the current design is a problem for lambda as:

Our solution:

This makes APM message processing passive, isolated from the main system, and it does not impose networking requirements on the function.

The recently released New Relic offering for Lambda also takes this approach, which has given us some more confidence in this decision. https://docs.newrelic.com/docs/serverless-function-monitoring/aws-lambda-monitoring/get-started/enable-new-relic-monitoring-aws-lambda

Qard commented 5 years ago

We've actually discussed the cloudwatch logs approach for exactly the same reasons you mentioned. It just hasn't yet been prioritized.

tallakh commented 5 years ago

Thanks @davebond, that seems like a sensible approach. Is this something that is on the roadmap for the AWS lambda integration for APM @Qard?

Qard commented 5 years ago

There's no specific plan yet, just some discussion. We haven't had many lambda users yet, so it hasn't been prioritized very high so far. Seems like interest has been rising in the last few weeks though.

icarusmiles commented 5 years ago

Same issue on the latest stack version 7.4.0 with elastic-apm-node ^3.0.0.

It would be a nice addition. In the meantime, I'm using CloudWatch Lambda triggers similar to @davebond's solution.

tallakh commented 5 years ago

Hi @icarusmiles and @davebond.

Can you share some more details on how you have done this? Is it possible to make apm-agent-nodejs log the requests to the console, so they end up in CloudWatch, or have you implemented a custom APM agent? And the Lambda pushing messages to the APM server, is that custom as well?

Cheers

davebond commented 5 years ago

Hi @tallakh

We opted to write our own APM layer for Node, a simple implementation based on async hooks. There were reasons for this other than this particular issue, so I wouldn't recommend it for everyone. We store APM messages in our own format until they hit our processing service, which translates them into Elastic's format and sends them off to the APM Server using the elastic-apm-http-client package.

From my poking around in this APM package earlier in the year, I think the process for implementing this with the existing packages would look something like:
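For illustration, a rough sketch of the in-Lambda half of that idea: a transport stub that does no network I/O during the invocation and only writes serialized events to stdout, where CloudWatch Logs picks them up. The method names mirror the elastic-apm-http-client ones mentioned later in this thread; the class name and the __elastic_apm marker are hypothetical.

class LogTransport {
  // same surface as the HTTP client, but everything goes to stdout/CloudWatch
  sendTransaction (transaction, cb) { this._emit('transaction', transaction, cb) }
  sendSpan (span, cb) { this._emit('span', span, cb) }
  sendError (error, cb) { this._emit('error', error, cb) }
  flush (cb) { if (cb) cb() }
  _emit (type, payload, cb) {
    console.log(JSON.stringify({ __elastic_apm: true, type, payload }))
    if (cb) cb()
  }
}

module.exports = LogTransport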

Qard commented 5 years ago

Heh, clever. That could actually work pretty well. :)

tallakh commented 5 years ago

Thank you @davebond, that was really helpful!

Qard commented 5 years ago

@davebond Would you mind sharing the code you used for your lambda and log transport? It could be helpful for other people that want to use the agent on lambda to have that sort of option.

jonminter commented 4 years ago

@Qard @alvarolobato Any update on this? I might take a stab at implementing something with @davebond's approach of logging to CW and using a Lambda function to subscribe to the CW log stream and send to the APM server.

I have a project at work where I'm using EL(Lambda, not Logstash, haha)K for log aggregation/monitoring for our services; it would be nice to have Elastic APM integrated into that monitoring solution.

If I can get that working, I'd be happy to work with you guys to get APM working for Lambda again, if that's an approach you want to move towards.

lreuven commented 4 years ago

@jonminter thanks for the feedback. We would be happy for any assistance/code contribution, and we will be happy to assist. We are working on a more extensive solution, but it will take some time; we'd be happy to chat about your needs. cc @basepi @nehaduggal

jonminter commented 4 years ago

@lreuven Ok, awesome! I'll try to put together a proof of concept of what I was thinking. From looking at the code, my thought was to create a separate transport class following the interface of the http client that writes out logs with the serialized APM data, and then write a Lambda function that picks up those messages and uses the http client package to send them to the APM server.
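A rough sketch of that second half: a forwarder Lambda subscribed to the CloudWatch log group that replays the serialized events through elastic-apm-http-client. The constructor options shown are the client's documented required ones, but all values and the __elastic_apm marker are hypothetical, and this is a sketch rather than a drop-in implementation.

const zlib = require('zlib')
const Client = require('elastic-apm-http-client')

const client = new Client({
  serviceName: 'my-service',        // placeholder
  agentName: 'nodejs',
  agentVersion: '0.0.0',            // placeholder
  userAgent: 'apm-log-forwarder',
  serverUrl: process.env.APM_SERVER_URL
})

exports.handler = async (event) => {
  // CloudWatch Logs subscription payloads arrive base64-encoded and gzipped
  const payload = JSON.parse(
    zlib.gunzipSync(Buffer.from(event.awslogs.data, 'base64')).toString('utf8')
  )
  for (const logEvent of payload.logEvents) {
    let msg
    try { msg = JSON.parse(logEvent.message) } catch (e) { continue }
    if (!msg || !msg.__elastic_apm) continue
    if (msg.type === 'transaction') client.sendTransaction(msg.payload)
    else if (msg.type === 'span') client.sendSpan(msg.payload)
    else if (msg.type === 'error') client.sendError(msg.payload)
  }
  // make sure everything is written out before the forwarder itself returns
  await new Promise((resolve) => client.flush(resolve))
}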

I don't have the code in front of me right now, but I think I remember some config validation that checks whether you supplied the (required) server URL, which wouldn't be used for this. For a POC I'd probably just leave that as is and ignore it in the AWS Lambda transport handler.

I'd also be curious as to what your plans are long term, and whether you were considering going down a different path?

basepi commented 4 years ago

@jonminter Nothing to share yet. From my research, the lambda -> cloudwatch -> lambda -> elasticsearch flow is likely the most promising route. We just need to figure out how to smooth onboarding in that flow, which is a hard problem.

jonminter commented 4 years ago

@basepi Yes, I know what you mean. With the original approach all the user has to do is include the package and wrap their lambda handler and they're good to go.

You could bundle some CloudFormation templates and/or a Terraform script that deploys the Lambda function and CloudWatch Logs subscription in the user's AWS account. That's still more steps than before, but it should help make the process smoother.

24601 commented 4 years ago

Any thoughts on the wisdom of the approach New Relic took (CloudWatch log ingestion, basically) versus writing a transport class that writes the serialized version of the message to EventBridge or SQS before the Lambda returns, with a worker or a Lambda picking up the serialized form and then batch-sending it?

I suspect the performance penalty of this versus CW (which is highly optimized) is non-trivial: even awaiting a publish to EB or SQS probably takes much longer than a console.log, which is likely highly optimized to return quickly and be handled later. The CW route is probably also cheaper than paying the EB event and/or SQS message tax.

Anyone with more knowledge here care to confirm or disagree with my thoughts on the approach? We are considering writing a solution and would be happy to PR it/work on a collective effort if it makes sense.

davebond commented 4 years ago

Sorry for my absence here :)

@qard There's not much I could share, the only thing that isn't specific to our usage is calling sendTransaction/sendSpan/sendError on the elastic-apm-http-client package. @jonminter looks to be on the right track with it.

@24601 You're right, using anything other than the stdio stream to cloudwatch incurs the network tax, adding either extra latency to each lambda invoke or the possibility of dropped transactions.

New Relic's solution was a pain to implement in an enterprise environment. They provide a Python script which sets everything up, which sounds like an alright idea until:

I think publishing an app to SAR would be the cleanest approach:

[0] https://aws.amazon.com/serverless/serverlessrepo/
[1] https://github.com/terraform-providers/terraform-provider-aws/issues/3981

chemalopezp commented 4 years ago

@Qard @watson in case it helps with prioritization: we are new users, so we have Elastic v7 and it seems we need to use the latest APM agent. Most of our infrastructure is on Lambda, so it seems we've reached a dead end here until APM supports Lambda again. Do you have any date on when this will happen?

basepi commented 4 years ago

@chemalopezp No timeline yet, but we currently have an active working group designing our future serverless support. Stay tuned!

chemalopezp commented 4 years ago

Thank you @basepi, that's great news. I'll keep an eye out for updates here ;)

spectorar commented 4 years ago

Similar to the lambda -> cloudwatch -> lambda -> elasticsearch approach, a lambda -> cloudwatch -> kinesis -> functionbeat -> elasticsearch flow could work nicely as well, if we could somehow configure our existing functionbeat to handle the APM messages.

chemalopezp commented 4 years ago

@spectorar that would be great. Currently we are able to stream logs from lambda -> cloudwatch -> functionbeat lambda -> elasticsearch (even without Kinesis, the Functionbeat Lambda seems to work just fine). I really like this approach, which could be used for anything streaming into a CloudWatch log (the Functionbeat Lambda simply subscribes to that log group and streams into Elasticsearch).

chemalopezp commented 4 years ago

@basepi do you have by chance any news regarding AWS Lambda support for the APM agent? I can imagine the actual implementation might take much longer, but it would be great to know whether it will eventually be supported in the near term. Thanks!

basepi commented 4 years ago

No ETA, but we have an active working group finalizing the design for AWS.

chemalopezp commented 4 years ago

Hi @basepi! Any updates regarding APM support for AWS Lambda? Maybe some scope of work and an estimated date, or at least whether there has been a decision to support it to some extent? Thank you very much!

basepi commented 4 years ago

I think I can safely say we are planning to support it, but we don't have anything to share yet on roadmap timing.

chemalopezp commented 4 years ago

Hi @basepi, I hope you don't mind me bugging you again! Is there by chance any update on APM support for AWS Lambda? Thanks!

basepi commented 4 years ago

No update at the moment. Sorry for the slow progress here, between summer vacations and some additional testing we still haven't settled on the design.

basepi commented 4 years ago

Quick update here -- while we have a pretty good idea of the design, I don't know when the actual development work will reach the top of the roadmap. As a result, I still can't give any clear idea of when we'll support AWS lambda.

@chemalopezp if you're still interested in this, would you mind opening an issue in https://github.com/elastic/apm/issues ? That repo is a good place for cross-agent feature requests and is used heavily for roadmapping and gauging community interest. Thanks!

trentm commented 4 years ago

I'm closing this in favour of elastic/apm#352.