elastic / apm-nodejs-http-client

**Moved to elastic/apm-agent-nodejs.** A low-level Node.js HTTP client for communicating with the Elastic APM intake API
MIT License
21 stars 30 forks source link

feat: ensure refed socket #166

Closed astorm closed 3 years ago

astorm commented 3 years ago

Addresses https://github.com/elastic/apm-nodejs-http-client/issues/165

This PR addresses the following

The use of an unrefed socket/connection can lead to a lambda function freezing before all data has been sent. This, in turn, can lead to the data being lost when the lambda unfreezes due to a closed connection

by ensuring the socket is refed, which will prevent the Lambda function from exiting early.

Additionally, with this fix in place and with the agent communicating with the lambda extension prototype (vs. communicating directly with a remote APM server), the following problem

We've observed when using the lambda callback based API (vs. using an async function or a returned promise) that the client's stream can be rendered un-writable.

was not present.

apmmachine commented 3 years ago

:green_heart: Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

#### Build stats * Start Time: 2021-10-01T06:10:08.173+0000 * Duration: 3 min 20 sec * Commit: 947bfcd6260700d62e90a5323ebd373156be1444 #### Test stats :test_tube: | Test | Results | | ------------ | :-----------------------------: | | Failed | 0 | | Passed | 16989 | | Skipped | 0 | | Total | 16989 |

trentm commented 3 years ago

Is adding a test for this behaviour reasonable? Or will that come from tests in the agent using this?

astorm commented 3 years ago

@trentm

Is adding a test for this behaviour reasonable?

Open to it, but when I tried to reason though a test case that would exercise this code in a meaningful way I came up short. Did you have something specific in mind?

trentm commented 3 years ago

Regarding a test case for this: I think it would be possible to reproduce what the Node.js Lambda Runtime is doing here, but that efforts are, at least initially, best spent on getting test cases in apm-agent-nodejs.git using this and testing with each type of Node.js lambda function handler: async, sync (i.e. using a callback), sync (with context.callbackWaitsForEmptyEventLoop = false); and playing with having a short setTimeout to artificially have an active handle.

My understanding is we should be able to setup tests for these running https://hub.docker.com/r/amazon/aws-lambda-nodejs in Docker.

Also, I think we should get those tests going in a PR on the agent (using these changes to the http-client) before this PR is merged, because I'm not sure the solution here is complete.

  1. I think there might be surprise interaction with our http-client with multiple invocations of the function on the same VM/runtime.

Looking at the AWS Lambda Node.js Runtime shows how the runtime handles the "freeze when there are no more active handles" for the callback-style sync handler:

https://github.com/aws/aws-lambda-nodejs-runtime-interface-client/blob/24866d6cf8a899577784ee1f11c3f6997cf57267/src/Runtime/CallbackContext.ts#L75-L93

It is using a "beforeExit" event handler (registered here). That's fine. However, the http-client is also using a "beforeExit" handler:

https://github.com/elastic/apm-nodejs-http-client/blob/ce3f92f2fad7f33c78ddbda945932a566d5410b5/index.js#L54-L62

which initiates (a) this._intakeRequestGracefulExitFn() which is good and (b) this.end() which I think is bad. I would think this means that either right at the end of the first invocation of the function in this VM or right at the start of the second invocation the http Client will destroy itself.

It isn't clear from your comment on #165:

We've observed when using the lambda callback based API (vs. using an async function or a returned promise) that the client's stream can be rendered un-writable.

if this might be what you were seeing. Are you able to provide more details and a repro of that observation?

I think the possible "beforeExit" interaction should be looked into before merging this PR.


  1. I think there may also be an issue with span completion before we invoke agent.flush() in our lambda wrapper here:

https://github.com/elastic/apm-agent-nodejs/blob/8eee8c0adfca788b00854a53c58f4cdcdbc98337/lib/lambda.js#L50

Currently agent.flush() does not have a way to wait for and include inflight spans. This is https://github.com/elastic/apm-agent-nodejs/issues/2294

This is a separate issue, however.

trentm commented 3 years ago

I'm happy to help look into writing tests on the agent for this, if you like.

astorm commented 3 years ago

Are you able to provide more details and a repro of that observation?

Here's two error cases we're seeing when we do not explicitly reference that socket (i.e. with the master branch client)

Error logs are here: lambda-error-logs.zip


With a simple lambda function handler that looks like this

const apm = require('elastic-apm-node').start({

})
exports.handler = apm.lambda(function handler (event, context, callback) {
          const response = {
              statusCode: 200,
              body: "hello function."
          };
          callback(null, response)
})

API gateway requests to the lambda produce errors that look like this

    {
        "log.level": "error",
        "@timestamp": "2021-10-02T18:42:52.091Z",
        "log": {
            "logger": "elastic-apm-node"
        },
        "ecs": {
            "version": "1.6.0"
        },
        "message": "APM Server transport error: Error: write after end\n    at writeAfterEnd (/var/task/node_modules/readable-stream/lib/_stream_writable.js:261:12)\n    at Client.Writable.write (/var/task/node_modules/readable-stream/lib/_stream_writable.js:305:21)\n    at Client.sendTransaction (/var/task/node_modules/elastic-apm-http-client/index.js:569:15)\n    at Instrumentation.addEndedTransaction (/var/task/node_modules/elastic-apm-node/lib/instrumentation/index.js:241:22)\n    at Instrumentation.wrappedAddEndedTransaction [as addEndedTransaction] (/var/task/node_modules/elastic-apm-node/lib/instrumentation/async-hooks.js:53:34)\n    at Transaction.end (/var/task/node_modules/elastic-apm-node/lib/instrumentation/transaction.js:250:32)\n    at captureAndMakeCompleter (/var/task/node_modules/elastic-apm-node/lib/lambda.js:48:11)\n    at wrappedLambdaCallback (/var/task/node_modules/elastic-apm-node/lib/lambda.js:61:20)\n    at Runtime.handler (/var/task/index.js:10:11)\n    at Runtime.wrappedLambda [as handler] (/var/task/node_modules/elastic-apm-node/lib/lambda.js:101:17)\n    at Runtime.handleOnce (/var/runtime/Runtime.js:66:25)"
    }

    {
        "log.level": "error",
        "@timestamp": "2021-10-02T18:42:52.111Z",
        "log": {
            "logger": "elastic-apm-node"
        },
        "ecs": {
            "version": "1.6.0"
        },
        "message": "APM Server transport error: Error: write after end\n    at writeAfterEnd (/var/task/node_modules/readable-stream/lib/_stream_writable.js:261:12)\n    at Client.Writable.write (/var/task/node_modules/readable-stream/lib/_stream_writable.js:305:21)\n    at Client.flush (/var/task/node_modules/elastic-apm-http-client/index.js:595:15)\n    at Agent.flush (/var/task/node_modules/elastic-apm-node/lib/agent.js:553:21)\n    at /var/task/node_modules/elastic-apm-node/lib/lambda.js:50:13\n    at wrappedLambdaCallback (/var/task/node_modules/elastic-apm-node/lib/lambda.js:65:9)\n    at Runtime.handler (/var/task/index.js:10:11)\n    at Runtime.wrappedLambda [as handler] (/var/task/node_modules/elastic-apm-node/lib/lambda.js:101:17)\n    at Runtime.handleOnce (/var/runtime/Runtime.js:66:25)"
    }

    {
        "log.level": "error",
        "@timestamp": "2021-10-02T18:42:52.111Z",
        "log": {
            "logger": "elastic-apm-node"
        },
        "ecs": {
            "version": "1.6.0"
        },
        "message": "Flush error: write after end"
    }

A single transaction makes it to to APM Server, but no further requests make it through until the next cold start.

Logs in the attached zip file are "case-callback.log"

+--------------------------------------------------+

With a simple lambda function that returns a promise

const apm = require('elastic-apm-node').start({})
exports.handler = apm.lambda(
  function handler (event, context) {
    return new Promise(function(resolve, reject){
          const response = {
              statusCode: 200,
              body: "hello function."
          };
          resolve(response)
    })
}
)

API gateway requests to the lambda produce error responses that look like this

    % curl -i 'https://xxx.execute-api.us-west-2.amazonaws.com/default/function-name'
    HTTP/2 500 
    date: Sat, 02 Oct 2021 18:59:47 GMT
    content-type: application/json
    content-length: 35
    apigw-requestid: Gl_VkhowPHcEP-w=

    {"message":"Internal Server Error"}

Logs in the attached zip are are "case-promise.log"