elastic / apm-agent-nodejs

https://www.elastic.co/guide/en/apm/agent/nodejs/current/index.html
BSD 2-Clause "Simplified" License

Unexpected APM Server response when polling config #1749

Closed: grahamrhay closed this issue 3 years ago

grahamrhay commented 4 years ago

Describe the bug

We have a problem with our ES cluster, resulting in this error:

Error: Unexpected APM Server response when polling config
     at processConfigErrorResponse (.../node_modules/elastic-apm-http-client/index.js:711:15)

which is obviously our problem. But the error is unhandled and takes down the process. Is that the expected behaviour? I can get over missing APM data, but having my service restart constantly is less than ideal. I don't want my uptime constrained to that of the APM server.

To Reproduce

Ruin your ES

Expected behavior

Is there any way to handle this kind of error, and e.g. just log it? I found this, but I don't want to trap all exceptions from the process (we already have a handler for that, to log & exit). Can I do something like:

apm.on("error", err => { log(err); })

Am I missing something obvious? Thanks.
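
(For context, the process-wide handler mentioned above -- log & exit -- is roughly the following sketch; the details are placeholders:)

// Rough sketch of a last-resort process-wide handler (log & exit).
// The point is we don't want to widen this to also swallow APM transport errors.
process.on('uncaughtException', (err) => {
  console.error('uncaught exception, shutting down:', err)
  process.exit(1)
})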

Environment (please complete the following information)

How are you starting the agent?

Additional context

Using env vars for config

watson commented 4 years ago

Bringing down the process is obviously not the intended behavior, so I'm curious as to why it's happening. So far I've not been able to reproduce the problem. The expected behavior in this scenario is that the agent just logs the message at the error log level.

It would be interesting to see where this error is thrown from, and also what its code and response properties are set to. That might help us investigate this.

grahamrhay commented 4 years ago

Unfortunately I've since fixed our ES cluster, and can no longer reproduce it. Let me see what I can dig out from the logs.

grahamrhay commented 4 years ago

Error: Unexpected APM Server response when polling config
    at processConfigErrorResponse (.../node_modules/elastic-apm-http-client/index.js:711:15)
    at .../node_modules/elastic-apm-http-client/index.js:196:21
    at IncomingMessage.<anonymous> (.../node_modules/fast-stream-to-buffer/index.js:20:9)
    at IncomingMessage.f (.../node_modules/once/once.js:25:25)
    at IncomingMessage.onend (.../node_modules/end-of-stream/index.js:36:27)
    at IncomingMessage.emit (events.js:322:22)
    at endReadableNT (_stream_readable.js:1187:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:21)
Emitted 'error' event on TLSSocket instance at: 
    at emitErrorNT (internal/streams/destroy.js:92:8)
    at emitErrorAndCloseNT (internal/streams/destroy.js:60:3)
    at processTicksAndRejections (internal/process/task_queues.js:84:21) {
  code: 503,
  response: '{"statusCode":429,"error":"Too Many Requests","message":"[circuit_breaking_exception] [parent] Data too large, data for [<http_request>] would be [4065427520/3.7gb], which is larger than the limit of [4063657984/3.7gb], real usage: [4065427520/3.7gb], new bytes reserved: [0/0b], usages [request=16440/16kb, fielddata=341/341b, in_flight_requests=509722/497.7kb, accounting=1359924147/1.2gb], with { bytes_wanted=4065427520 & bytes_limit=4063657984 & durability=\\"PERMANENT\\" }"}'

is that any use?

watson commented 4 years ago

Thanks for the update. I'm unfortunately still not able to reproduce this issue. When I trigger this error, the agent just logs it to STDERR but doesn't crash. Have you by any chance configured a custom logger that handles errors any differently?
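
(For context, a custom logger would be wired in via the agent's logger option; a minimal sketch, with pino chosen purely as an example:)

// Sketch: passing a custom (pino-style) logger to the agent.
// Without this, the agent falls back to its default console-log-level logger.
const pino = require('pino')
require('elastic-apm-node').start({
  serviceName: 'my-service', // placeholder
  logger: pino({ level: 'error' })
})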

grahamrhay commented 4 years ago

Not that I know of. OK, thanks, I'll see if I can reproduce it somehow.

grahamrhay commented 4 years ago

I added:

return res.destroy(new Error("something bad"))

just before here, and you're right, the error was just logged the first time. But the next time _pollConfig was called, it took down the process. (I set this to a lower value.)

I have no idea why though. Does that also happen for you?

MartinKolarik commented 4 years ago

Got this today while upgrading elastic as well:

Unexpected APM Server response when polling config
node_modules/elastic-apm-http-client/index.js in processConfigErrorResponse at line 711
node_modules/elastic-apm-http-client/index.js in <anonymous> at line 196
node_modules/fast-stream-to-buffer/index.js in IncomingMessage.<anonymous> at line 20
node_modules/once/once.js in f at line 25
node_modules/end-of-stream/index.js in onend at line 36
events.js in emit at line 322
_stream_readable.js in endReadableNT at line 1187
internal/process/task_queues.js in processTicksAndRejections at line 84

My config:

module.exports = {
    logLevel: 'fatal',
    captureExceptions: false,
    captureSpanStackTraces: false,
    captureErrorLogStackTraces: 'always',
    errorOnAbortedRequests: true,
    abortedErrorThreshold: 30,
    transactionSampleRate: .2,
};

I have a custom unhandled exception handler that calls captureError() with some added info and it logged this:

error.custom.attributes.code | 503
error.custom.attributes.response | {"ok":false,"message":"Deployment is under maintenance / recovery."}
error.custom.context.handled | false
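
A minimal sketch of that handler (the exact field names are assumptions based on the logged output above):

const apm = require('elastic-apm-node')

// Sketch: forward unhandled exceptions to APM with some added info,
// then exit once the error has been sent.
process.on('uncaughtException', (err) => {
  apm.captureError(err, {
    custom: {
      attributes: { code: err.code, response: err.response }
    }
  }, () => process.exit(1))
})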

watson commented 4 years ago

Thanks for all the updates. I tried again, using your suggestions, to recreate the problem, unfortunately still without any luck. Have you configured any central config properties inside of Kibana at all?

If you have not, and aren't expecting to use that feature, you can disable it completely by setting centralConfig to false. That way the agent will not try to poll the APM Server for this information.
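
For reference, a sketch of what that looks like at agent start (service name is a placeholder):

require('elastic-apm-node').start({
  serviceName: 'my-service',
  centralConfig: false // don't poll the APM Server for central config
})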

grahamrhay commented 4 years ago

You're right, I can't make it happen in a vanilla express app. But how can it become an uncaught exception?

MartinKolarik commented 4 years ago

I was looking into this today and, based on the client code, I don't quite see how it could crash; I also couldn't reproduce it locally. Yet, we have 49 crashes reported for yesterday in production.

I'm suspecting this to be an issue somewhere in the stream handling, as the error is propagated via many modules. Unfortunately, stack traces for errors point to the place where they are created, not where they are thrown, which makes it very hard to find what exactly is causing the throw here.

watson commented 4 years ago

Ah, interesting to hear that you can't reproduce it locally either. It could be related to how this module behaves together with some other module. Maybe some other module in your dependency chain changes something with regard to what happens when an HTTP response stream is destroyed πŸ€”

Maybe it could also be a timing issue that doesn't appear when running both the Node.js app and the APM Server locally.

It could also be related to differences in how sockets are handled by the operating system, which is "hiding" the issue when we try to re-create it locally.

MartinKolarik commented 4 years ago

And is your Node.js app and APM Server not on the same machine in production?

No, but I'm also using the production APM server for my local testing.

Are you both running Mac locally and Linux in production?

Local Windows, production Linux.

grahamrhay commented 4 years ago

I'm running Debian (Stretch) in a vbox, on Arch, where I can reproduce it. The prod crashes are on Debian, and also in Docker (Alpine). The APM server is hosted at elastic.co.

I can "fix" the issue by adding:

  req.on('socket', function(socket) {
    socket.on('error', err => {
      console.log("socket error", err);
    });
  });

at L204

MartinKolarik commented 4 years ago

@grahamrhay that's interesting, I was looking into that before as well. The socket error is emitted following this destroy() call, but there should be an error handler attached by the Node HTTP client, which captures the error and re-emits it on the request instead, where it is then captured here. What you say suggests that at some point the error handler from Node core isn't bound.

pklime2 commented 4 years ago

We are seeing the same issue after upgrading to Elastic 7.7.0 yesterday. Our ES cluster and APM are all healthy, but we are getting the 503 error, which after some time results in an unhandled exception and brings down our container instance.

Error: Unexpected APM Server response when polling config
    at processConfigErrorResponse (/usr/src/api/node_modules/elastic-apm-http-client/index.js:711:15)
    at /usr/src/api/node_modules/elastic-apm-http-client/index.js:196:21
    at IncomingMessage.<anonymous> (/usr/src/api/node_modules/fast-stream-to-buffer/index.js:20:9)
    at IncomingMessage.f (/usr/src/api/node_modules/once/once.js:25:25)
    at IncomingMessage.onend (/usr/src/api/node_modules/end-of-stream/index.js:35:27)
    at IncomingMessage.emit (events.js:322:22)
    at IncomingMessage.EventEmitter.emit (domain.js:482:12)
    at endReadableNT (_stream_readable.js:1187:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:21) {
  code: 503,
  response: 'sending request to kibana failed: Post http://endpoint.containerhost:9244/api/apm/settings/agent-configuration/search: read tcp 172.x.x.x:55862->172.x.x.x:9244: read: connection reset by peer'
}
Sending error to Elastic APM { id: '95cb0a096caaad34753251610a2be0bf' }

JeremyLoy commented 4 years ago

This is happening to my project as well. It seems the timeout for connecting to Elastic is really generous: 5 minutes. For anyone else trying to reproduce this, be aware that it will take a while with that timeout value.

MartinKolarik commented 4 years ago

Happened again on all services that use APM when upgrading to 7.8 today. Guess I'll set centralConfig: false everywhere.

itsTeknas commented 4 years ago

This happened to us too in production. Kibana was 7.9.0, APM Server was also 7.9.0, and the elastic-apm-node client was 3.6.1.

We've since updated both to 7.9.1, and elastic-apm-node to ^3.7.0.

We also noticed that we were not able to access Kibana just when this happened, because the JVM memory pressure was at 90% and elastic.co was saying it was unhealthy (this is why we triggered the update).

We'll keep you posted on whether we continue to encounter the error after the update. APM did manage to capture some snapshots of the error; attaching them here hoping they might help.

(attached: APM error snapshot from an Elastic Cloud deployment, us-east-1)

P.S. Is there any workaround that I can apply in the meantime?

pklime2 commented 4 years ago

P.S. Is there any workaround that I can apply in the meantime?

Our workaround was to drop Elastic for APM. The lack of response around this issue was not confidence-inspiring, so we decided to evaluate alternative solutions instead. Not a great answer, but we just couldn't have an agent crash our container instances every time we applied an update to our Elastic cluster.

MartinKolarik commented 4 years ago

@itsTeknas @pklime2 there is a very simple workaround mentioned in the comments above. Set centralConfig: false in the APM agent options.
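
For services configured through environment variables, the equivalent should be the ELASTIC_APM_CENTRAL_CONFIG setting, e.g.:

ELASTIC_APM_CENTRAL_CONFIG=false node app.js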

itsTeknas commented 4 years ago

Upgrading both the agent and Elastic did solve our problem. We have seen no further crashes. We will still put centralConfig: false in place for safety.

I contacted Elastic support about this; they gave a disappointing answer about how to manage your ES indices better and tried to sell a support subscription :( but no escalation on the issue.

Regarding the APM issue, please follow up that Github issue you opened. As you are a Cloud Standard customer, we are able to assist you with break/fix issues on your ESS cluster. This question appears to be outside of this break/fix scope.

We have different subscription levels where we are able to provide additional support beyond break/fix as well as guaranteed SLAs: https://www.elastic.co/cloud/as-a-service/subscriptions. If that’s something you’d be interested in, please let me know and I'll be happy to put you in touch with your account representative!

Best regards,
Kathy Li
Elastic Support

I didn't expect a solution immediately, but I did expect a better escalation/response from the team.

bryantcj52 commented 4 years ago

We are seeing the same issue on one of our apps; for some reason the other 3 Node apps are not affected.

apm-server: 7.9.0
elastic-apm-node: 2.17.3

We are upgrading the agent to 3.x soon but currently on 2.x.

Screenshot from 2020-09-08 12-59-40

skubot commented 3 years ago

Also having this issue with elastic-apm-node v3.8.0, running in Koa v2.13.0

Actually this problem might be easy to reproduce... Our ELASTIC_APM_SECRET_TOKEN links to an expired Elastic trial.

 APM Server transport error (503): Unexpected APM Server response
 {"ok":false,"message":"The requested resource is currently unavailable."}

 events.js:291
       throw er; // Unhandled 'error' event
       ^

 Error: Unexpected APM Server response when polling config
     at processConfigErrorResponse (/opt/app/node_modules/elastic-apm-http-client/index.js:711:15)
     at /opt/app/node_modules/elastic-apm-http-client/index.js:196:21
     at IncomingMessage.<anonymous> (/opt/app/node_modules/fast-stream-to-buffer/index.js:20:9)
     at IncomingMessage.f (/opt/app/node_modules/once/once.js:25:25)
     at IncomingMessage.onend (/opt/app/node_modules/end-of-stream/index.js:36:27)
     at IncomingMessage.emit (events.js:326:22)
     at endReadableNT (_stream_readable.js:1223:12)
     at processTicksAndRejections (internal/process/task_queues.js:84:21)
 Emitted 'error' event on TLSSocket instance at:
     at emitErrorNT (internal/streams/destroy.js:92:8)
     at emitErrorAndCloseNT (internal/streams/destroy.js:60:3)
     at processTicksAndRejections (internal/process/task_queues.js:84:21) {
   code: 503,
   response: '{"ok":false,"message":"The requested resource is currently unavailable."}\n'
 }
 [nodemon] app crashed - waiting for file changes before starting...

Starting with centralConfig: false fixed the issue for us.

arpitkulria commented 3 years ago

Any update on this?

astorm commented 3 years ago

@arpitkulria No specific update on it, we're still looking for a set of steps to reproduce the issue in order to diagnose what's going on. Are you seeing this behavior? Could you share steps to reproduce it as well as a stack trace?

astorm commented 3 years ago

All -- we've had a chance to re-walk the code paths of both the fast-stream-to-buffer and eos libraries, and there's no obvious place where an unhandled error would be thrown. We've also set up a mock APM server that responds with nothing but 503 errors, and haven't been able to reproduce the problem.
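
For anyone who wants to try the same locally, a minimal sketch of such a mock server (port and response body are arbitrary; point the agent at it with ELASTIC_APM_SERVER_URL=http://localhost:8200):

const http = require('http')

// Mock APM Server: answer every request, including the agent's
// central-config poll, with a 503.
http.createServer((req, res) => {
  res.writeHead(503, { 'Content-Type': 'application/json' })
  res.end(JSON.stringify({ ok: false, message: 'mock 503' }))
}).listen(8200, () => {
  console.log('mock APM server listening on http://localhost:8200')
})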

This leaves us at a point where our best guess as to what's going on is that there's a common third-party dependency interacting with either the stream or client library that unintentionally causes this error to be thrown. It's also possible there's a bug/behavior in a specific version of Node.js that's bubbling this error up.

If you're seeing this problem and the centralConfig:false fix doesn't work for you, any of the following information would be useful to us in reproducing it. Once we've been able to reproduce the error, we'll be able to further diagnose and come up with a solution.

  1. Ideally -- full reproduction steps
  2. The output of npm ls (allowing us to see the specific versions of packages you have installed)
  3. The specific version of Node.js you're using (node --version)
  4. The operating system your application is running on, and the operating system (or Elastic Cloud provider) your APM Server is running on
  5. Whether you're using the cluster module either directly, or via a process manager like pm2

arpitkulria commented 3 years ago

@astorm all I did was stop Kibana, and I got this error. After adding centralConfig: false the service is working fine.

trentm commented 3 years ago

I fluked and found a local repro for this.

tl;dr: It only happens with node v12 (for me at least), and only on a second failed attempt to retrieve central config from the APM server.

Working from this part of the traceback above:

 Emitted 'error' event on TLSSocket instance at:
     at emitErrorNT (internal/streams/destroy.js:92:8)
     at emitErrorAndCloseNT (internal/streams/destroy.js:60:3)
     at processTicksAndRejections (internal/process/task_queues.js:84:21) {

I was able to match those line numbers to node v12's version of internal/streams/destroy.js. Using this patch to node_modules/elastic-apm-http-client/index.js (importantly, the part that reduces the time until the next poll) and running a hello-world app using APM with node v12:

% node --version
v12.19.1

with the app itself configured to talk to a mock APM server that responds with 503 to central-config requests, resulted in the crash.

I haven't dug in any further yet.

trentm commented 3 years ago

The "mockapmserver" that I mentioned is this file: https://gist.github.com/trentm/a6022b5b0ee9136af2bb14aa676a7d6d#file-mockapmserver-js-L78-L79

trentm commented 3 years ago

I finally got back to this... and couldn't repro for a while because I hadn't been specific about my "hello world app" before.

better details on a repro

Here are better details on a repro.

  1. Using the following demo app:
var apm = require('elastic-apm-node').start({
  serviceName: 'issue1749-two-config-crash-app',
  metricsInterval: 0,
  // captureExceptions:false so we can see the error on the console, rather
  // than just in the error object sent to APM server.
  captureExceptions: false,
  logLevel: 'debug',
  // disableInstrumentations: ['express']
})

if (process.env.PLZ_REQUIRE_EXPRESS) {
  require('express')
}
const http = require('http')
const server = http.createServer((req, res) => {
  res.setHeader('Content-Type', 'text/plain')
  res.end('hi')
})
const port = 3000
server.listen(port, () => {
  console.log(`listening at http://localhost:${port}`)
})
  2. with the following patch to "node_modules/elastic-apm-http-client/index.js" (mainly to get the shorter _scheduleNextConfigPoll() value):
diff --git a/index.js b/index.js
index c27fb71..52ae377 100644
--- a/index.js
+++ b/index.js
@@ -163,12 +163,16 @@ Client.prototype._pollConfig = function () {
   }

   const req = this._transport.get(opts, res => {
+    console.warn('XXX res.statusCode', res.statusCode)
     res.on('error', err => {
       // Not sure this event can ever be emitted, but just in case
+      console.warn('XXX in res.on(error), calling res.destroy(err)', err)
       res.destroy(err)
     })

-    this._scheduleNextConfigPoll(getMaxAge(res))
+    // XXX
+    // this._scheduleNextConfigPoll(getMaxAge(res))
+    this._scheduleNextConfigPoll(3)

     if (
       res.statusCode === 304 || // No new config since last time
@@ -180,26 +184,36 @@ Client.prototype._pollConfig = function () {
     }

     streamToBuffer(res, (err, buf) => {
-      if (err) return res.destroy(err)
+      if (err) {
+        console.warn('XXX streamToBuffer error, calling res.destroy', err)
+        return res.destroy(err)
+      }

       if (res.statusCode === 200) {
         // 200: New config available (or no config for the given service.name / service.environment)
         const etag = res.headers.etag
         if (etag) this._conf.lastConfigEtag = etag

+        // XXX Should not emit in the try/catch. If listener throws, then we catch it here.
         try {
           this.emit('config', JSON.parse(buf))
         } catch (e) {
+          console.warn('XXX config parse error, calling res.destroy', e)
           res.destroy(e)
         }
       } else {
-        res.destroy(processConfigErrorResponse(res, buf))
+        // XXX
+        console.warn('XXX config res.statusCode !== 200, calling res.destroy', res.statusCode, )
+        var cfgErr = processConfigErrorResponse(res, buf)
+        res.destroy(cfgErr)
+        // res.destroy(processConfigErrorResponse(res, buf))
       }
     })
   })

   req.on('error', err => {
-    this._scheduleNextConfigPoll()
+    console.warn('XXX req.on error', err)
+    this._scheduleNextConfigPoll(3)
     this.emit('request-error', err)
   })
 }
@@ -213,6 +227,7 @@ Client.prototype._scheduleNextConfigPoll = function (seconds) {
     this._configTimer = null
     this._pollConfig()
   }, seconds * 1000)
+  console.warn('XXX _scheduleNextConfigPoll in %ss', seconds)

   this._configTimer.unref()
 }
  3. and letting it run long enough (~15-20s) to make a second centralConfig poll

Running node foo.js -> no crash
Running PLZ_REQUIRE_EXPRESS=1 node foo.js -> crash

I.e., just require'ing "express" makes a difference here.

More specific version details for the record:

% npm ls express
play@1.0.1 /Users/trentm/tm/play
└── express@4.17.1

% npm ls elastic-apm-node
play@1.0.1 /Users/trentm/tm/play
└── elastic-apm-node@3.10.0  -> /Users/trentm/el/apm-agent-nodejs5

[11:20:35 trentm@pink:~/el/apm-agent-nodejs5 (git:tags/v3.10.0)]
% git log1 -1
* 68f27d3 - (HEAD, tag: v3.10.0) 3.10.0 (#1936) (6 weeks ago) <Trent Mick>

% git diff --staged
diff --git a/package.json b/package.json
index 6aef98b..da46f93 100644
--- a/package.json
+++ b/package.json
@@ -87,7 +87,7 @@
     "console-log-level": "^1.4.1",
     "cookie": "^0.4.0",
     "core-util-is": "^1.0.2",
-    "elastic-apm-http-client": "^9.4.2",
+    "elastic-apm-http-client": "file:../apm-nodejs-http-client3",
     "end-of-stream": "^1.4.4",
     "error-stack-parser": "^2.0.6",
     "escape-string-regexp": "^4.0.0",

[11:21:34 trentm@pink:~/el/apm-nodejs-http-client3 (git:tags/v9.4.2)]
% git log1 -1
* 5a501f5 - (HEAD, tag: v9.4.2) 9.4.2 (3 months ago) <Trent Mick>

the issue

Working backwards from the crash point.

a smaller repro demo

var http = require('http')
var agent = new http.Agent({keepAlive: true})
var opts = {
  agent: agent,
  hostname: 'www.google.com',
  method: 'GET',
  path: '/'
}

function makeReq() {
  var req = http.get(opts, function (res) {
      console.warn('res.socket "error" event listeners before end:', res.socket.listeners('error'))
      res.on('data', function (chunk) { })
      res.on('end', function () {
        console.warn('res.on end')
        console.warn('res.socket "error" event listeners after end:', res.socket.listeners('error'))
        res.destroy(new Error('call the whole thing off'))
      })
      res.on('error', function (err) {
        console.warn('res.on error', err)
      })
  })
  req.on('error', function (err) {
    console.warn('req.on error', err)
  })
}

makeReq()

setTimeout(function () {}, 10000000) // don't exit right away

Run that "bar.js" with node v12:

% node bar.js
res.socket "error" event listeners before end: [ [Function: socketErrorListener] ]
res.on end
res.socket "error" event listeners after end: []
events.js:291
      throw er; // Unhandled 'error' event
      ^

Error: call the whole thing off
    at IncomingMessage.<anonymous> (/Users/trentm/tm/play/bar.js:17:21)
    at IncomingMessage.emit (events.js:326:22)
    at endReadableNT (_stream_readable.js:1223:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:21)
Emitted 'error' event on Socket instance at:
    at emitErrorNT (internal/streams/destroy.js:92:8)
    at emitErrorAndCloseNT (internal/streams/destroy.js:60:3)
    at processTicksAndRejections (internal/process/task_queues.js:84:21)

Run it with NODE_DEBUG=http to see more of what is going on in the node core http client.

% NODE_DEBUG=http node bar.js
HTTP 64387: call onSocket 0 0
HTTP 64387: createConnection www.google.com:80: {
  agent: Agent {
    _events: [Object: null prototype] {
      free: [Function],
      newListener: [Function: maybeEnableKeylog]
    },
    _eventsCount: 2,
    _maxListeners: undefined,
    defaultPort: 80,
    protocol: 'http:',
    options: { keepAlive: true, path: null },
    requests: {},
    sockets: { 'www.google.com:80:': [] },
    freeSockets: {},
    keepAliveMsecs: 1000,
    keepAlive: true,
    maxSockets: Infinity,
    maxFreeSockets: 256,
    maxTotalSockets: Infinity,
    totalSocketCount: 0,
    [Symbol(kCapture)]: false
  },
  hostname: 'www.google.com',
  method: 'GET',
  path: null,
  port: 80,
  host: 'www.google.com',
  keepAlive: true,
  servername: 'www.google.com',
  _agentKey: 'www.google.com:80:'
}
HTTP 64387: sockets www.google.com:80: 1 1
HTTP 64387: outgoing message end.
(node:64387) Warning: Setting the NODE_DEBUG environment variable to 'http' can expose sensitive data (such as passwords, tokens and authentication headers) in the resulting log.
HTTP 64387: AGENT incoming response!
res.socket "error" event listeners before end: [ [Function: socketErrorListener] ]
HTTP 64387: AGENT socket keep-alive
res.on end
res.socket "error" event listeners after end: []
HTTP 64387: CLIENT socket onFree
HTTP 64387: agent.on(free) www.google.com:80:
events.js:291
      throw er; // Unhandled 'error' event
      ^

Error: call the whole thing off
    at IncomingMessage.<anonymous> (/Users/trentm/tm/play/bar.js:17:21)
    at IncomingMessage.emit (events.js:326:22)
    at endReadableNT (_stream_readable.js:1223:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:21)
Emitted 'error' event on Socket instance at:
    at emitErrorNT (internal/streams/destroy.js:92:8)
    at emitErrorAndCloseNT (internal/streams/destroy.js:60:3)
    at processTicksAndRejections (internal/process/task_queues.js:84:21)

So the keep-alive agent is key: once the response ends, node core hands the socket back to the agent pool and removes its 'error' listener (visible in the listener dumps above), so a later res.destroy(err) emits 'error' on a socket with no listener, and that throws. I'll start a fix on Monday.

trentm commented 3 years ago

Reopening. The GH PR linking automatically closed this, but this won't actually be fixed until there is an elastic-apm-http-client v9.5.1 release and a release of this agent that requires that version.

trentm commented 3 years ago

Fixed now with elastic-apm-node@3.12.1. Thanks all, esp. @grahamrhay, for the reporting and persistence.