Closed grahamrhay closed 3 years ago
Bringing down the process is obviously not the intended behavior, so I'm curious as to why it appears to do so. So far I've not been able to reproduce the problem. The expected behavior in this scenario is that the agent should just log the message at the error log level.
It would be interesting to see where this error is thrown from, and also what its code and response properties are set to. That might help us investigate this.
unfortunately I've since fixed our ES cluster, and can no longer reproduce it. let me see what I can dig out from the logs
Error: Unexpected APM Server response when polling config
at processConfigErrorResponse (.../node_modules/elastic-apm-http-client/index.js:711:15)
at .../node_modules/elastic-apm-http-client/index.js:196:21
at IncomingMessage.<anonymous> (.../node_modules/fast-stream-to-buffer/index.js:20:9)
at IncomingMessage.f (.../node_modules/once/once.js:25:25)
at IncomingMessage.onend (.../node_modules/end-of-stream/index.js:36:27)
at IncomingMessage.emit (events.js:322:22)
at endReadableNT (_stream_readable.js:1187:12)
at processTicksAndRejections (internal/process/task_queues.js:84:21)
Emitted 'error' event on TLSSocket instance at:
at emitErrorNT (internal/streams/destroy.js:92:8)
at emitErrorAndCloseNT (internal/streams/destroy.js:60:3)
at processTicksAndRejections (internal/process/task_queues.js:84:21) {
code: 503,
response: '{"statusCode":429,"error":"Too Many Requests","message":"[circuit_breaking_exception] [parent] Data too large, data for [<http_request>] would be [4065427520/3.7gb], which is larger than the limit of [4063657984/3.7gb], real usage: [4065427520/3.7gb], new bytes reserved: [0/0b], usages [request=16440/16kb, fielddata=341/341b, in_flight_requests=509722/497.7kb, accounting=1359924147/1.2gb], with { bytes_wanted=4065427520 & bytes_limit=4063657984 & durability=\\"PERMANENT\\" }"}'
is that any use?
Thanks for the update. I'm unfortunately still not able to reproduce this issue. When I trigger this error, the agent just logs it to STDERR but doesn't crash. Have you by any chance configured a custom logger that handles errors any differently?
not that I know of. OK, thanks, I'll see if I can reproduce it somehow
Got this today while upgrading elastic as well:
Unexpected APM Server response when polling config
node_modules/elastic-apm-http-client/index.js in processConfigErrorResponse at line 711
node_modules/elastic-apm-http-client/index.js in <anonymous> at line 196
node_modules/fast-stream-to-buffer/index.js in IncomingMessage.<anonymous> at line 20
node_modules/once/once.js in f at line 25
node_modules/end-of-stream/index.js in onend at line 36
events.js in emit at line 322
_stream_readable.js in endReadableNT at line 1187
internal/process/task_queues.js in processTicksAndRejections at line 84
My config:
module.exports = {
  logLevel: 'fatal',
  captureExceptions: false,
  captureSpanStackTraces: false,
  captureErrorLogStackTraces: 'always',
  errorOnAbortedRequests: true,
  abortedErrorThreshold: 30,
  transactionSampleRate: .2,
};
I have a custom unhandled exception handler that calls captureError() with some added info, and it logged this:
error.custom.attributes.code: 503
error.custom.attributes.response: {"ok":false,"message":"Deployment is under maintenance / recovery."}
error.custom.context.handled: false
Thanks for all the updates. I tried again, using your suggestions, to recreate the problem, unfortunately still without any luck. Have you configured any central config properties inside of Kibana at all?
If you have not, and aren't expecting to use that feature, you can disable it completely by setting centralConfig to false. That way the agent will not try to poll the APM Server for this information.
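As a minimal sketch of that workaround (serviceName and serverUrl are placeholder values for your own environment), disabling central config looks like:

```javascript
// Hypothetical minimal agent setup; serviceName and serverUrl are
// placeholder values for your own environment.
require('elastic-apm-node').start({
  serviceName: 'my-service',
  serverUrl: 'http://localhost:8200',
  // Skip polling the APM Server for central (Kibana-managed) config,
  // which avoids the config-poll request path entirely.
  centralConfig: false
})
```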
you're right, I can't make it happen in a vanilla express app. but how can it become an uncaught exception?
I was looking into this today and based on the client code, I don't quite see how it could crash and I also couldn't reproduce it locally. Yet, we have 49 crashes reported for yesterday in production.
I'm suspecting this to be an issue somewhere in the stream handling, as the error is propagated via many modules. Unfortunately, stack traces for errors point to the place where they are created, not where they are thrown, which makes it very hard to find what exactly is causing the throw here.
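A small sketch of that last point (hypothetical function names; V8 captures an Error's stack when the object is constructed, not when it is thrown):

```javascript
// Demonstrates that err.stack records where the Error was *created*,
// not where it was later thrown.
function createError () {
  return new Error('boom') // stack is captured here
}

function throwLater (err) {
  throw err // this frame does not appear in err.stack
}

let caught
try {
  throwLater(createError())
} catch (e) {
  caught = e
}

console.log(caught.stack.includes('createError')) // true
console.log(caught.stack.includes('throwLater')) // false
```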
Ah, interesting to hear that you can't reproduce it locally either. It could be related to how this module behaves together with some other module. Maybe some other module in your dependency chain changes something with regards to what happens when an HTTP response stream is destroyed.
Maybe it could also be a timing issue where the issue doesn't appear when running both the Node.js app and the APM Server locally.
It could also be related to differences in how sockets are handled by the operation system, that's "hiding" the issue when we try to re-create it locally.
And are your Node.js app and APM Server not on the same machine in production?
No, but I'm also using the production APM server for my local testing.
Are you both running Mac locally and Linux in production?
Local Windows, production Linux.
I'm running Debian (Stretch) in a VBox on Arch, where I can reproduce it. The prod crashes are on Debian, and also in Docker (Alpine). The APM server is hosted at elastic.co
I can "fix" the issue by adding:
req.on('socket', function(socket) {
socket.on('error', err => {
console.log("socket error", err);
});
});
at L204
@grahamrhay that's interesting, I was looking into that before as well. The socket error is emitted following this destroy() call, but there should be an error handler attached by the Node HTTP client, which captures the error and emits it on the request instead, which is then captured here. What you say would suggest that at some point in time the error handler from Node core isn't bound.
We are seeing the same issue after upgrading to Elastic 7.7.0 yesterday. Our ES cluster and APM are all healthy, but we are getting the 503 error which after some time results in an unhandled exception and brings down our container instance.
Error: Unexpected APM Server response when polling config
at processConfigErrorResponse (/usr/src/api/node_modules/elastic-apm-http-client/index.js:711:15)
at /usr/src/api/node_modules/elastic-apm-http-client/index.js:196:21
at IncomingMessage.<anonymous> (/usr/src/api/node_modules/fast-stream-to-buffer/index.js:20:9)
at IncomingMessage.f (/usr/src/api/node_modules/once/once.js:25:25)
at IncomingMessage.onend (/usr/src/api/node_modules/end-of-stream/index.js:35:27)
at IncomingMessage.emit (events.js:322:22)
at IncomingMessage.EventEmitter.emit (domain.js:482:12)
at endReadableNT (_stream_readable.js:1187:12)
at processTicksAndRejections (internal/process/task_queues.js:84:21) {
code: 503,
response: 'sending request to kibana failed: Post http://endpoint.containerhost:9244/api/apm/settings/agent-configuration/search: read tcp 172.x.x.x:55862->172.x.x.x:9244: read: connection reset by peer'
}
Sending error to Elastic APM { id: '95cb0a096caaad34753251610a2be0bf' }
This is happening to my project as well, it seems that the timeout for connecting to elastic is really generous, 5 minutes. For anyone else trying to reproduce this, be aware that it will take a while with that timeout value.
Happened again on all services that use APM when upgrading to 7.8 today. Guess I'll set centralConfig: false everywhere.
This happened to us too in production. Kibana version was 7.9.0, APM version was also 7.9.0, and the elastic-apm-node client was 3.6.1.
We've since updated both versions to 7.9.1, and elastic-apm-node to ^3.7.0.
Also, we noticed that we were not able to access Kibana just when this had happened, because the JVM memory pressure was at 90% and elastic.co was saying it's unhealthy (this is why we triggered the update).
We'll keep you posted on whether we continue to encounter the error after the update. APM did manage to capture some snapshots of the error; attaching them here hoping they might help.
P.S. Is there any workaround that I can apply in the meantime?
Our workaround was to dump Elastic for APM. The lack of response around this issue was not confidence inspiring so we decided to evaluate alternative solutions instead. Not a great answer, but we just couldn't have an agent crash our container instances every time we applied an update to our Elastic cluster.
@itsTeknas @pklime2 there is a very simple workaround mentioned in the comments above. Set centralConfig: false in the APM agent options.
Upgrading both the agent and Elastic did solve our problem. We have seen no further crashes. We will still put centralConfig: false into place for safety.
I contacted Elastic support about this; they gave a disappointing answer about how to manage your ES indexes better and tried to sell a support subscription :( but no escalation on the issue.
Regarding the APM issue, please follow up that Github issue you opened. As you are a Cloud Standard customer, we are able to assist you with break/fix issues on your ESS cluster. This question appears to be outside of this break/fix scope.
We have different subscription levels where we are able to provide additional support beyond break/fix as well as guaranteed SLAs: https://www.elastic.co/cloud/as-a-service/subscriptions. If that's something you'd be interested in, please let me know and I'll be happy to put you in touch with your account representative!
Best regards, Kathy Li Elastic Support
I didn't expect a solution immediately, but I did expect a better escalation/response from the team.
We are seeing the same issue on one of our apps, for some reason the other 3 node apps are not affected.
apm-server: 7.9.0 elastic-apm-node: 2.17.3
We are upgrading the agent to 3.x soon but currently on 2.x.
Also having this issue with elastic-apm-node v3.8.0, running in Koa v2.13.0
Actually, this problem might be easy to reproduce... Our ELASTIC_APM_SECRET_TOKEN links to an expired Elastic trial.
APM Server transport error (503): Unexpected APM Server response
{"ok":false,"message":"The requested resource is currently unavailable."}
events.js:291
throw er; // Unhandled 'error' event
^
Error: Unexpected APM Server response when polling config
at processConfigErrorResponse (/opt/app/node_modules/elastic-apm-http-client/index.js:711:15)
at /opt/app/node_modules/elastic-apm-http-client/index.js:196:21
at IncomingMessage.<anonymous> (/opt/app/node_modules/fast-stream-to-buffer/index.js:20:9)
at IncomingMessage.f (/opt/app/node_modules/once/once.js:25:25)
at IncomingMessage.onend (/opt/app/node_modules/end-of-stream/index.js:36:27)
at IncomingMessage.emit (events.js:326:22)
at endReadableNT (_stream_readable.js:1223:12)
at processTicksAndRejections (internal/process/task_queues.js:84:21)
Emitted 'error' event on TLSSocket instance at:
at emitErrorNT (internal/streams/destroy.js:92:8)
at emitErrorAndCloseNT (internal/streams/destroy.js:60:3)
at processTicksAndRejections (internal/process/task_queues.js:84:21) {
code: 503,
response: '{"ok":false,"message":"The requested resource is currently unavailable."}\n'
}
[nodemon] app crashed - waiting for file changes before starting...
Starting with centralConfig: false fixed the issue for us.
Any Update on this?
@arpitkulria No specific update on it, we're still looking for a set of steps to reproduce the issue in order to diagnose what's going on. Are you seeing this behavior? Could you share steps to reproduce it as well as a stack trace?
All -- we've had a chance to re-walk the code paths of both the fast-stream-to-buffer and eos libraries, and there's no obvious place that an unhandled error would be thrown. We've also set up a mock APM server that responds with nothing but 503 errors and haven't been able to reproduce the problem.
This leaves us at a point where our best guess as to what's going on is that there's a common third-party dependency interacting with either the stream or client library that, unintentionally, causes this error to be thrown. It's also possible there's a bug/behavior in a specific version of Node.js that's bubbling this error up.
If you're seeing this problem and the centralConfig: false fix doesn't work for you, any of the following information would be useful to us in reproducing it. Once we've been able to reproduce the error, we'll be able to further diagnose and come up with a solution.
- the output of npm ls (allowing us to see the specific versions of packages you have installed)
- your Node.js version (node --version)
@astorm all I did was stop Kibana, and I got this error. After adding centralConfig: false, the service is working fine.
I fluked and found a local repro for this.
tl;dr: It only happens with node v12 (for me at least), and only on a second failed attempt to retrieve central config from the APM server.
Working from this part of the traceback above:
Emitted 'error' event on TLSSocket instance at:
at emitErrorNT (internal/streams/destroy.js:92:8)
at emitErrorAndCloseNT (internal/streams/destroy.js:60:3)
at processTicksAndRejections (internal/process/task_queues.js:84:21) {
I was able to match those line numbers to node v12's version of internal/streams/destroy.js. Using this patch to node_modules/elastic-apm-http-client/index.js (importantly the part reducing the time until the next poll), and running a hello-world app using APM with node v12:
% node --version
v12.19.1
with the app configured to talk to a mock APM server that responds with 503 to central config requests, resulted in the crash.
I haven't dug in any further yet.
The "mockapmserver" that I mentioned is this file: https://gist.github.com/trentm/a6022b5b0ee9136af2bb14aa676a7d6d#file-mockapmserver-js-L78-L79
I finally got back to this... and couldn't repro for a while because I hadn't been specific about my "hello world app" before.
Here are better details on a repro.
var apm = require('elastic-apm-node').start({
serviceName: 'issue1749-two-config-crash-app',
metricsInterval: 0,
// captureExceptions:false so we can see the error on the console, rather
// than just in the error object sent to APM server.
captureExceptions: false,
logLevel: 'debug',
// disableInstrumentations: ['express']
})
if (process.env.PLZ_REQUIRE_EXPRESS) {
require('express')
}
const http = require('http')
const server = http.createServer((req, res) => {
res.setHeader('Content-Type', 'text/plain')
res.end('hi')
})
const port = 3000
server.listen(port, () => {
console.log(`listening at http://localhost:${port}`)
})
diff --git a/index.js b/index.js
index c27fb71..52ae377 100644
--- a/index.js
+++ b/index.js
@@ -163,12 +163,16 @@ Client.prototype._pollConfig = function () {
}
const req = this._transport.get(opts, res => {
+ console.warn('XXX res.statusCode', res.statusCode)
res.on('error', err => {
// Not sure this event can ever be emitted, but just in case
+ console.warn('XXX in res.on(error), calling res.destroy(err)', err)
res.destroy(err)
})
- this._scheduleNextConfigPoll(getMaxAge(res))
+ // XXX
+ // this._scheduleNextConfigPoll(getMaxAge(res))
+ this._scheduleNextConfigPoll(3)
if (
res.statusCode === 304 || // No new config since last time
@@ -180,26 +184,36 @@ Client.prototype._pollConfig = function () {
}
streamToBuffer(res, (err, buf) => {
- if (err) return res.destroy(err)
+ if (err) {
+ console.warn('XXX streamToBuffer error, calling res.destroy', err)
+ return res.destroy(err)
+ }
if (res.statusCode === 200) {
// 200: New config available (or no config for the given service.name / service.environment)
const etag = res.headers.etag
if (etag) this._conf.lastConfigEtag = etag
+ // XXX Should not emit in the try/catch. If listener throws, then we catch it here.
try {
this.emit('config', JSON.parse(buf))
} catch (e) {
+ console.warn('XXX config parse error, calling res.destroy', e)
res.destroy(e)
}
} else {
- res.destroy(processConfigErrorResponse(res, buf))
+ // XXX
+ console.warn('XXX config res.statusCode !== 200, calling res.destroy', res.statusCode, )
+ var cfgErr = processConfigErrorResponse(res, buf)
+ res.destroy(cfgErr)
+ // res.destroy(processConfigErrorResponse(res, buf))
}
})
})
req.on('error', err => {
- this._scheduleNextConfigPoll()
+ console.warn('XXX req.on error', err)
+ this._scheduleNextConfigPoll(3)
this.emit('request-error', err)
})
}
@@ -213,6 +227,7 @@ Client.prototype._scheduleNextConfigPoll = function (seconds) {
this._configTimer = null
this._pollConfig()
}, seconds * 1000)
+ console.warn('XXX _scheduleNextConfigPoll in %ss', seconds)
this._configTimer.unref()
}
Running node foo.js -> no crash
Running PLZ_REQUIRE_EXPRESS=1 node foo.js -> crash
I.e., just require'ing "express" makes a difference here.
More specific version details for the record:
% npm ls express
play@1.0.1 /Users/trentm/tm/play
└── express@4.17.1
% npm ls elastic-apm-node
play@1.0.1 /Users/trentm/tm/play
└── elastic-apm-node@3.10.0 -> /Users/trentm/el/apm-agent-nodejs5
[11:20:35 trentm@pink:~/el/apm-agent-nodejs5 (git:tags/v3.10.0)]
% git log1 -1
* 68f27d3 - (HEAD, tag: v3.10.0) 3.10.0 (#1936) (6 weeks ago) <Trent Mick>
% git diff --staged
diff --git a/package.json b/package.json
index 6aef98b..da46f93 100644
--- a/package.json
+++ b/package.json
@@ -87,7 +87,7 @@
"console-log-level": "^1.4.1",
"cookie": "^0.4.0",
"core-util-is": "^1.0.2",
- "elastic-apm-http-client": "^9.4.2",
+ "elastic-apm-http-client": "file:../apm-nodejs-http-client3",
"end-of-stream": "^1.4.4",
"error-stack-parser": "^2.0.6",
"escape-string-regexp": "^4.0.0",
[11:21:34 trentm@pink:~/el/apm-nodejs-http-client3 (git:tags/v9.4.2)]
% git log1 -1
* 5a501f5 - (HEAD, tag: v9.4.2) 9.4.2 (3 months ago) <Trent Mick>
% npm ls express
play@1.0.1 /Users/trentm/tm/play
└── express@5.0.0-alpha.8
Working backwards from the crash point.
The crash is from an unhandled "error" event on a socket instance:
...
Emitted 'error' event on Socket instance at:
at emitErrorNT (internal/streams/destroy.js:92:8)
...
That "error" event is being emitted due to the agent's apm-server client calling res.destroy(err) after a non-200 response from polling the APM server for config here. IncomingMessage.destroy([error]) emits "error" on the socket.
So why only on the second poll to APM Server?
The apm-server client creates an http.Agent with keepAlive: true by default. The intent is to use that keep-alive agent for requests to the APM server. However, due to a bug in #53, the agent (set as this._agent) isn't set until after the request options for apm-server requests are set. This means that this._agent === undefined here when the request options are set. In other words, the keep-alive agent isn't getting used.
However, if client.config(...) is called (again) on the client after initial creation, then those request options will be created, and this time this._agent will be set.
There is one current case in the Node.js APM Agent where client.config() is called again: in agent.setFramework(), which is called automatically in these module instrumentations:
which is called automatically in these module instrumentations:
modules/hapi.js: agent.setFramework({ name: 'hapi', version, overwrite: false })
modules/restify.js: agent.setFramework({ name: 'restify', version, overwrite: false })
modules/koa.js: agent.setFramework({ name: 'koa', version, overwrite: false })
modules/express.js: agent.setFramework({ name: 'express', version, overwrite: false })
modules/fastify.js: agent.setFramework({ name: 'fastify', version, overwrite: false })
The result is that if any of those 5 modules is require'd in a user app, then the apm-server client will switch to using its keep-alive agent. The apm-server client always immediately polls for central config, so this switch to the keep-alive agent will only occur for the 2nd and subsequent polls.
So why isn't there an "error" event listener on that res.socket? And why only in node v12? Because of this "TODO" in node core's http client handling of keep-alive after the "end" of a response: https://github.com/nodejs/node/blob/v12.19.1/lib/_http_client.js#L625-L626
socket.removeListener('error', socketErrorListener);
socket.removeListener('data', socketOnData);
socket.removeListener('end', socketOnEnd);
// TODO(ronag): Between here and emitFreeNT the socket
// has no 'error' handler.
In other words, if you res.destroy(err) after the res "end" event in node v12 with a keep-alive agent, you'll be emitting an "error" event on a socket with no "error" event listener. That results in a crash (or a call to any uncaughtException handler). The following script ("bar.js") demonstrates this:
var http = require('http')
var agent = new http.Agent({keepAlive: true})
var opts = {
agent: agent,
hostname: 'www.google.com',
method: 'GET',
path: '/'
}
function makeReq() {
var req = http.get(opts, function (res) {
console.warn('res.socket "error" event listeners before end:', res.socket.listeners('error'))
res.on('data', function (chunk) { })
res.on('end', function () {
console.warn('res.on end')
console.warn('res.socket "error" event listeners after end:', res.socket.listeners('error'))
res.destroy(new Error('call the whole thing off'))
})
res.on('error', function (err) {
console.warn('res.on error', err)
})
})
req.on('error', function (err) {
console.warn('req.on error', err)
})
}
makeReq()
setTimeout(function () {}, 10000000) // don't exit right away
Run that "bar.js" with node v12:
% node bar.js
res.socket "error" event listeners before end: [ [Function: socketErrorListener] ]
res.on end
res.socket "error" event listeners after end: []
events.js:291
throw er; // Unhandled 'error' event
^
Error: call the whole thing off
at IncomingMessage.<anonymous> (/Users/trentm/tm/play/bar.js:17:21)
at IncomingMessage.emit (events.js:326:22)
at endReadableNT (_stream_readable.js:1223:12)
at processTicksAndRejections (internal/process/task_queues.js:84:21)
Emitted 'error' event on Socket instance at:
at emitErrorNT (internal/streams/destroy.js:92:8)
at emitErrorAndCloseNT (internal/streams/destroy.js:60:3)
at processTicksAndRejections (internal/process/task_queues.js:84:21)
Run it with NODE_DEBUG=http to see more of what is going on in the node core http client.
% NODE_DEBUG=http node bar.js
HTTP 64387: call onSocket 0 0
HTTP 64387: createConnection www.google.com:80: {
agent: Agent {
_events: [Object: null prototype] {
free: [Function],
newListener: [Function: maybeEnableKeylog]
},
_eventsCount: 2,
_maxListeners: undefined,
defaultPort: 80,
protocol: 'http:',
options: { keepAlive: true, path: null },
requests: {},
sockets: { 'www.google.com:80:': [] },
freeSockets: {},
keepAliveMsecs: 1000,
keepAlive: true,
maxSockets: Infinity,
maxFreeSockets: 256,
maxTotalSockets: Infinity,
totalSocketCount: 0,
[Symbol(kCapture)]: false
},
hostname: 'www.google.com',
method: 'GET',
path: null,
port: 80,
host: 'www.google.com',
keepAlive: true,
servername: 'www.google.com',
_agentKey: 'www.google.com:80:'
}
HTTP 64387: sockets www.google.com:80: 1 1
HTTP 64387: outgoing message end.
(node:64387) Warning: Setting the NODE_DEBUG environment variable to 'http' can expose sensitive data (such as passwords, tokens and authentication headers) in the resulting log.
HTTP 64387: AGENT incoming response!
res.socket "error" event listeners before end: [ [Function: socketErrorListener] ]
HTTP 64387: AGENT socket keep-alive
res.on end
res.socket "error" event listeners after end: []
HTTP 64387: CLIENT socket onFree
HTTP 64387: agent.on(free) www.google.com:80:
events.js:291
throw er; // Unhandled 'error' event
^
Error: call the whole thing off
at IncomingMessage.<anonymous> (/Users/trentm/tm/play/bar.js:17:21)
at IncomingMessage.emit (events.js:326:22)
at endReadableNT (_stream_readable.js:1223:12)
at processTicksAndRejections (internal/process/task_queues.js:84:21)
Emitted 'error' event on Socket instance at:
at emitErrorNT (internal/streams/destroy.js:92:8)
at emitErrorAndCloseNT (internal/streams/destroy.js:60:3)
at processTicksAndRejections (internal/process/task_queues.js:84:21)
I'll start a fix on Monday.
Reopening. The GH PR linking automatically closed this, but this won't actually be fixed until there is an elastic-apm-http-client v9.5.1 release and a release of this agent to require that version.
Fixed now with elastic-apm-node@3.12.1. Thanks all, esp. @grahamrhay, for the reporting and persistence.
Describe the bug
We have a problem with our ES cluster, resulting in this error:
which is obviously our problem. But the error is unhandled and taking down the process. Is that the expected behaviour? I can get over missing apm data, but having my service restart constantly is less than ideal. I don't want my uptime constrained to that of the APM server.
To Reproduce
Ruin your ES
Expected behavior
Is there any way to handle this kind of error, and e.g. just log it? I found this, but I don't want to trap all exceptions from the process (we already have a handler for that, to log & exit). Can I do something like:
Am I missing something obvious? Thanks.
Environment (please complete the following information)
How are you starting the agent? (please tick one of the boxes)
- Calling agent.start() directly (e.g. require('elastic-apm-node').start(...))
- Requiring elastic-apm-node/start from within the source code
- Starting node with -r elastic-apm-node/start
Additional context
Using env vars for config