balderdashy / sails

Realtime MVC Framework for Node.js
https://sailsjs.com
MIT License

sails.js /socket.io/ URLs behind nginx leads to lots of 'upstream timed out' #3461

Closed serjrd closed 8 years ago

serjrd commented 8 years ago

I'm running several sails.js instances behind an nginx proxy with sticky sessions. And I keep seeing a lot of messages in my nginx error.log regarding sails.js /socket.io/ URLs timing out:

2016/01/04 20:55:15 [error] 12106#12106: *402088 upstream timed out (110: Connection timed out) while reading response header from upstream, client: x.x.x.x, server: example.com, request: "GET /socket.io/?__sails_io_sdk_version=0.11.0&__sails_io_sdk_platform=browser&__sails_io_sdk_language=javascript&EIO=3&transport=polling&t=1451930055065-4&sid=jvekCYDAcFfu0PLdAAL6 HTTP/1.1", upstream: "http://127.0.0.1:3001/socket.io/?__sails_io_sdk_version=0.11.0&__sails_io_sdk_platform=browser&__sails_io_sdk_language=javascript&EIO=3&transport=polling&t=1451930055065-4&sid=jvekCYDAcFfu0PLdAAL6", host: "example.com", referrer: "https://example.com/languageExchange/chat/63934"

2016/01/04 20:55:17 [error] 12105#12105: *402482 upstream prematurely closed connection while reading response header from upstream, client: y.y.y.y, server: example.com, request: "GET /socket.io/?__sails_io_sdk_version=0.11.0&__sails_io_sdk_platform=browser&__sails_io_sdk_language=javascript&EIO=3&transport=websocket&sid=QnAe1jiKEHgj-zlKAAKu HTTP/1.1", upstream: "http://127.0.0.1:3001/socket.io/?__sails_io_sdk_version=0.11.0&__sails_io_sdk_platform=browser&__sails_io_sdk_language=javascript&EIO=3&transport=websocket&sid=QnAe1jiKEHgj-zlKAAKu", host: "example.com"

It doesn't happen for every client, but the number of such messages is significant, and sails.js does not show any relevant errors.

How should I investigate the nature of these issues?

Here's what I've tried so far (and it didn't help):

Here are the relevant config files:

// config/sockets.js (excerpt)
adapter: 'socket.io-redis'
location ^~ /socket.io/ {
        proxy_pass http://sails;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $http_host;
        proxy_no_cache true;
        proxy_cache_bypass true;
        proxy_redirect off;
        proxy_intercept_errors off;
}
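
Note that `$connection_upgrade`, referenced in the location block above, is not a built-in nginx variable; it has to be defined with a `map` block in the `http` context. A minimal sketch of the standard nginx WebSocket-proxying boilerplate:

```
# In the http {} context: derive the Connection header value from the
# client's Upgrade header, so WebSocket upgrade requests are passed
# through while plain HTTP requests get "close".
map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      close;
}
```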
serjrd commented 8 years ago

I even get 'no live upstreams' messages while sails is up and running:

2016/01/04 22:32:33 [error] 12107#12107: *437054 no live upstreams while connecting to upstream, client: z.z.z.z, server: example.com, request: "GET /socket.io/?__sails_io_sdk_version=0.11.0&__sails_io_sdk_platform=browser&__sails_io_sdk_language=javascript&EIO=3&transport=websocket&sid=8G2TfOsNOJMYHZOjAAD3 HTTP/1.1", upstream: "http://sails/socket.io/?__sails_io_sdk_version=0.11.0&__sails_io_sdk_platform=browser&__sails_io_sdk_language=javascript&EIO=3&transport=websocket&sid=8G2TfOsNOJMYHZOjAAD3", host: "example.com"
mikermcneil commented 8 years ago

What version of Sails and Node are you using?

I've never seen this error before personally. My first thought is that this has to do with Skipper upstreams, but that doesn't seem to be the case. Instead, it looks like this is an error that nginx logs when connections are terminated before a complete response is sent. This can happen for any number of reasons in userland code, such as streaming responses being interrupted, or responses being sent twice.

Hope that helps!

serjrd commented 8 years ago

sails: 0.11.3, node: 4.2.3

At some point I thought it was a matter of some OS limits (like max connections or max open files), but I can't see any signs of that either: neither my kern.log nor my syslog has any warning messages of this kind.

Everything I've read so far about nginx suggests that this kind of error is actually a sign that the backend fails to respond properly.

But I have no idea how to further investigate what's going on. Where would be a good place to put some extra logging code to see what sails thinks about it?

But I take it that if nginx reports that there are no upstreams online, it does not route the request to sails at all. Which is really weird, given that I capture every log message from the sails instances with pm2, and they were definitely up and running at that moment.

I simply don't know where to go next.

japel commented 8 years ago

@serjrd do you have any keepalive config enabled in the upstream definition?

serjrd commented 8 years ago

@japel in my current configuration I use the defaults, but I've tried setting 'keepalive' for the upstream as well

@mikermcneil I think I've found a pattern. The main issue arises with upstream timed out (110: Connection timed out) while reading response header from upstream. I don't know what exactly causes it. One time I got this in my logs:

sails-3004-3 (out): 2016-01-06 13:54:30: Socket disconnected, but session could not be loaded to pass to configured disconnect handler: `sails.config.sockets.onDisconnect()`.  Will pass a fake, empty session as argument to lifecycle callback.  Details:
sails-3004-3 (out):  Error: Could not parse session id from cookie of connecting socket, and then failed again when trying to use a generated cookie. Something has probably gone wrong with your session store configuration.
sails-3004-3 (out):   at _createError (/home/serjrd/sails/node_modules/sails-hook-sockets/lib/load-session-from-socket.js:44:17)
sails-3004-3 (out):   at loadSessionFromSocket (/home/serjrd/sails/node_modules/sails-hook-sockets/lib/load-session-from-socket.js:47:7)
sails-3004-3 (out):   at Socket.onSocketDisconnect (/home/serjrd/sails/node_modules/sails-hook-sockets/lib/on-connect.js:43:7)
sails-3004-3 (out):   at emitOne (events.js:77:13)
sails-3004-3 (out):   at Socket.emit (events.js:169:7)
sails-3004-3 (out):   at Socket.emit (/home/serjrd/sails/node_modules/socket.io/lib/socket.js:129:10)
sails-3004-3 (out):   at Socket.onclose (/home/serjrd/sails/node_modules/socket.io/lib/socket.js:418:8)
sails-3004-3 (out):   at Client.onclose (/home/serjrd/sails/node_modules/socket.io/lib/client.js:230:12)
sails-3004-3 (out):   at emitTwo (events.js:92:20)
sails-3004-3 (out):   at Socket.emit (events.js:172:7)
sails-3004-3 (out):   at Socket.onClose (/home/serjrd/sails/node_modules/engine.io/lib/socket.js:248:10)
sails-3004-3 (out):   at Socket.onError (/home/serjrd/sails/node_modules/engine.io/lib/socket.js:117:8)
sails-3004-3 (out):   at XHR.g (events.js:260:16)
sails-3004-3 (out):   at emitOne (events.js:77:13)
sails-3004-3 (out):   at XHR.emit (events.js:169:7)
sails-3004-3 (out):   at XHR.Transport.onError (/home/serjrd/sails/node_modules/engine.io/lib/transport.js:77:10)
sails-3004-3 (out):   at IncomingMessage.onClose (/home/serjrd/sails/node_modules/engine.io/lib/transports/polling.js:85:10)
sails-3004-3 (out):   at emitNone (events.js:67:13)
sails-3004-3 (out):   at IncomingMessage.emit (events.js:166:7)
sails-3004-3 (out):   at abortIncoming (_http_server.js:281:11)
sails-3004-3 (out):   at Socket.socketOnEnd (_http_server.js:425:7)
sails-3004-3 (out):   at emitNone (events.js:72:20)
sails-3004-3 (out):   at Socket.emit (events.js:166:7)
sails-3004-3 (out):   at endReadableNT (_stream_readable.js:905:12)
sails-3004-3 (out):   at doNTCallback2 (node.js:441:9)
sails-3004-3 (out):   at process._tickDomainCallback (node.js:396:17)
sails-3004-3 (out):  { [Error: Could not parse session id from cookie of connecting socket, and then failed again when trying to use a generated cookie. Something has probably gone wrong with your session store configuration.] code: 'E_SESSION' }
sails-3004-3 (out): 2016-01-06 13:54:30: Socket HkUQAVG888fX-njZAAS6 disconnected

But most of the time it doesn't show anything wrong in sails logs. Just that the socket disconnected.

When this happens, it probably leads to nginx marking this instance as 'down' and trying to route requests to other instances. When a request gets sent to the 'wrong' instances (not the one it originally connected to), a bunch of these error messages show up: upstream prematurely closed connection while reading response header from upstream. One for every instance. Again, nothing gets added to the sails logs.

After every instance fails to respond correctly, nginx says no live upstreams while connecting to upstream, and after that things are OK for some time.

I ran through my afterDisconnect code, but it seems really straightforward and does call the cb() function.

What else can I do to figure out if it's sails code or mine?

mikermcneil commented 8 years ago

Again, it's hard to say: it sounds like this could also be coming from a route that sends a response twice.

What @japel said sounds promising: do you see this log with nginx's equivalent of sticky sessions enabled? Alternately, have you tried disabling the long-polling transport for socket.io? (See part 2 of the sailsjs Platzi course.) Another troubleshooting step is disabling the sockets hook altogether to isolate the problem (honestly it makes sense to start there, but that also means disabling any parts of your app code that use sockets).

My hunch is that disabling the long polling transport is the place to start.
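
For anyone who wants to try that, here is a sketch of disabling long polling, assuming the `transports` setting of sails-hook-sockets; the matching client-side setting is noted in the comments:

```javascript
// config/sockets.js (sketch): restrict socket.io to the websocket
// transport, skipping long polling entirely.
module.exports.sockets = {

  // The browser client must use the same transport list, e.g. set
  //   io.sails.transports = ['websocket'];
  // before sails.io.js connects.
  transports: ['websocket']
};
```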

mikermcneil commented 8 years ago

@serjrd please keep us posted-- if you continue to see these error messages and can't pin it down to a userland bug after trying the approaches above, definitely open another issue and we'll help you get to the bottom of it.

sailsbot commented 8 years ago

Thanks for posting, @serjrd. I'm a repo bot-- nice to meet you!

The issue queue in this repo is for verified bugs with documented features. Unfortunately, we can't leave some other types of issues marked as "open". This includes bug reports about undocumented features or unofficial plugins, as well as feature requests, questions, and commentary about the core framework. Please review our contribution guide for details.

If you're here looking for help and can't find it, visit StackOverflow, our Google Group or our chat room. But please do not let this disrupt future conversation in this issue! I am only marking it as "closed" to keep organized.

Thanks so much for your help!

If I am mistaken, and this issue is about a critical bug with a documented feature, please accept my sincerest apologies-- I am but a humble bot working to make GitHub a better place. If you open a new issue, a member of the core team will take a look ASAP.

serjrd commented 8 years ago

@mikermcneil Yup, I do keep seeing those errors. And unfortunately I still don't know what causes it and how to reproduce it.

Here's my current understanding of what's going on. Please correct me if you think I'm wrong.

Requests like GET /socket.io/?__sails_io_sdk_version=0.11.0&__sails_io_sdk_platform=browser&__sails_io_sdk_language=javascript&EIO=3&transport=polling&t=1452263236852-7&sid=LsKpaMRbRBReCuUuAAFn HTTP/1.1 are handled directly by a socket.io server, so there's no userland code that can potentially mess it up.

You suggested disabling polling, but I don't think it'll help. For one thing, I saw the same error messages with transport=websocket, and afaik the socket.io people advise keeping polling enabled to ensure reliable client connections. Plus, since I can't really experiment much on production and can't reproduce this locally, I decided not to go this way.

I added max_fails to prevent nginx marking upstreams as down when it tries to send a request to a 'wrong' instance. So my current upstream section looks like this:

upstream sails {
    ip_hash;

    keepalive 64;

    server 127.0.0.1:3001 max_fails=3 fail_timeout=10s;
    server 127.0.0.1:3002 max_fails=3 fail_timeout=10s;
    server 127.0.0.1:3003 max_fails=3 fail_timeout=10s;
    server 127.0.0.1:3004 max_fails=3 fail_timeout=10s;
}

It did help reduce the number of errors, and I haven't heard any actual complaints from users yet. Perhaps I'll try moving to haproxy instead of nginx, but I think it's not really an nginx issue.

julianlam commented 8 years ago

Hi there, a bit late of a response, but we're now seeing the exact same constellation of symptoms on a NodeBB install that uses nginx as a load balancer.

Our working theory is that socket.io returns a non-200 error code for something that happens only occasionally (e.g. a handshake is initiated but the client exits immediately, causing an "upstream prematurely closed" error). Nginx interprets this as a failure and marks the instance as down (especially with default settings, since max_fails defaults to 1).

If it cycles through all backends (which can happen on busy sites), then we get a "no live upstreams" error.

ImTheDeveloper commented 8 years ago

@julianlam Based on your comment, I assume this is nothing terminal, really, and more about how nginx handles events happening in the socket world. I'm wondering if I can change some setting to remove these from my logs, as it's getting really hard to find genuine problems when there is so much spam.

cburatto commented 7 years ago

I have the same issue and have opened a Serverfault question, including NGINX configuration and a comparison of the error logs and Sails verbose logs, along with messages from the JS console.

http://serverfault.com/questions/829100/socket-io-with-sails-js-node-js-and-nginx-on-ssl-bad-gateway

julianlam commented 7 years ago

The latest socket.io has resolved this problem.

cburatto commented 7 years ago

@julianlam , I am using Sails 0.12.3. Package.json says they are using socket.io-client 1.4.3, so I assume the server-side is the latest as well. How can I check and upgrade if that's the case?

update

It seems the server is built on websocket-driver and only the client is socket.io; can anybody attest to that? If so, any information on potential problems while using NGINX?

update2

Let me add more information about websocket-driver. I am developing on Windows, and the local sails package has no information about websocket-driver. But when I installed the app on Ubuntu 16.04, package.json showed this section:

"_phantomChildren": {
...
    "websocket-driver": "0.6.5",
    "yargs": "3.10.0"
  },

I am using npm 3, by the way.

cburatto commented 7 years ago

So the issue was a bad configuration in the NGINX sites-available conf file.

location /socket.io/ {
      proxy_pass http://sails/;
...
}

should be

location /socket.io/ {
      proxy_pass http://sails/socket.io/;
...
}

Pretty basic stuff: because the proxy_pass here includes a URI (the trailing "/"), nginx replaces the matched location prefix with it, so "/socket.io/" was being stripped before the request reached the upstream. You need to make sure the socket requests are forwarded to the exact socket endpoint.
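
For reference, a sketch of how the URI part of proxy_pass changes the forwarded path (using the `sails` upstream name from this thread):

```
location /socket.io/ {
    # proxy_pass http://sails;            -> upstream sees /socket.io/... (URI passed unchanged)
    # proxy_pass http://sails/;           -> upstream sees /...           (matched prefix replaced by "/")
    proxy_pass http://sails/socket.io/;   # upstream sees /socket.io/...  (prefix replaced explicitly)
}
```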

ImTheDeveloper commented 7 years ago

I read that this may be due to a sticky-sessions issue. Nginx (free) does not support sticky sessions, so long polling will result in a timeout error. If you are using Nginx Plus, you can enable sticky sessions and the issue will not be apparent. As suggested by @mikermcneil, I suspect dropping polling and going for basic websocket support when using free nginx will alleviate the issue. I will make the changes to my server config tonight and report back on whether this alleviates the issue.

ImTheDeveloper commented 7 years ago

I tried switching this to remove polling in both production and development. Both actually caused more of the same issue to occur.