Broken gunicorn implementation

balboah commented 11 years ago

The commit 2e251cecd4b002381cb6b65f1759de28b46523d3 seem to introduce problems for gunicorn. If I run gunicorn with more than 1 worker, I get bad file descriptor sockets and everything seem to fall apart.

Tested with gunicorn 0.13 and 0.17.

Running the commit before that one seems to work fine.

sontek commented 11 years ago

@balboah I had issues with that commit as well but I thought we had fixed it since then. I'll do some testing.

sontek commented 11 years ago

Yeah, multiple workers are broken.

sunbit commented 11 years ago

Any updates on this? If anyone has a clue of what modification could have caused this, i can take a look at it...

fcurella commented 11 years ago

just pinging to say I'm having the same issue here.

@sontek : which one are the commits that try to fix this issue?

bvallant commented 11 years ago

Seem to have the same problems as well... Websockets sometimes work/sometimes not... Generally spoken everything's quite buggy. I'd also like to investigate on this but don't know where to start?

Generally spoken I'd like to ask how do you generally recommend to run it in production?

stevesw commented 11 years ago

I think I have tracked down the issue to these two added lines in socketio.server.SocketIOServer.get_socket():

2e251cec socketio/server.py       (Alexandre Bourget       2012-11-15 17:29:02 -0500 131)         if sessid and not socket:
2e251cec socketio/server.py       (Alexandre Bourget       2012-11-15 17:29:02 -0500 132)             return None  # you ask for a session that doesn't exist!

To elaborate, it looks like each worker process is given its own SocketIOServer, and each SocketIOServer maintains its own dictionary (self.sockets) mapping sessids to Sockets.

During the connection process, two separate requests are made. First, a request is made to: /socket.io/1/ (which I guess is the handshake?), and then this is followed up by another call to something like /socket.io/1/websocket/419838463361 (which actually establishes the socket).

If one worker handles the first request, and then another worker handles the second request, the second request is going to fail because it won't be able to find the sessid saved by the first worker on its instance of SocketIOServer.

The reason it works sporadically is because sometimes you would coincidentally have the same worker handle both requests, so it will be able to find the sessid it saved. Of course, the more workers you have, the less likely that is to occur, which explains the behavior reported in issue #125.

Commenting out those 2 lines of code resolves the issue, but I'm not familiar enough with the project to know whether that's a suitable long term solution. (It will end up creating a duplicate Socket on the second worker. I haven't seen where they are being cleaned up but I guess it could cause a leak or be problematic in some other way?)

stevesw commented 11 years ago

After reviewing the code in socketio.handler.SocketIOHandler._do_handshake and the socket.io spec I don't see any reason that the handshake request needs to actually create a Socket, the only thing it uses it for is the socket.sessid which is a randomly generated number. It seems like you could just directly generate that random number for the handshake, and wait to create the Socket until the client makes the follow up 'transport connection' request.

I think that would resolve the connectivity issue for the websocket transport when running multiple workers, but as related issue #112 mentions, it will still be problematic for the polling based transports.

abourget commented 11 years ago

Please read invitation to Wednesday September 18th's sprint: https://groups.google.com/forum/#!topic/gevent-socketio/2OIRKA8M2uE

arnuschky commented 11 years ago

We are facing the same issue. Is there a solution yet to this problem? Has the workaround proven to be sustainable? I am unwilling just to comment some lines on the production machine...

abourget commented 11 years ago

Commenting that out will just mean you don't care if the socket was present before. If you don't save anything in the session.. then it might not change anything for your use case, but it breaks statefullness. It is quite tricky to handle socket.io requests on two workers. Some messages could v be pending on worker 1 while you connect to worker 2.. potentially relaunching background jobs, etc.

To keep statefullness, ideally, one client should always connect to the same worker. Even with front load balancers, the same path through reverse proxies should be kept for the duration of the socketio connection.

It involves devops stuff, amd I'm not sure exactly which part of that problen should be dealt with directly in gevent-socketio.

Any suggestions ?

arnuschky commented 11 years ago

Our application is not stateful, so no problems there. I installed the latest version of gevent-socketio and applied the patch. Unfortunately, the patch made no difference at all.

Our application (it's a game) is pretty simple: the website polls data on two different paths regularly (with different frequency). The server sends changes that occur on the server side of things to all connected clients. Up to that point, we did not notice any problems (but maybe they were already there).

We now added a simple chat functionality. The client sends a message to the server, which broadcasts it to all clients (sender included). The server receives the message from the client without problems, but the broadcast arrives at the clients only at about a third of the cases.

I am not even sure if our problem is related to the one outlined in this (and the other) issues, but we noticed that it works 100% when we reduce the gunicorn workers to 1.

@abourget Let me know if I can help/test in any way. I can't really suggest anything for the solution of this as we are only users and not at all familiar with the architecture of gevent.

NB: We were using the pip install up to now, which did not include the line in question. I installed the latest git, but I am not sure if this is advisable. Which version should we use on a production system?

abourget commented 11 years ago

About the chat issue:

see, having 2 or more workers, means 2 or more processes. Each process is independant, holds its own Python objects. The socket.io lib keeps a list of open sockets within it's process.. so if it happens 2 users are connected to this process, then "broadcasting" will send messages to those connected locally.. it doesn't know about users connected to another worker, nothing is shared between them.

In order to build something that would broadcast to all users available, you'd need an external system, like a Redis PubSub, or 0mq or RabbitMQ kind of thing.. and every worker would connect to that central element, subscribe, and publish. It would be the glue between the isolated workers.

Does that make sense ?

arnuschky commented 11 years ago

Ok, that makes a lot of sense. One question though: the sender doesn't see his own message usually. Does that just mean that he switched worker thread in the meantime?

Thanks for the replies and sorry that this went a bit OT.

abourget commented 11 years ago

Depends on which Mixin you're using. There are two broadcast functions in the BroadcastMixin.. one that echoes back to the user, one that doesn't.

arnuschky commented 11 years ago

Thanks for your comments. The nondeterministic behavior was definitely due to some instance mixup and a lack of basic understanding. We rewrote the code using redis pub/sub and everything works as expected!

abourget commented 11 years ago

Great to hear that :)

On Sun, Oct 13, 2013 at 6:15 PM, arnuschky notifications@github.com wrote:

Thanks for your comments. The nondeterministic behavior was definitely due to some instance mixup and a lack of basic understanding. We rewrote the code using redis pub/sub and everything works as expected!

— Reply to this email directly or view it on GitHubhttps://github.com/abourget/gevent-socketio/issues/132#issuecomment-26228437 .

jensenw commented 10 years ago

hv same problem here , already try using redis but still buggy , sometimes it can send message to browser but sometimes it can't . Any updates ?

ghost commented 10 years ago

Going back on the statelessness and sessions, am I correct to assume that the request will be sent only once for socket communication purposes? I am currently using pyramid with a session backed on redis. Is that be a good way to maintain a state – I am using the request.session dict for authentication purposes.

abourget commented 10 years ago

Request.session will not be present on other nodes/processes unless also backed by something like redis. Beaker supports it so you could use it there.

For each socket.io connection, there is only one pyramid request.. the rest is out of bound from a pyramid pt of view.

Alexandre On 2013-12-02 12:54 AM, "zman0225" notifications@github.com wrote:

Going back on the statelessness and sessions, am I correct to assume that the request will be sent only once for socket communication purposes? I am currently using pyramid with a session backed on redis. Is that be a good way to maintain a state – I am using the request.session dict for authentication purposes.

— Reply to this email directly or view it on GitHubhttps://github.com/abourget/gevent-socketio/issues/132#issuecomment-29596153 .

jrasanen commented 10 years ago

I have the same problem, so is Beaker a decent solution?

abourget / gevent-socketio

Broken gunicorn implementation #132