django / channels

Developer-friendly asynchrony for Django
https://channels.readthedocs.io

Newly connected channels may fail to receive messages from group_send() #1244

Open DrChai opened 5 years ago

DrChai commented 5 years ago

It looks like some channels occasionally won't receive messages sent with group_send(), even though they all successfully connected to our server. Disconnecting and reconnecting may resolve the issue. Here is a sample:

Two clients connected via websocket and joined the same group, partner_orders_1531:

127.0.0.1:6379[1]> zrange asgi::group:partner_orders_1531 0 -1 withscores
1) "specific.mrOZIufu!GAaMMeISqdry"
2) "1549585254.4294083"
3) "specific.qUSiCEQD!TGlphabkczyd"
4) "1549585307.3211055"

Then we call async_to_sync(channel_layer.group_send):

from asgiref.sync import async_to_sync
from channels.layers import get_channel_layer

channel_layer = get_channel_layer()
async_to_sync(channel_layer.group_send)(partner_order_group_name, order_data)

but only the channel named specific.qUSiCEQD!TGlphabkczyd received the message and triggered:

class PartnerOrderConsumer(WebsocketConsumer):
    ...
    def order_receive(self, event):
        self.send(text_data=event["text"])
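
For reference, the group membership shown in the zrange output above is set up in connect() roughly like this (a simplified sketch; the URL kwarg and group-name construction are illustrative, not the exact code):

from asgiref.sync import async_to_sync
from channels.generic.websocket import WebsocketConsumer

class PartnerOrderConsumer(WebsocketConsumer):
    def connect(self):
        # Builds e.g. "partner_orders_1531"; the "partner_id" kwarg name is illustrative
        self.group_name = "partner_orders_%s" % self.scope["url_route"]["kwargs"]["partner_id"]
        async_to_sync(self.channel_layer.group_add)(self.group_name, self.channel_name)
        self.accept()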

Related redis log: (screenshot attached to the original issue)

Pip & environment: Python 3.6.6, channels==2.1.5, channels-redis==2.3.1, daphne==2.2.3, Django==1.11.10. The server is hosted on Ubuntu 16.04.5 LTS with 2 vCPUs/4 GB memory on DigitalOcean. We have two daphne instances and use nginx for load balancing.

xuehaoze commented 5 years ago

same issue here, please help, thanks!

bastbnl commented 5 years ago

I noticed the same behaviour a couple of weeks ago. My setup was python manage.py runserver behind nginx for SSL offloading. It did not appear to happen when the websocket connection was established to the runserver process without nginx in front.

It did not seem to happen when the software was running via daphne, but I'll have to check in more detail to be sure.

qeternity commented 5 years ago

We experience this issue as well. At first we thought it was an issue with the channels-redis backend, so we moved over to a rabbitmq backend, but this has not solved the problem. It seems that daphne workers time out against their backend and stop receiving subscribed group_sends.

carltongibson commented 5 years ago

It seems that daphne workers timeout to their backend and stop receiving subscribed group_sends

So, can we pin that down to a specific issue with Daphne? (If so a ticket there with a reproduce would be wonderful.) (If not some kind of reproduce would be helpful...)

carltongibson commented 5 years ago

...behind nginx for SSL offloading.

So is it an nginx config issue?

carltongibson commented 5 years ago

This isn't really actionable as is: there's nothing pointing to channels. Likely the issue is elsewhere in the stack.

I'm going to close for now. If someone can put together a reproduce showing channels (or Daphne) is at fault, I'm happy to re-open.

qeternity commented 5 years ago

@carltongibson I'm not sure why you'd close it. Surely the fact that there's nothing actionable means that it warrants further investigation. It's not an NGINX issue, because simply issuing a reload (kill -HUP) to daphne immediately fixes the problem. Also, anything that isn't group_send based (e.g. application-level ping-pong) still works just fine.

carltongibson commented 5 years ago

It's not an addressable issue as it stands. Is it DigitalOcean? Is it Nginx? Is it Daphne? Is it Channels? If you read the comments here, there is nothing in common... so pending a reproduce there's nothing to do. We don't have capacity to chase down phantom issues with no details. needsinfo is a legitimate closing reason.

If someone can put together a reproduce showing channels (or Daphne) is at fault, I'm happy to re-open.

qeternity commented 5 years ago

If you read the comments here, there is nothing in common

Channels/Daphne/group_send are the common denominator.

carltongibson commented 5 years ago

It did not seem to appear when the software is running via daphne

As I said, I'm very happy to investigate if anyone can provide a reproduce.

qeternity commented 5 years ago

So it would seem I owe @carltongibson an apology. After further debugging, this appears to be an odd behavior with nginx buffering.

@DrChai @xuehaoze try adding "proxy_buffering off;" to the nginx conf block that handles websocket proxying?

EDIT: see below, this does not fix the problem.

carltongibson commented 5 years ago

@qeternity Thanks for the follow-up and the pointer. Glad you made progress!

xuehaoze commented 5 years ago

@qeternity @carltongibson Thanks for your reply, we will give it a try!

qeternity commented 5 years ago

After running this for a few days in production: this does NOT resolve the issue. group_send is still broken, but everything else works. Restarting daphne is the only solution. I have tried both redis and rabbitmq backends, so this suggests it's internal to daphne/channels.

mkoponen commented 5 years ago

@qeternity I have to upgrade an old project to this right now, and this issue is going to be a problem for me, as I'm using this same setup. Now, here's something from layers.py that concerns me, in async def group_send(self, group, message):

for channel in self.groups.get(group, set()):
    try:
        await self.send(channel, message)
    except ChannelFull:
        pass

There is no logging in self.send if the channel is full, either. It just raises the exception, which is then ignored by this code. If this condition happens, you'll never know it.

So, my first thought is that the group messages are never completely cleaned up from the channel; possibly only in some specific use case, or possibly always. They'd get handled by the worker, but not removed from the channel. This would explain why it takes time for the scenario to happen, because it takes time for the channel to fill up.

So, if you already have a setup where you can make this bug happen, could you add a debug line in place of that final "pass", or perhaps just re-raise the exception? If we know this is the problem, the search for the bug should be much easier. Also, if this is the reason, I'd expect raising the frequency of group_sends in your test code to make it happen faster, unless the problem is caused by a group_send happening simultaneously with some other method call, and only then leaving the channel with one slot less space in it.
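
For example, a minimal variant of that loop with a log line in place of the bare pass (just a sketch against the in-memory layer code quoted above) could look like:

import logging

from channels.exceptions import ChannelFull

logger = logging.getLogger(__name__)

async def group_send(self, group, message):
    # Same fan-out loop as above, but noisy when a channel is at capacity.
    for channel in self.groups.get(group, set()):
        try:
            await self.send(channel, message)
        except ChannelFull:
            logger.warning("Dropping group message: channel %s is full", channel)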

This is all pure guesswork. This is the first time I have ever even looked at this project's code.

carltongibson commented 5 years ago

Hi @mkoponen. That's interesting. (Gives an avenue to work on at least...) I have some time to work on Channels at DjangoCon Europe next week, so I'll re-open to investigate. (If anyone wants to jump in before then, please do! 🙂)

qeternity commented 5 years ago

@mkoponen @carltongibson I'll debug this week, but we've used two separate backends (redis and rabbitmq) and neither has an implementation like that. That's for the InMemory layer, which isn't what we're using.

mkoponen commented 5 years ago

Ah, so it is. My next question, then: does this condition ever return true in the problem case?

if redis.call('LLEN', KEYS[i]) < tonumber(ARGV[i + #KEYS]) then

I don't understand all the concepts well enough yet to see what this line means, but I do notice that again the method would just fail silently if it didn't.

The first goal should be to figure out whether, when the problem starts, the failure is in pushing the message into Redis or in receiving it from Redis.

EDIT: Another suggestion: could there be a scenario in which the two awaits at the top, which remove expired channels, simply await forever?

EDIT2: I see. This is from the "simple" send() method:

# Check the length of the list before send
# This can allow the list to leak slightly over capacity, but that's fine.
if await connection.llen(channel_key) >= self.get_capacity(channel):
    raise ChannelFull()

Capacity is also given as an argument to the Lua code, so it is clearly trying to do the same capacity check, only in Lua, whereas send() does it in Python. But I'm back to my original hypothesis: the channel gets full, and send starts failing silently.

EDIT3: Further hypothesis: group messages get cleaned up only through expiry, never through successful reception and handling of the message. Users who send them very infrequently never notice the problem, because the slow expiry is enough to leave room for the new message whenever they need it. Those who send them frequently hit capacity.
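
One way to test this hypothesis would be to inspect the channel-layer backlogs directly in Redis while the problem is occurring (a sketch using redis-py; the key pattern is an assumption, since the exact naming depends on the channels_redis version and configured prefix):

import redis

# db 1, matching the redis-cli output at the top of the thread
r = redis.Redis(host="localhost", port=6379, db=1)

# Per-channel backlogs are stored as Redis lists; print their lengths.
# A backlog stuck at capacity (100 by default) would support the
# "channel is full, send fails silently" hypothesis.
for key in r.scan_iter("asgi*"):
    if r.type(key) == b"list":
        print(key, r.llen(key))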

xiaoqiao99 commented 5 years ago

@qeternity Excuse me, has this been solved now?

qeternity commented 5 years ago

This problem is still not resolved. We are using the rabbitmq backend. Our server locked up again, and when I went into the rabbitmq admin panel I saw that the queues all had 100 messages waiting. This is definitely the problem. It appears that the ASGI websocket workers disconnect from the queue and stop receiving messages, which then pile up to the channel limit.

carltongibson commented 5 years ago

We are using the rabbitmq backend

So it's not redis... — still looking for a decent reproduce.

I need to get DRF v3.10 out and then I'll have some time to dig into this and other channels issues. Following up @mkoponen's lead seems most promising currently. If anyone has time to look, that'd be great.

qeternity commented 5 years ago

Indeed, as I wrote above, we switched backends attempting to mitigate this but it had no impact.

maurice-g commented 5 years ago

Was anyone able to make any progress on this? Or find a way to reproduce it reliably? Having the same issue.

Kavarakis commented 5 years ago

I'm having the same issue. I've tried creating a different test app with the same environment specs and everything works, but my current app fails to send the correct messages to redis via channel layers.

Any update?

carltongibson commented 5 years ago

This still needs digging into, I'm afraid.

I'm currently looking into issues around AsyncHttpConsumer (in the time I have 😝). Maybe this will turn out to be related to that. If not this is on my list for afterwards.

If someone could pin it down to an example project that reliably reproduces the issue, that would be more than half way to a solution.

Kavarakis commented 5 years ago

Hi guys, I've managed to fix my issue. It was rather stupid: I needed to open my WebSocket connection from the frontend app. If I don't open the WS connection, so Django never receives the connection notification and a successful handshake, then channels and Redis cannot send any messages via channel layers (tested and confirmed in my case). I don't see the logic behind it, but this removed my problem.

I hope this will help someone.

Cheers!

maurice-g commented 4 years ago

Seems like I've been looking for the culprit in the wrong place as well, at least for some cases.

For the Googlers coming here: the issue was that there wasn't actually a websocket connection anymore. We listened for close and error events on the websocket connection in the frontend and reacted by automatically trying to reconnect, so I assumed there would always be an active WS connection to django channels. It turns out that in some cases (such as just disconnecting your wifi) no close or error event is thrown (tested in Chrome 77) and the browser will happily report the connection as active forever. The backend closes the connection from its side after ~50s (no idea where that number comes from: Channels / Daphne / load balancer?) and therefore sends no further notifications to that channel. The solution for us was implementing manual heartbeat messages from the frontend to the backend every few seconds. Contrary to my understanding of websockets, this is not handled by the browser (whatever use the ping/pong frames have then...).
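
A minimal sketch of what the server side of such a heartbeat could look like (assuming a JSON text protocol; the "ping"/"pong" message types are arbitrary names, not a channels feature):

import json

from channels.generic.websocket import WebsocketConsumer

class HeartbeatConsumer(WebsocketConsumer):
    def receive(self, text_data=None, bytes_data=None):
        message = json.loads(text_data)
        if message.get("type") == "ping":
            # Answer immediately; if the frontend doesn't see a "pong" within
            # its timeout, it tears the socket down and reconnects.
            self.send(text_data=json.dumps({"type": "pong"}))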

simkimsia commented 4 years ago

@maurice-g Sorry, I'm a bit new to this whole websockets thing. How do you implement a manual heartbeat from frontend to backend?

I am using the following:

celery[redis]==4.3.0
channels==2.2.0  
channels-redis==2.4.0

tarikki commented 4 years ago

I faced this issue and found two culprits so far:

  1. Client silently disconnects. If you're using django-channels to communicate with a websocket running in a browser, this is the most likely cause. I noticed that if my users lock their device and come back to my web app after a certain amount of time, the connection is no longer there, but it won't have thrown any errors either. The solution is to do what @maurice-g suggested and have a manual heartbeat between the browser and the backend.

  2. group_send in channels_redis silently failing if a channel is over capacity. Following the lead from @mkoponen, I noticed that when you use group_send() with channels_redis as the backend, there is the potential for group_send() to fail silently. Before sending the current message, the Lua script checks the number of messages in the channel. If the channel is not yet at full capacity, it sends the message; however, if the channel has hit full capacity, that channel is skipped silently. I wrote a PR so that the number of channels at full capacity is logged, which will help with debugging (a possible interim mitigation is sketched below).
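
Until that logging is available, one hedge (not a fix) is to raise the per-channel capacity and tune the expiry in the channel layer config, so a slow or dead reader takes longer to hit the limit. A sketch of the Django settings, with purely illustrative numbers:

# settings.py (sketch; numbers are illustrative, not recommendations)
CHANNEL_LAYERS = {
    "default": {
        "BACKEND": "channels_redis.core.RedisChannelLayer",
        "CONFIG": {
            "hosts": [("127.0.0.1", 6379)],
            "capacity": 500,  # default is 100 messages per channel
            "expiry": 30,     # seconds before an unread message is discarded
        },
    },
}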

SerhiiK17 commented 4 years ago

Here is an always-reproducible example of a similar issue: https://stackoverflow.com/questions/59195607/django-channels-daphne-and-docker-clients-stop-receiving-messages-when-the-fir

sabuhish commented 4 years ago

Any solution? I'm still facing the same issue and have been stuck for almost three days. I think async_to_sync creates a new connection each time it is called... is there any alternative?

kmiku7 commented 4 years ago

I encountered a similar problem: the client side receives nothing from the websocket after a period of time. After debugging, my problem turned out to be caused by a Python 3 bug: https://github.com/python/cpython/blob/v3.6.1/Lib/asyncio/locks.py#L163. asyncio does not handle exceptions properly when acquiring a lock, which leaves other waiters in a deadlocked state. After upgrading my local Python it was fixed. Hope this is helpful.

hamedkalantari commented 4 years ago

We encountered this problem while developing a real-time game. As far as I can remember, sometimes a client may disconnect from the websocket without the disconnect event ever being called. I think the problem is not about django channels (we were using flask), but more about routers and switches: intermediate routers and switches may drop a connection for being idle, and the websocket will not call the disconnect event. So we deployed a PING/PONG messaging system, which solved the problem for us. It's worth saying that we used a random data generator, because plain PING or PONG messages were too light.

That said, the problem @mkoponen mentioned is a serious one.

acu192 commented 3 years ago

I'm seeing this issue in my app. I can confirm in my case:

  1. It is not a silent disconnect of the front-end.
  2. It happens only on group_send ... but just sometimes. The same code-path succeeds most of the time but fails occasionally (fail case = messages sent with group_send are not actually sent).
  3. It never fails when sending directly through channel layer (i.e. self.channel_layer.send(...)) .

There's a common situation in my app where a client sends two messages to the server. One triggers a broadcast (so the server does self.channel_layer.group_send) and the other is intended for a specific client (so the server does self.channel_layer.send). Most of the time this works, but occasionally the broadcast message is silently dropped. This is the scenario that leads to conclusions 1-3 above. The broadcast message is sent first, which is how I know it's not a silent disconnect on the front-end: the front-end that gets the second (direct) message always gets it, but it doesn't always get the broadcast message that was sent first.
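
For reference, the two code paths being compared look roughly like this (a sketch; the consumer, group, and event names are placeholders, not my actual code):

from channels.generic.websocket import AsyncJsonWebsocketConsumer

class GameConsumer(AsyncJsonWebsocketConsumer):
    async def receive_json(self, content, **kwargs):
        # Broadcast path: fan out to every channel in the group
        # (the one that occasionally drops messages).
        await self.channel_layer.group_send(
            "lobby", {"type": "game.update", "data": content}
        )
        # Direct path: send to one known channel name (reported as reliable;
        # here it is our own channel, purely for illustration).
        await self.channel_layer.send(
            self.channel_name, {"type": "game.update", "data": content}
        )

    async def game_update(self, event):
        # "game.update" dispatches to game_update (dots become underscores).
        await self.send_json(event["data"])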

I'm using redis, although others here have stated they see this problem no matter what channel layer they use. I haven't tried other channel layers myself. I'm just digging into this right now...

Channels 3.0.1 Python 3.7

itsmehdi97 commented 2 years ago

I'm using uvicorn to serve my app and I'm still having this issue. In my case the problem goes away entirely when I use only 1 uvicorn worker, but the more workers I use, the more messages are dropped.