ScanOC / trunk-player

Trunk Player - Python Django project to play back recorded radio transmissions. Used on site: http://scanoc.com
MIT License

worker process crashing (with & without threading) #64

Closed: md5wm closed this issue 7 months ago

md5wm commented 5 years ago

Hi, for the past couple of months now I've run into an issue with how the Python runworker process handles exceptions and then crashes. The crash appears to be related to Daphne becoming backlogged or hung. Running runworker in a threaded configuration made the crashes a bit less noticeable to end users. When a crash does happen, the web front end appears to hang and no longer updates with newly added transmissions.

The current installation looks something like this: three machines running on VMware, plus one physical machine.

All three VMs are on NVMe storage on the same host hardware. There's no notable issue between the virtual machines, or between the VMs and the database machine. Below is the error output from the most recent crash:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/radio/trunk-player/env/lib/python3.6/site-packages/channels/worker.py", line 151, in run
    consumer_finished.send(sender=self.__class__)
  File "/home/radio/trunk-player/env/lib/python3.6/site-packages/django/dispatch/dispatcher.py", line 193, in send
    for receiver in self._live_receivers(sender)
  File "/home/radio/trunk-player/env/lib/python3.6/site-packages/django/dispatch/dispatcher.py", line 193, in <listcomp>
    for receiver in self._live_receivers(sender)
  File "/home/radio/trunk-player/env/lib/python3.6/site-packages/channels/message.py", line 117, in send_and_flush
    "immediately=True to send()." % (sender, self.retry_time)
RuntimeError: Failed to send queued message to daphne.response.NZUuUEYtaf!eiWTxbmPzZ after retrying for 20.00s.
You need to increase the consumption rate on this channel, its capacity,
or handle the ChannelFull exception yourself after adding
immediately=True to send().

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/radio/trunk-player/env/lib/python3.6/site-packages/channels/worker.py", line 151, in run
    consumer_finished.send(sender=self.__class__)
  File "/home/radio/trunk-player/env/lib/python3.6/site-packages/django/dispatch/dispatcher.py", line 193, in send
    for receiver in self._live_receivers(sender)
  File "/home/radio/trunk-player/env/lib/python3.6/site-packages/django/dispatch/dispatcher.py", line 193, in <listcomp>
    for receiver in self._live_receivers(sender)
  File "/home/radio/trunk-player/env/lib/python3.6/site-packages/channels/message.py", line 117, in send_and_flush
    "immediately=True to send()." % (sender, self.retry_time)
RuntimeError: Failed to send queued message to daphne.response.NZUuUEYtaf!RVPkqhWoHB after retrying for 20.00s.
You need to increase the consumption rate on this channel, its capacity,
or handle the ChannelFull exception yourself after adding
immediately=True to send().

As you can see, I've already raised the channels package's retry time from 2 seconds to 20 seconds. The issue still persists, at completely random intervals.
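The error text also offers an alternative to raising the retry time: pass immediately=True to send() and handle the ChannelFull exception on the sending side. I haven't tried it, but a minimal sketch of that pattern (the group name and payload below are placeholders, not trunk-player's actual code) would look something like this:

import json
import logging

from asgi_redis import RedisChannelLayer
from channels import Group

logger = logging.getLogger(__name__)


def push_update(payload):
    # Send without queueing on the worker's retry loop; if the channel is
    # already full, drop this update instead of letting the worker thread die.
    try:
        Group("example_group").send({"text": json.dumps(payload)}, immediately=True)
    except RedisChannelLayer.ChannelFull:
        logger.warning("Channel full, dropping websocket update")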

A Google search sends me over to the Django project here, which suggests modifying CHANNEL_LAYERS to accommodate the activity. As such, my CHANNEL_LAYERS in settings_local.py looks like this:

CHANNEL_LAYERS = {
    "default": {
        "BACKEND": "asgi_redis.RedisChannelLayer",
        "CONFIG": {
            "hosts": [os.environ.get('REDIS_URL', 'redis://192.168.20.206:6379/9')],
            "channel_capacity": {
                "http.request": 200,
                "http.response!*": 10,
                "websocket.send*": 20,
            },
            "capacity": 100,
        },
        "ROUTING": "radio.routing.channel_routing",
    },
}

The CHANNEL_LAYERS and CACHES options were intentionally commented out in settings.py in favor of the settings_local.py configuration.
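One thing that may be worth double-checking (an assumption on my part from reading the traceback, not something I've verified in the asgi_redis source): the channels that actually fill up are named daphne.response.<id>!<suffix>, so the "http.response!*" pattern above may not match them at all, which would leave them at the default capacity of 100. If that's the case, an explicit pattern for Daphne's reply channels could be added, for example:

            "channel_capacity": {
                "http.request": 200,
                "http.response!*": 10,
                # Hypothetical extra entry: match the daphne.response.* reply
                # channels that appear in the traceback above (value is arbitrary).
                "daphne.response*": 500,
                "websocket.send*": 20,
            },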

It's worth noting that, watching the Redis database (ID 9) specified above, it is being hit at a questionable rate, one that would likely deplete the channel capacity when two or more users are using the web service. The output below was captured while NO users were using the web service.

1558305312.249830 [9 192.168.20.212:48920] "EVALSHA" "3640886a0c8901ca9188f5f7a5f7a346145b9c5f" "1" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.249857 [9 lua] "LPOP" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.300679 [9 192.168.20.212:48920] "EVALSHA" "3640886a0c8901ca9188f5f7a5f7a346145b9c5f" "1" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.300705 [9 lua] "LPOP" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.352670 [9 192.168.20.212:48920] "EVALSHA" "3640886a0c8901ca9188f5f7a5f7a346145b9c5f" "1" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.352698 [9 lua] "LPOP" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.403711 [9 192.168.20.212:48920] "EVALSHA" "3640886a0c8901ca9188f5f7a5f7a346145b9c5f" "1" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.403751 [9 lua] "LPOP" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.454850 [9 192.168.20.212:48920] "EVALSHA" "3640886a0c8901ca9188f5f7a5f7a346145b9c5f" "1" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.454891 [9 lua] "LPOP" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.506768 [9 192.168.20.212:48920] "EVALSHA" "3640886a0c8901ca9188f5f7a5f7a346145b9c5f" "1" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.506806 [9 lua] "LPOP" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.558529 [9 192.168.20.212:48920] "EVALSHA" "3640886a0c8901ca9188f5f7a5f7a346145b9c5f" "1" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.558557 [9 lua] "LPOP" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.609740 [9 192.168.20.212:48920] "EVALSHA" "3640886a0c8901ca9188f5f7a5f7a346145b9c5f" "1" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.609794 [9 lua] "LPOP" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.661735 [9 192.168.20.212:48920] "EVALSHA" "3640886a0c8901ca9188f5f7a5f7a346145b9c5f" "1" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.661776 [9 lua] "LPOP" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.712642 [9 192.168.20.212:48920] "EVALSHA" "3640886a0c8901ca9188f5f7a5f7a346145b9c5f" "1" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.712675 [9 lua] "LPOP" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.764191 [9 192.168.20.212:48920] "EVALSHA" "3640886a0c8901ca9188f5f7a5f7a346145b9c5f" "1" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.764218 [9 lua] "LPOP" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.815244 [9 192.168.20.212:48920] "EVALSHA" "3640886a0c8901ca9188f5f7a5f7a346145b9c5f" "1" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.815269 [9 lua] "LPOP" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.870495 [9 192.168.20.212:48920] "EVALSHA" "3640886a0c8901ca9188f5f7a5f7a346145b9c5f" "1" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.870537 [9 lua] "LPOP" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.922116 [9 192.168.20.212:48920] "EVALSHA" "3640886a0c8901ca9188f5f7a5f7a346145b9c5f" "1" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.922143 [9 lua] "LPOP" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.972997 [9 192.168.20.212:48920] "EVALSHA" "3640886a0c8901ca9188f5f7a5f7a346145b9c5f" "1" "asgi:daphne.response.zpfFOeLwYD!"
1558305312.973024 [9 lua] "LPOP" "asgi:daphne.response.zpfFOeLwYD!"
1558305313.023975 [9 192.168.20.212:48920] "EVALSHA" "3640886a0c8901ca9188f5f7a5f7a346145b9c5f" "1" "asgi:daphne.response.zpfFOeLwYD!"
1558305313.024002 [9 lua] "LPOP" "asgi:daphne.response.zpfFOeLwYD!"
1558305313.074943 [9 192.168.20.212:48920] "EVALSHA" "3640886a0c8901ca9188f5f7a5f7a346145b9c5f" "1" "asgi:daphne.response.zpfFOeLwYD!"
1558305313.074970 [9 lua] "LPOP" "asgi:daphne.response.zpfFOeLwYD!"
1558305313.125897 [9 192.168.20.212:48920] "EVALSHA" "3640886a0c8901ca9188f5f7a5f7a346145b9c5f" "1" "asgi:daphne.response.zpfFOeLwYD!"
1558305313.125923 [9 lua] "LPOP" "asgi:daphne.response.zpfFOeLwYD!"
1558305313.177283 [9 192.168.20.212:48920] "EVALSHA" "3640886a0c8901ca9188f5f7a5f7a346145b9c5f" "1" "asgi:daphne.response.zpfFOeLwYD!"
1558305313.177308 [9 lua] "LPOP" "asgi:daphne.response.zpfFOeLwYD!"
1558305313.228809 [9 192.168.20.212:48920] "EVALSHA" "3640886a0c8901ca9188f5f7a5f7a346145b9c5f" "1" "asgi:daphne.response.zpfFOeLwYD!"

Any assistance or guidance on mitigating this crash would be appreciated. I'm not too deep into how Python works, but I do know my way around it.

One of my overall questions is: would it be possible to slow down the query rate of the EVALSHA and LPOP calls?

I'd post on the Google group, but for some reason Google is telling me the group isn't available right now.


OT: I've noticed that the CACHES setting doesn't get queried at all for the currently configured client_class within Redis (zero activity).

dreinhold commented 5 years ago

Mine was crashing also (well, it still does, I just haven't looked), and once I put some code in to restart it when it died I never really got time to go back and investigate.
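For what it's worth, a restart wrapper doesn't need to be fancy; something along these lines (the paths come from the traceback above, everything else is a placeholder, not trunk-player's documented setup) will bring runworker back up whenever the process exits:

# Minimal "keep it running" wrapper for the worker process.
import subprocess
import time

CMD = ["/home/radio/trunk-player/env/bin/python", "manage.py", "runworker"]

while True:
    # Runs until runworker exits (crash or otherwise), then restarts it.
    subprocess.run(CMD, cwd="/home/radio/trunk-player")
    time.sleep(5)  # brief back-off before restarting

Note that this only helps when the whole process exits; in the threaded case described above, a single worker thread can die while the process stays up, so some kind of health check would still be needed on top of it.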

So the channels package, which creates the websockets and is what uses Redis, is an old version (1.x); the new 2.x release was a rewrite to fix the original 1.x issues. But I have not had time to upgrade the codebase to the new 2.x style (I think it even does away with the Redis requirement for how we use it).

As for the EVALSHA and LPOP: the LPOP is a message being read off the channel, so it's reading the messages and removing them each time. As for why it's so chatty, I'm not sure yet.

I know this answers almost none of your questions. I think that before any deep dive into this issue, moving to Channels 2.x would be the best use of time. I don't know how much time I will have to work on it anytime soon, but I really want to. I'll update with anything I find, and I'm happy to take patches for anything you or anyone else finds.
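To give a rough idea of the shape the 2.x code takes (a sketch only, not the actual trunk-player port; the consumer, group, and event names below are made up): Daphne serves HTTP and websockets directly, the separate runworker process goes away for this use case, and group pushes go through the channel layer with async_to_sync:

import json

from asgiref.sync import async_to_sync
from channels.generic.websocket import WebsocketConsumer
from channels.layers import get_channel_layer


class TransmissionConsumer(WebsocketConsumer):
    group_name = "example_transmissions"  # placeholder group name

    def connect(self):
        async_to_sync(self.channel_layer.group_add)(self.group_name, self.channel_name)
        self.accept()

    def disconnect(self, close_code):
        async_to_sync(self.channel_layer.group_discard)(self.group_name, self.channel_name)

    def new_transmission(self, event):
        # Handles group messages sent with {"type": "new.transmission", ...}
        self.send(text_data=json.dumps(event["payload"]))


def broadcast_new_transmission(payload):
    # Called from model/view code when a new transmission is saved.
    layer = get_channel_layer()
    async_to_sync(layer.group_send)(
        "example_transmissions",
        {"type": "new.transmission", "payload": payload},
    )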

md5wm commented 5 years ago

Hey, not a problem with not answering the questions; your response is helpful nonetheless.

In the near future, would you happen to have a separate repo/branch that I could pull the code from for the 2.x channels Python package? FWIW, I'm perfectly fine whether Redis stays or goes - adapt and overcome, right? :)

MaxwellDPS commented 3 years ago

@md5wm is this issue still there?