django / channels

Developer-friendly asynchrony for Django
https://channels.readthedocs.io
BSD 3-Clause "New" or "Revised" License
6.11k stars 800 forks

Django Channels Memory Leak on every message or connection #2094

Open cacosandon opened 6 months ago

cacosandon commented 6 months ago

I'm having a memory leak in Django Channels using uvicorn.

Every "memory crash" is a restart/deploy πŸ‘‡

Screenshot 2024-04-30 at 08 44 06

This happens not just within my project, but also with the tutorial's basic chat example.

Here is the repository with that minimal example and memory profiling: https://github.com/cacosandon/django-channels-memory-leak

This happens locally and on the server, with or without DEBUG, just by reconnecting or sending messages (in the example I've added large messages so the leak is easy to notice).

The memory is never released, even after the user disconnects.

Screenshot 2024-04-30 at 08 55 10

I've verified it with memory-profiler and memray (both commands are in the README so you can reproduce)

Dependencies:

Django==5.0.4
channels==4.0.0
channels-redis==4.2.0
uvicorn[standard]==0.29.0

# Profiling memory
memory-profiler==0.61.0
memray==1.12.0

I think I have really tried everything: deleting objects, manual garbage collection, etc. Nothing prevents the memory from increasing, and it is never released back. Any insights? 🙏
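For what it's worth, a quick way to tell whether the retained memory is live Python objects or allocator-level retention is the stdlib's tracemalloc. A minimal standalone sketch, not tied to Channels:

```python
import gc
import tracemalloc

# tracemalloc only sees live Python-level allocations, so it can
# distinguish "objects kept alive" from "RSS that never shrinks".
tracemalloc.start()

payload = bytearray(5 * 1024 * 1024)  # ~5 MiB, like one large websocket message
before, _ = tracemalloc.get_traced_memory()

del payload
gc.collect()
after, _ = tracemalloc.get_traced_memory()

# If this prints True, the object itself was freed -- so a growing RSS
# points at allocator arenas, C-extension buffers, or refs held elsewhere.
print(before - after > 4 * 1024 * 1024)  # True
```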

carltongibson commented 6 months ago

Does the same thing happen with other protocol servers, such as hypercorn and Daphne?

cacosandon commented 6 months ago

I've tested daphne and hypercorn alongside uvicorn. All three show a similar pattern: memory usage increases steadily up to around 160 MiB and then keeps growing indefinitely, as monitored by memory-profiler.

The interesting thing is, while uvicorn shows a continuous rise in memory usage on the memray graph, the graphs for daphne and hypercorn are flat at 80 MiB. This discrepancy makes it unclear which tool provides more reliable data.
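One possible explanation for the discrepancy: memory-profiler reports process RSS while memray-style tracing reports live allocations, and the two can legitimately diverge. A standalone stdlib sketch (Unix-only, because of the `resource` module) showing the two metrics side by side:

```python
import resource
import tracemalloc

tracemalloc.start()
blobs = [bytearray(1024 * 1024) for _ in range(20)]  # ~20 MiB of live Python objects

traced, _ = tracemalloc.get_traced_memory()  # what an allocation tracer sees
# ru_maxrss is what RSS-based tools see: the whole process's peak resident
# set (interpreter, pymalloc arenas, C buffers). Units are platform
# dependent: kilobytes on Linux, bytes on macOS. It never goes back down.
rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print(traced // (1024 * 1024), "MiB traced")  # roughly 20
del blobs
```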

Here are the commands I used for each:

carltongibson commented 6 months ago

And can you see from any of the tools, memray perhaps, which objects are consuming the memory?

(I'd expect a gc.collect() to help here TBH)

carltongibson commented 6 months ago

@cacosandon Also, can you try with the PubSub layer, and see if the results are different there? Thanks.

cacosandon commented 6 months ago

Sure! I'll try to find time today to prepare a report on memray --leaks for each protocol server and test the PubSub layer. I'll get back to you soon, thanks!

cacosandon commented 6 months ago

So I tried multiple combinations. All HTML reports from memray are here: reports.zip

Below are screenshots from them.

First, I tried the Redis channel layer (not PubSub) to capture the memory leaks.

With uvicorn

PYTHONMALLOC=malloc memray run --force -o output.bin -m uvicorn core.asgi:application + memray flamegraph output.bin --force --leaks

So, the leaks report includes memory that was never released, but I don't know how to interpret it correctly. It seems like AuthMiddleware was leaking, but after removing it the results are almost the same.

redis-channels-uvicorn-leaks.html

Screenshot 2024-05-01 at 08 46 10

Here is the screenshot of the uvicorn + leaks without AuthMiddleware:

redis-channels-uvicorn-without-authmiddleware-leaks.html

Screenshot 2024-05-01 at 08 52 05

Then tried with daphne

PYTHONMALLOC=malloc memray run --force -o output.bin -m daphne core.asgi:application + memray flamegraph output.bin --force --leaks

redis-channels-daphne-leaks.html

Screenshot 2024-05-01 at 08 50 50

The interesting part is that hypercorn showed no memory leaks (or maybe memray just isn't working here?)

PYTHONMALLOC=malloc memray run --force -o output.bin -m hypercorn core.asgi:application + memray flamegraph output.bin --force --leaks

redis-channels-hypercorn-leaks.html

Screenshot 2024-05-01 at 08 54 14

Then I tried with garbage collection for uvicorn and daphne. Same story for both.

memray run --force -o output.bin -m uvicorn core.asgi:application + memray flamegraph output.bin --force

redis-channels-uvicorn-gccollect.html

Screenshot 2024-05-01 at 08 55 31

memray run --force -o output.bin -m daphne core.asgi:application + memray flamegraph output.bin --force

redis-channels-daphne-gccollect.html

Screenshot 2024-05-01 at 08 56 03

And finally tried with PubSub for uvicorn and daphne

memray run --force -o output.bin -m uvicorn core.asgi:application + memray flamegraph output.bin --force

redis-pubsub-uvicorn.html

Screenshot 2024-05-01 at 08 57 34

memray run --force -o output.bin -m daphne core.asgi:application + memray flamegraph output.bin --force

redis-pubsub-daphne.html

Screenshot 2024-05-01 at 08 57 56

Just in case, I also removed all the @profile decorators from functions, so the memory leaks were not caused by the memory-profiler library itself.

I hope all these reports help in understanding the constant memory increase.

Right now I am trying to move my application to hypercorn so I can test it on staging, but websocket messages are empty πŸ€” . If I manage to solve it, I'll post the results here!

cacosandon commented 6 months ago

I've managed to get hypercorn working!

For some reason, websocket messages that were bytes-only were sent as {"text": None, "bytes": ... } only under hypercorn, so AsyncWebsocketConsumer's dispatch always called the text handler.

Added a PR for that: https://github.com/django/channels/pull/2097

  async def websocket_receive(self, message):
      """
      Called when a WebSocket frame is received. Decodes it and passes it
      to receive().
      """
-     if "text" in message:
+     if "text" in message and message["text"] is not None:
          await self.receive(text_data=message["text"])
      else:
          await self.receive(bytes_data=message["bytes"])
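To see why the one-line change matters, here is a standalone sketch of the dispatch check (route_frame is a hypothetical helper that just mimics the handler's logic, old check vs. new):

```python
def route_frame(message):
    """Mimic the websocket_receive dispatch: decide which handler an
    ASGI websocket message would reach."""
    # Old check was `if "text" in message:` -- a {"text": None, "bytes": ...}
    # frame (as hypercorn sends for binary data) was misrouted to text.
    if "text" in message and message["text"] is not None:
        return ("text", message["text"])
    return ("bytes", message["bytes"])

# hypercorn-style binary frame: "text" key present but None
print(route_frame({"text": None, "bytes": b"\x01\x02"}))  # ('bytes', b'\x01\x02')
# daphne/uvicorn-style binary frame: no "text" key at all
print(route_frame({"bytes": b"\x01"}))  # ('bytes', b'\x01')
```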

Testing now in staging 🀞

cacosandon commented 6 months ago

There is still a memory leak in my application with hypercorn 😕

It seems that memray just doesn't work with it, because memory-profiler does show a constant non-stop increase with any server protocol.

Screenshot 2024-05-01 at 22 40 21
carltongibson commented 6 months ago

Hi @cacosandon

Looking at the uploaded reports, e.g. redis-pubsub-daphne, the memory usage rises and then stabilises:

Screenshot 2024-05-02 at 08 21 09

The redis-channels-uvicorn-leaks report peaks at 168MB then falls to 151MB.

cacosandon commented 6 months ago

Hey @carltongibson, thank you for taking a look.

Yep, but if you zoom in on redis-pubsub-daphne it just shows the memory increase decelerating (click on the graph). I think the first rise is just normal memory usage, and after that you see the memory leak.

Screenshot 2024-05-02 at 07 28 12

On the other hand, redis-channels-uvicorn-leaks experiences memory drops at intervals due to the PYTHONMALLOC=malloc flag; however, the overall memory usage continues to increase. If you examine each drop, you'll notice that the memory level after each fall is higher than before, without stopping.

Screenshot 2024-05-02 at 07 33 33
cacosandon commented 6 months ago

@carltongibson, do you have any clue about what's happening? Or what else can I try? I'm willing to try anything!

carltongibson commented 6 months ago

@cacosandon Given that you report it happening with the pub sub layer and different servers, not really. You need to identify where the leak is happening. Then it's possible to say something.

cacosandon commented 6 months ago

@carltongibson all my samples are from using RedisChannelLayer or RedisPubSubLayer, with uvicorn, daphne or hypercorn, with the tutorial example. My app has the problem too but I think it's a generalized problem.

Some things I've noticed:

I don't know why nobody else is having this problem. Maybe they just don't send large messages 🤔

carltongibson commented 6 months ago

Hi @cacosandon β€” are you able to identify where the leak is happening? Alas, I haven't had time to dig into this further for you. Without that it's difficult to say too much.

If you can identify a more concrete issue, there's a good chance we can resolve it.

cacosandon commented 6 months ago

@carltongibson no :( that's actually what I'm struggling with: finding the memory leak 😓

I've really tried every tool to detect it, but nothing noticeable or strange shows up in the reports.

mitgr81 commented 6 months ago

I don't know how nobody else is having this problem. Maybe they just don't send large messages

I wouldn't assume that. πŸ˜‰ I've been silently watching and hoping you find more than I did when I looked. We had some success changing servers from daphne to uvicorn. We're still seeing some leakiness, but have resolved to using tools to monitor memory and restart services.

Here are some other things I've watched:

cacosandon commented 6 months ago

@mitgr81 what tools do you use to monitor and restart? For now I would love to implement that.

Will take a look at those resources!

mitgr81 commented 6 months ago

@mitgr81 what tools do you use to monitor and restart? For now I would love to implement that.

We're rocking a bespoke monitor for docker containers. It's pretty simple; essentially we label each container with a valid restart time and a memory limit (among other rules); and the "container keeper" looks for them.
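As a sketch of what such a "container keeper" decision might look like (all names and rules here are hypothetical illustrations, not mitgr81's actual tool):

```python
from dataclasses import dataclass
from datetime import time

@dataclass
class ContainerRules:
    """Rules a hypothetical keeper reads from container labels."""
    memory_limit_mb: int
    restart_window: tuple  # (start, end): times of day when restarts are allowed

def should_restart(rss_mb, now, rules):
    """Restart only when memory exceeds the labelled limit AND the current
    time falls inside the allowed restart window."""
    start, end = rules.restart_window
    in_window = start <= now <= end
    return rss_mb > rules.memory_limit_mb and in_window

rules = ContainerRules(memory_limit_mb=512, restart_window=(time(2), time(4)))
print(should_restart(900, time(3), rules))   # True: over limit, inside window
print(should_restart(900, time(12), rules))  # False: over limit, outside window
```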

mitgr81 commented 5 months ago

@cacosandon - Just curious if you've had any more luck than I have on this.

cacosandon commented 5 months ago

Hey @mitgr81! No updates yet.

We're using garbage collection on every message or new connection. This has helped a bit, but the memory still slowly increases and hits the max in about a week. We usually deploy and automatically restart the machines 2-3 times a week, which temporarily fixes the issue.
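The per-message garbage collection mitigation can be sketched as a plain asyncio decorator (a stopgap illustration under stated assumptions, not Channels API; `collect_after` is a hypothetical name):

```python
import asyncio
import gc
from functools import wraps

def collect_after(handler):
    """Run gc.collect() after each invocation of an async handler.
    A mitigation sketch: it bounds cyclic garbage, not a true leak."""
    @wraps(handler)
    async def wrapper(*args, **kwargs):
        try:
            return await handler(*args, **kwargs)
        finally:
            gc.collect()
    return wrapper

@collect_after
async def receive(text_data):
    # Stand-in for a consumer's receive() body.
    return len(text_data)

print(asyncio.run(receive("hello")))  # 5
```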

I hope the Channels team can look into this to see if it's a general problem with memory leaks. cc @carltongibson

domvie commented 4 months ago

We're running into the same issue. A daphne process used up over 50 GB of RAM on our server before it crashed.

jetale commented 4 months ago

> Hey @mitgr81! No updates yet.
>
> We're using garbage collection on every message or new connection. This has helped a bit, but the memory still slowly increases and hits the max in about a week. We usually deploy and automatically restart the machines 2-3 times a week, which temporarily fixes the issue.
>
> I hope the Channels team can look into this to see if it's a general problem with memory leaks. cc @carltongibson

Hi @cacosandon, Are you using uvicorn or daphne in production? or hypercorn?

cacosandon commented 4 months ago

> Hey @mitgr81! No updates yet.
>
> We're using garbage collection on every message or new connection. This has helped a bit, but the memory still slowly increases and hits the max in about a week. We usually deploy and automatically restart the machines 2-3 times a week, which temporarily fixes the issue.
>
> I hope the Channels team can look into this to see if it's a general problem with memory leaks. cc @carltongibson

> Hi @cacosandon, Are you using uvicorn or daphne in production? or hypercorn?

Hey! uvicorn for now.

bigfootjon commented 4 months ago

@cacosandon: What are the variables you haven't changed? It sounds like you've swapped everything out (including your application's business logic) and the problem still exists which is troubling.

Have you tried simplifying the code down until the problem doesn't exist? AIUI from this thread the channel-layer concept seems to be the cause, but have you tried to stub out the channel-layer code in various ways to see where the problem originates? (if it's not the channel-layer, then the same principle applies: just keep axing code until you've got the simplest program possible that still repros the problem)

(investing in a test harness that artificially generates problematic conditions might aid in discovering the problem by speeding up the testing cycle, if you haven't already done so)
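In that spirit, a server-free harness that reproduces just the allocation pattern (large payloads fanned out to several group members) can speed up the axing cycle. Everything below is a hypothetical stand-in for the channel layer, using only the stdlib:

```python
import asyncio
import tracemalloc

async def fan_out(queues, payload):
    # Mimic group_send: enqueue the same large payload to every member.
    for q in queues:
        await q.put(payload)

async def drain(q, n):
    for _ in range(n):
        await q.get()  # drop each payload as a consumer would

async def main(members=4, messages=50, size=1024 * 1024):
    queues = [asyncio.Queue() for _ in range(members)]
    tracemalloc.start()
    for _ in range(messages):
        await fan_out(queues, bytes(size))
    peak_live, _ = tracemalloc.get_traced_memory()   # payloads still queued
    await asyncio.gather(*(drain(q, messages) for q in queues))
    drained_live, _ = tracemalloc.get_traced_memory()  # after consumers drain
    return peak_live, drained_live

peak, after = asyncio.run(main())
# Once consumers drain, the payloads should be collectable again; if a real
# layer kept growing here, that would localize the retention.
print(peak > after)  # True
```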

cacosandon commented 4 months ago

Hey @bigfootjon!

Running this repository: https://github.com/cacosandon/django-channels-memory-leak, you'll notice the memory leaks.

If you remove the sending of large messages, then the problem disappears unless you open/close connections fast enough to make the memory go up again. It's literally the basic setup of Django Channels, so I don't know what else I should remove.

I think the next step is going deeper into Django Channels source code and start modifying things there. I don't have a lot of time now to do this, so we have mitigated it by monitoring and restarting our servers (for now).

bigfootjon commented 4 months ago

If you find some time to investigate I think removing code from channels is the right approach. If the memory charts aren’t doing it then the opposite (finding ways NOT to allocate memory) is the only path forward

sevdog commented 3 months ago

I tried to investigate this, and in my tests I found an interesting hint: changing the "serializer" changes the memory footprint by a lot.

Since both RedisChannelLayer and RedisPubSubChannelLayer use msgpack, I tried overriding the serialization in the first one with standard-library json and got a very different memory profile:

With msgpack: image

import json
import random
from channels_redis.core import RedisChannelLayer as _RedisChannelLayer

class RedisChannelLayer(_RedisChannelLayer):
    ### Serialization ###
    def serialize(self, message):
        """
        Serializes message to a byte string.
        """
        message = json.dumps(message).encode('utf-8')
        if self.crypter:
            message = self.crypter.encrypt(message)

        # As we use a sorted set to expire messages we need to guarantee uniqueness, with a 12-byte random prefix.
        random_prefix = random.getrandbits(8 * 12).to_bytes(12, "big")
        return random_prefix + message

    def deserialize(self, message):
        """
        Deserializes from a byte string.
        """
        # Removes the random prefix
        message = message[12:]

        if self.crypter:
            message = self.crypter.decrypt(message, self.expiry + 10)
        return json.loads(message.decode('utf-8'))

With JSON: image

As you can see, the memory in the JSON test returns to a "normal" level (there is still some memory that was not released, but much less than with msgpack).

I tested this on python 3.10 inside an alpine-docker container.

Also seems that there is a problem with msgpack and python 3.12: https://github.com/msgpack/msgpack-python/issues/612

I would like to have more time to perform further tests and learn how to better use the memory profiler; for the moment I hope this may help someone else find a solution.
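For anyone who wants to try the override above, it can be wired in through the standard CHANNEL_LAYERS setting; a sketch assuming the subclass lives at the hypothetical module path chat.layers:

```python
# settings.py -- point Channels at the JSON-serializing subclass.
# "chat.layers" is a hypothetical path; adjust to wherever you put the class.
CHANNEL_LAYERS = {
    "default": {
        "BACKEND": "chat.layers.RedisChannelLayer",
        "CONFIG": {
            "hosts": [("127.0.0.1", 6379)],
        },
    },
}
```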

acu192 commented 3 months ago

Cross linking https://github.com/django/channels/pull/1948 which is a long-standing known memory leak in channels.

carltongibson commented 3 months ago

Thanks @acu192. You're absolutely right. There's an unresolved chain of thought, and a likely fix sitting there (stalled for life reasons on my part, I suppose).

@cacosandon if you could test the linked PRs and feedback, that would help greatly.

cacosandon commented 2 months ago

Hey!

Yes, I'd love to help. I tried the PR and it raised some errors, surely because it's outdated and needs a rebase. I'll ping him in the PR!

carltongibson commented 2 months ago

@cacosandon do note there's two related PRs. One for channels and one for channels-redis. You'll need to apply them both.

cacosandon commented 2 months ago

@carltongibson Already rebased both channels and channels-redis PRs. These are the updated dependencies:

Django==5.0.2
channels @ git+https://github.com/fosterseth/channels.git@clean_channels
channels-redis @ git+https://github.com/fosterseth/channels_redis.git@clean_channels
uvicorn[standard]==0.20.0
memory-profiler==0.61.0
memray==1.12.0

and these are the results for Uvicorn with PubSub:

pubsub-uvicorn

It seems the problem persists. I believe @sevdog's investigation around the serializer is likely the root cause, given its generic nature (whether using PubSub or not, and regardless of uvicorn, daphne, or hypercorn, even with a minimal example).

I can test with other settings later. Let me know!

carltongibson commented 2 months ago

@cacosandon OK, thanks for trying it.

As to root cause, I still need to get a minimal reproduce nailed down here, but yes maybe...

We're getting closer to it I suppose πŸ˜…

bigfootjon commented 2 months ago

@cacosandon, if you have some spare cycles could you try switching to the json serializer that @sevdog whipped up? https://github.com/django/channels_redis/pull/398

cacosandon commented 2 months ago

hey!

this is the test I'm running:

    # from the linked consumers.py; assumes "import gc" and "import random"
    # at module level
    async def receive(self, text_data):
        content = await self.decode_json(text_data)

        # Send message to room group
        await self.channel_layer.group_send(
            self.room_group_name, {"type": "chat.message", "content": content}
        )

        for i in range(500):
            # Create a payload of 1-5 MiB
            struct = bytearray(1024 * 1024 * random.randint(1, 5))
            message = bytes(struct)
            await self.channel_layer.group_send(
                self.room_group_name, {"type": "chat.binary", "message": message}
            )

            print(f"Sent message {i + 1} of 500")

            del struct
            gc.collect()

here is the repo: https://github.com/cacosandon/django-channels-memory-leak/blob/main/chat/consumers.py
here is a video explaining how to perform the test (watch at 2x):

https://github.com/user-attachments/assets/8fd5d5db-dffd-4d46-b540-91324beb8659

here are the results:

with original requirements.txt:

uvicorn + redis channels

uvicorn-redis-old

uvicorn + pubsub redis channels

uvicorn-pubsub-redis-old

with https://github.com/django/channels_redis/pull/398

uvicorn + redis channels

uvicorn-redis-new

uvicorn + pubsub redis channels

uvicorn-pubsub-redis-new

seems like something improved! crazy how the memory went on a rollercoaster in the last one. πŸ˜…

dogrocker commented 2 weeks ago

> hey!
>
> this is the test I'm running:
>
> • send two messages, inside two different 'rooms'
> • on every message receive, send another 500 heavy messages like this 👇
>
>     async def receive(self, text_data):
>         content = await self.decode_json(text_data)
>
>         # Send message to room group
>         await self.channel_layer.group_send(
>             self.room_group_name, {"type": "chat.message", "content": content}
>         )
>
>         for i in range(500):
>             # Create a payload of 1-5 MiB
>             struct = bytearray(1024 * 1024 * random.randint(1, 5))
>             message = bytes(struct)
>             await self.channel_layer.group_send(
>                 self.room_group_name, {"type": "chat.binary", "message": message}
>             )
>
>             print(f"Sent message {i + 1} of 500")
>
>             del struct
>             gc.collect()
>
> here is the repo: https://github.com/cacosandon/django-channels-memory-leak/blob/main/chat/consumers.py
> here is a video explaining how to perform the test (watch at 2x):
>
> Screen.Recording.2024-09-22.at.12.59.41.1.mov
>
> here are the results:
>
> with original requirements.txt:
>
> uvicorn + redis channels
>
> uvicorn-redis-old
>
> uvicorn + pubsub redis channels
>
> uvicorn-pubsub-redis-old
>
> with django/channels_redis#398
>
> uvicorn + redis channels
>
> uvicorn-redis-new
>
> uvicorn + pubsub redis channels
>
> uvicorn-pubsub-redis-new
>
> seems like something improved! crazy how the memory went on a rollercoaster in the last one. 😅

Hello! Do you mean that using the PubSub Redis layer and setting "serializer_format": "json" fixes the issue?