Pylons / waitress

Waitress - A WSGI server for Python 3
https://docs.pylonsproject.org/projects/waitress/en/latest/

Memory/Resource leak #357

Closed fschulze closed 2 years ago

fschulze commented 2 years ago

With devpi-server I can now confirm a memory leak somehow related to waitress. I observed it on Debian and FreeBSD, but not on macOS. The last test I did was replacing waitress with cheroot, and with that the leak is completely gone. I'm currently working on creating a minimal example for reproduction, but wanted to give a heads-up in case someone else noticed something and started investigating. It is certainly possible it is related to how we use waitress in devpi-server, but I don't think we do anything special. I don't want to switch to cheroot, because it lacks some features and seems to be less performant.

fschulze commented 2 years ago

While trying to reproduce this, I realized it is not a leak, but it seems like waitress isn't cleaning up right after serving a request. If I reduce the number of threads, I can clearly see an upper limit on memory usage. If requests with large response bodies are followed by requests with small bodies, the memory usage goes down. So depending on the usage pattern, memory usage can get very high if there were a bunch of large responses. Is there a way to fix this from my side, or does this need to be fixed in waitress?

slav0nic commented 2 years ago

Pay attention to devpi_server.readonly.DictViewReadonly; as far as I can see, it leaks on every new request like /root/pypi/+simple/<pkgname>/ (e.g. tothemoon), on both the waitress and cheroot servers. It looks like cheroot.wsgi is more memory efficient in this case (and it seems to work faster, but I didn't dive deeper), but that does not really get rid of the problem at your app level.

fschulze commented 2 years ago

@slav0nic using devpi-server directly for this leads down many dead ends, of which DictViewReadonly is unfortunately one. There is an internal cache, but it has a size limit and eventually flattens out. I'm very sorry to have wasted your time there. The key insight is that with sys._debugmallocstats() all memory used by Python is accounted for, including the cache above etc., but the virtual and resident memory of the process is much higher than what sys._debugmallocstats() reports.
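For reference, a minimal sketch of that kind of comparison (illustrative only; sys._debugmallocstats() prints to stderr, and ru_maxrss is reported in kilobytes on Linux but bytes on macOS):

```python
import sys
import resource

# Dump CPython's allocator statistics (arenas, pools, blocks) to stderr;
# this covers the memory that Python itself accounts for.
sys._debugmallocstats()

# Compare against the OS view of the process: peak resident set size.
peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS reported by the OS: {peak_rss}", file=sys.stderr)
```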

I tried to create a minimal example, but so far I can only partially replicate the issue. I'm still missing some key interaction between what devpi-server does and what waitress does.

digitalresistor commented 2 years ago

Waitress buffers the output; it is stored in overflowable buffers that may write data to disk once they reach a certain high-water mark. Those buffers are then cycled out as they are drained.

The OverflowableBuffer is here: https://github.com/Pylons/waitress/blob/master/src/waitress/buffers.py#L190
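For intuition, here is a tiny illustrative sketch of the overflow-to-disk idea (not waitress's OverflowableBuffer; the class name and threshold are made up):

```python
import io
import tempfile

class TinyOverflowBuffer:
    """Illustration only: buffer bytes in memory, spill to a temp file
    once a high-water mark is crossed."""

    def __init__(self, overflow_at=64 * 1024):
        self.overflow_at = overflow_at
        self.buf = io.BytesIO()
        self.overflowed = False

    def append(self, data):
        self.buf.write(data)
        if not self.overflowed and self.buf.tell() >= self.overflow_at:
            # Copy the in-memory contents into an unnamed temp file and
            # keep writing there from now on.
            filebuf = tempfile.TemporaryFile()
            filebuf.write(self.buf.getvalue())
            self.buf.close()  # release the BytesIO memory explicitly
            self.buf = filebuf
            self.overflowed = True

    def close(self):
        self.buf.close()
```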

A task (WSGI thread) will call write_soon on a channel:

https://github.com/Pylons/waitress/blob/84cfc2b8ac81b55a7223f51f6d807f634d98c0f9/src/waitress/channel.py#L332

This checks to make sure that we are not above outbuf_high_watermark; if we are, we block the WSGI task from continuing to add more data to the output buffers.

To avoid growing the output buffer unbounded (this was an issue because multiple tasks may re-use the same output buffer when sending data back to the same connection), we check whether the output buffer has itself reached a high-water mark, and if so we create a new one: https://github.com/Pylons/waitress/blob/84cfc2b8ac81b55a7223f51f6d807f634d98c0f9/src/waitress/channel.py#L355

Then we just append the data to the existing buffer:

https://github.com/Pylons/waitress/blob/84cfc2b8ac81b55a7223f51f6d807f634d98c0f9/src/waitress/channel.py#L361
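As a rough illustration of that flow (not waitress's channel code; the class name and the watermark values are invented for the sketch):

```python
import threading

OUTBUF_HIGH_WATERMARK = 16 * 1024 * 1024  # invented value for the sketch
PER_BUFFER_ROLLOVER = 1 * 1024 * 1024     # invented value for the sketch

class SketchChannel:
    """Illustration of the write_soon flow described above."""

    def __init__(self):
        self.outbufs = [bytearray()]       # queue of output buffers
        self.lock = threading.Condition()  # the I/O side would notify this as it drains

    def write_soon(self, data: bytes) -> int:
        with self.lock:
            # Backpressure: block the WSGI task while queued output exceeds
            # the high-water mark (a real server wakes this up from the I/O
            # loop as buffers drain).
            while sum(len(b) for b in self.outbufs) > OUTBUF_HIGH_WATERMARK:
                self.lock.wait()
            # Recycling: if the current buffer has itself grown large, start
            # a fresh one instead of letting it grow without bound.
            if len(self.outbufs[-1]) >= PER_BUFFER_ROLLOVER:
                self.outbufs.append(bytearray())
            # Otherwise just append to the existing (last) buffer.
            self.outbufs[-1] += data
            return len(data)
```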

Once the buffer is added to the back of the list, it goes through _flush_some, which does the actual sending to the TCP/IP socket once that socket is marked writable. That happens here:

https://github.com/Pylons/waitress/blob/84cfc2b8ac81b55a7223f51f6d807f634d98c0f9/src/waitress/channel.py#L245

We send data in chunks, in a tight loop, until send fails because the socket would block (and thus returns 0 for the number of bytes sent):

https://github.com/Pylons/waitress/blob/84cfc2b8ac81b55a7223f51f6d807f634d98c0f9/src/waitress/channel.py#L257

When a buffer is empty and it is not the last one, it is popped from the list:

https://github.com/Pylons/waitress/blob/84cfc2b8ac81b55a7223f51f6d807f634d98c0f9/src/waitress/channel.py#L274

If it is the last one, it sticks around, but the task thread will eventually recycle it if too much data has been written to it. This avoids the issue we had previously, whereby buffers were growing unbounded.

We call .close() on the buffer when we drop it.
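A sketch of that drain loop (illustrative only; the chunk size and function name are made up, and real buffer objects would get .close() called when dropped):

```python
import socket

SEND_BYTES = 18000  # invented chunk size for the sketch

def flush_some(sock: socket.socket, outbufs: list) -> bool:
    """Drain queued bytearray buffers into a non-blocking socket."""
    sent_anything = False
    while outbufs:
        buf = outbufs[0]
        while buf:
            try:
                n = sock.send(buf[:SEND_BYTES])
            except BlockingIOError:
                n = 0
            if n == 0:
                # Socket would block; stop until it is marked writable again.
                return sent_anything
            del buf[:n]
            sent_anything = True
        if len(outbufs) > 1:
            # Drained and not the last buffer: pop (and close) it.
            outbufs.pop(0)
        else:
            # The last buffer sticks around; the task thread recycles it
            # if too much data has been written to it.
            break
    return sent_anything
```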

Append is fairly simple: https://github.com/Pylons/waitress/blob/master/src/waitress/buffers.py#L244

There's one issue I do see. I'm not sure how easily I can test whether it fixes this problem, so lemme open a PR real quick; if you could give that a whirl, I would appreciate it.

digitalresistor commented 2 years ago

@fschulze would you mind giving this change a test run: https://github.com/Pylons/waitress/pull/358/

It's the only issue I can see: when we overflow from a BytesIO to a file, we may end up keeping the BytesIO around without calling .close() on it. Eventually I would expect Python to clean that memory up, because the object goes out of scope...
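The shape of the suspected fix, as a hedged sketch (not the actual diff in the PR; the function name is made up):

```python
import io
import tempfile

def overflow_to_file(membuf: io.BytesIO):
    """When spilling an in-memory buffer to a temp file, close the BytesIO
    explicitly instead of waiting for garbage collection to reclaim it."""
    filebuf = tempfile.TemporaryFile()
    filebuf.write(membuf.getvalue())
    membuf.close()  # release the memory now rather than "eventually"
    return filebuf
```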

digitalresistor commented 2 years ago

> Pay attention to devpi_server.readonly.DictViewReadonly; as far as I can see, it leaks on every new request like /root/pypi/+simple/<pkgname>/ (e.g. tothemoon), on both the waitress and cheroot servers. It looks like cheroot.wsgi is more memory efficient in this case (and it seems to work faster, but I didn't dive deeper), but that does not really get rid of the problem at your app level.

cheroot does not buffer outgoing responses, and each connection is directly tied to a WSGI thread. This means that if a client is slow to read, you end up holding up that thread for as long as it takes them to read the response. (my cheroot experience is a couple of years out of date now, so it may not be accurate anymore)

Waitress attempts to avoid that by buffering up to a certain amount of the response, so that the WSGI thread can start working on another request sooner.

fschulze commented 2 years ago

@bertjwregeer that seems to be it. Even with the default 50 threads, the memory doesn't grow above 7GB VIRT and 5GB RES in my initial testing. I'll let it run for a few days and report back.

fschulze commented 2 years ago

@bertjwregeer memory usage has been stable for the past 20 days

digitalresistor commented 2 years ago

@fschulze I have released a new beta version: https://pypi.org/project/waitress/2.1.0b0/

I would absolutely love your help in testing/validating the various changes before I ship them!

fschulze commented 2 years ago

I have updated m.devpi.net and another instance, and will let you know if anything comes up or whether it remains stable after a few days.

fschulze commented 2 years ago

@bertjwregeer so far nothing seems broken for me and memory usage is stable (slightly higher than before, but that is likely unrelated).

digitalresistor commented 2 years ago

@fschulze thank you so much for testing! I'll likely release it sometime this weekend!

fschulze commented 2 years ago

It seems like the leak isn't gone completely. It is much, much better than before, but it still very slowly leaks resident memory that is not accounted for by Python's arena allocations. I will investigate further.

This shouldn't hold up a release though.

digitalresistor commented 2 years ago

@fschulze I look forward to your investigation, in the mean time I have released 2.1.0: https://pypi.org/project/waitress/2.1.0/