bird-house / twitcher

Security Proxy for Web Processing Services (WPS)
http://twitcher.readthedocs.io/en/latest/
Apache License 2.0

Possible memory leak with Twitcher/Magpie #113

Open tlvu opened 2 years ago

tlvu commented 2 years ago

Describe the bug

Twitcher is constantly holding a lot of memory.

Note below that Twitcher is using 23G of memory while the front proxy uses only 30M, even though both have similar "NET I/O".

$ docker stats proxy twitcher thredds
CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
fad5cec748b9        proxy               13.83%              30.29MiB / 125.8GiB   0.02%               4.29TB / 4.34TB     0B / 0B             10
4612ca2a1be0        twitcher            17.63%              23.73GiB / 125.8GiB   18.86%              4.82TB / 4.83TB     0B / 0B             51
81aaf9dcbdc8        thredds             118.74%             7.014GiB / 125.8GiB   5.58%               26.6GB / 4.25TB     0B / 0B             104

We have had a lot of Thredds activity lately, so it is normal for Thredds' CPU and memory consumption to increase since it has a caching feature.

But Twitcher saw a proportional increase, which is puzzling to me, and its memory stays at that level during idle time.

Related bug on Magpie https://github.com/Ouranosinc/Magpie/issues/505

To Reproduce

PAVICS deployed at this commit from our production fork of birdhouse-deploy https://github.com/Ouranosinc/birdhouse-deploy/commit/76dd3c86e57d6a96ca84d2124f3b019fe278645a which triples Thredds memory.

Lots of Thredds OpenDAP access on large .ncml files.

Expected behavior

Twitcher CPU and memory would increase during Thredds transfers/activity but should go down during idle time, just like the proxy container.


tlvu commented 2 years ago

@cehbrecht @fmigneault

Stats a week later: Twitcher memory went down from 23.73GiB to 15.23GiB, but it is still way too high compared to the 28.96MiB of the proxy (Nginx).

$ docker stats proxy twitcher magpie thredds
CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
fad5cec748b9        proxy               0.11%               28.96MiB / 125.8GiB   0.02%               8.65TB / 8.73TB     0B / 0B             10
4612ca2a1be0        twitcher            0.24%               15.23GiB / 125.8GiB   12.11%              9.18TB / 9.18TB     0B / 0B             51
2375b043f595        magpie              0.09%               296MiB / 125.8GiB     0.23%               79.3MB / 63.1MB     0B / 0B             17
81aaf9dcbdc8        thredds             103.82%             13.7GiB / 125.8GiB    10.89%              42.6GB / 8.57TB     0B / 0B             95

The only recurrent error in docker logs twitcher is this one:

2022-03-07 21:15:40,921 INFO  [TWITCHER|magpie.adapter.magpieowssecurity:230][ThreadPoolExecutor-0_0] User None is requesting 'read' permission on [/ows/proxy/thredds/dodsC/datasets/simulations/bias_adjusted/cmip5/pcic/bccaqv2/day_HadGEM2-AO_historical+rcp45_r1i1p1_19500101-21001231_bccaqv2.ncml.dods]
[2022-03-07 21:15:40 +0000] [152] [ERROR] Socket error processing request.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 438, in _error_catcher
    yield
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 767, in read_chunked
    chunk = self._handle_chunk(amt)
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 720, in _handle_chunk
    returned_chunk = self._fp._safe_read(self.chunk_left)
  File "/usr/local/lib/python3.7/http/client.py", line 630, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(3498 bytes read, 4694 more expected)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/requests/models.py", line 758, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 572, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 793, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 455, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(3498 bytes read, 4694 more expected)', IncompleteRead(3498 bytes read, 4694 more expected))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/gthread.py", line 271, in handle
    keepalive = self.handle_request(req, conn)
  File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/gthread.py", line 343, in handle_request
    util.reraise(*sys.exc_info())
  File "/usr/local/lib/python3.7/site-packages/gunicorn/util.py", line 626, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/gthread.py", line 328, in handle_request
    for item in respiter:
  File "/usr/local/lib/python3.7/site-packages/requests/models.py", line 761, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(3498 bytes read, 4694 more expected)', IncompleteRead(3498 bytes read, 4694 more expected))

Any other clues as to how to diagnose this memory leak?

fmigneault commented 2 years ago

I'm wondering if the chunk size is too high for this kind of usage? See https://github.com/bird-house/twitcher/blob/master/twitcher/owsproxy.py#L58-L63

Or maybe the encoding error causes self.resp to be incorrectly closed/disposed and leaves something alive? Though I would expect objects to die and be garbage collected after leaving the execution context.
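For reference, the streaming part of such a proxy boils down to something like the sketch below (a rough illustration only, not the actual owsproxy.py code; the chunk size value and the function name are made up). The point is that each in-flight request only ever holds one chunk in memory, and that the upstream response must be closed even when the transfer is interrupted, as in the ChunkedEncodingError above:

# Minimal sketch of a chunked streaming proxy loop, assuming the upstream
# request is made with the requests library. CHUNK_SIZE and stream_upstream
# are illustrative names, not Twitcher's actual API.
import requests

CHUNK_SIZE = 64 * 1024  # smaller chunks lower the per-request memory footprint

def stream_upstream(upstream_url):
    # stream=True avoids loading the whole upstream body into memory at once
    resp = requests.get(upstream_url, stream=True)
    try:
        for chunk in resp.iter_content(chunk_size=CHUNK_SIZE):
            if chunk:
                yield chunk  # hand each chunk to the client as it arrives
    finally:
        # release the upstream connection even if the client disconnects
        # mid-transfer (the IncompleteRead/ChunkedEncodingError case above)
        resp.close()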

I'm not sure this is necessarily a memory leak issue; it looks more like a memory handling "optimization". While files are being streamed for download, the chunks only require a minimal amount of memory on the Twitcher side, and nginx probably just passes the stream through, so it needs no extra memory. Since there can be up to 40 concurrent request handlers (workers: 10 x threads: 4, see https://github.com/Ouranosinc/birdhouse-deploy/blob/master/birdhouse/config/twitcher/twitcher.ini.template#L88-L89), this accumulates one chunk's worth of memory per handler over different downloads. Each worker process will keep hold of that memory for reuse the next time it needs it for another download instead of requesting more from the OS (not a Twitcher thing, just generic OS memory handling). I don't think there is any pythonic method to explicitly release that memory other than killing processes at some point.
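If killing processes at some point is acceptable, gunicorn can do that automatically by recycling workers. A rough sketch in gunicorn's Python config file format (the numbers are illustrative, and whether this fits Twitcher's deployment via the ini template above would need to be verified):

# Sketch of a gunicorn.conf.py that recycles workers, so memory held by a
# worker is returned to the OS when the worker process exits and is replaced.
workers = 10
threads = 4

# Restart each worker after it has served this many requests...
max_requests = 1000
# ...with some jitter so all workers are not restarted at the same time.
max_requests_jitter = 100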

cehbrecht commented 2 years ago

In ESGF they want to use a security proxy with nginx configuration: https://github.com/cedadev/django-auth-service

The other parts of the service are implemented with Python/Django.

Twitcher can do the same and is more convenient to use.

Maybe we need a new implementation that works for both (ESGF, Birdhouse, ...), built with Nginx, FastAPI, Flask, ...?
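As a rough illustration of that split (not a concrete proposal; the endpoint path and the credential check below are placeholders), nginx's auth_request directive could delegate the allow/deny decision to a small FastAPI service along these lines, which would in turn ask Magpie:

# Illustrative sketch only: a minimal auth endpoint in the FastAPI style
# mentioned above, meant to sit behind nginx's auth_request directive.
# The route and the token check are hypothetical placeholders.
from fastapi import FastAPI, Request, Response

app = FastAPI()

@app.get("/verify")
async def verify(request: Request) -> Response:
    # auth_request forwards the original request headers; a real service
    # would validate the platform's cookie/token against Magpie here.
    token = request.headers.get("Authorization", "")
    if token == "Bearer let-me-in":  # placeholder check only
        return Response(status_code=204)  # any 2xx tells nginx to allow
    return Response(status_code=401)      # 401/403 tells nginx to deny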

fmigneault commented 2 years ago

Platforms that use Twitcher and Magpie, in our case, are already behind an nginx proxy. That proxy is used only for redirecting to the various services, and its configuration is already pretty big by itself without any security rules. We must also remember that Magpie doesn't only do allow/deny session handling at the top level, but also per sub-resource access control for each service, over a large set of permissions.

From experience, I found that doing very specific rules/path/query filtering with nginx can be very complicated and produce a lot of unexpected behaviour. I would avoid this as much as possible.