Open tlvu opened 2 years ago
@cehbrecht @fmigneault
Stats a week later: twitcher memory went down from 23.73GiB to 15.23GiB, but that is still way too high compared to the 28.96MiB of the proxy (Nginx).
$ docker stats proxy twitcher magpie thredds
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
fad5cec748b9 proxy 0.11% 28.96MiB / 125.8GiB 0.02% 8.65TB / 8.73TB 0B / 0B 10
4612ca2a1be0 twitcher 0.24% 15.23GiB / 125.8GiB 12.11% 9.18TB / 9.18TB 0B / 0B 51
2375b043f595 magpie 0.09% 296MiB / 125.8GiB 0.23% 79.3MB / 63.1MB 0B / 0B 17
81aaf9dcbdc8 thredds 103.82% 13.7GiB / 125.8GiB 10.89% 42.6GB / 8.57TB 0B / 0B 95
The only recurrent error in docker logs twitcher is this one:
2022-03-07 21:15:40,921 INFO [TWITCHER|magpie.adapter.magpieowssecurity:230][ThreadPoolExecutor-0_0] User None is requesting 'read' permission on [/ows/proxy/thredds/dodsC/datasets/simulations/bias_adjusted/cmip5/pcic/bccaqv2/day_HadGEM2-AO_historical+rcp45_r1i1p1_19500101-21001231_bccaqv2.ncml.dods]
[2022-03-07 21:15:40 +0000] [152] [ERROR] Socket error processing request.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 438, in _error_catcher
yield
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 767, in read_chunked
chunk = self._handle_chunk(amt)
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 720, in _handle_chunk
returned_chunk = self._fp._safe_read(self.chunk_left)
File "/usr/local/lib/python3.7/http/client.py", line 630, in _safe_read
raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(3498 bytes read, 4694 more expected)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/requests/models.py", line 758, in generate
for chunk in self.raw.stream(chunk_size, decode_content=True):
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 572, in stream
for line in self.read_chunked(amt, decode_content=decode_content):
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 793, in read_chunked
self._original_response.close()
File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 455, in _error_catcher
raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(3498 bytes read, 4694 more expected)', IncompleteRead(3498 bytes read, 4694 more expected))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/gthread.py", line 271, in handle
keepalive = self.handle_request(req, conn)
File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/gthread.py", line 343, in handle_request
util.reraise(*sys.exc_info())
File "/usr/local/lib/python3.7/site-packages/gunicorn/util.py", line 626, in reraise
raise value
File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/gthread.py", line 328, in handle_request
for item in respiter:
File "/usr/local/lib/python3.7/site-packages/requests/models.py", line 761, in generate
raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(3498 bytes read, 4694 more expected)', IncompleteRead(3498 bytes read, 4694 more expected))
Any other clues as to how to diagnose this memory leak?
I'm wondering if the chunk size is too high for this kind of usage?
https://github.com/bird-house/twitcher/blob/master/twitcher/owsproxy.py#L58-L63
Or maybe the encoding error causes the self.resp to be incorrectly closed/disposed and leaves something alive (but I would expect objects to die and be garbage collected after leaving the execution context)?
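For context, here is a minimal sketch (not the actual twitcher/owsproxy.py code; the proxy_response function, the 64 KiB CHUNK_SIZE value, and the Pyramid Response usage are mine for illustration) of how a streaming proxy built on requests typically forwards the upstream body. The chunk_size is the knob I mean above: each worker thread only buffers one chunk at a time, so chunk size times the number of concurrent streams bounds that part of the memory use.

```python
# Sketch only, not the real owsproxy code: stream an upstream response
# through the proxy one chunk at a time instead of buffering it whole.
import requests
from pyramid.response import Response

CHUNK_SIZE = 64 * 1024  # hypothetical value; smaller -> less memory per in-flight request


def proxy_response(upstream_url):
    # stream=True keeps the upstream body out of memory; iter_content()
    # yields at most CHUNK_SIZE bytes per iteration, and gunicorn iterates
    # this app_iter when sending the response to the client (which is where
    # the ChunkedEncodingError in the traceback above gets raised).
    resp = requests.get(upstream_url, stream=True)
    return Response(
        app_iter=resp.iter_content(chunk_size=CHUNK_SIZE),
        status=resp.status_code,
        content_type=resp.headers.get("Content-Type"),
    )
```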
I'm not sure if this is necessarily a memory leak issue; it looks more like a memory handling "optimization". While files are being streamed for download, each chunk requires some memory on the twitcher side, whereas nginx probably only passes the stream through, so it needs no extra memory for it. Since there can be (workers: 10 x threads: 4 = 40) concurrent handlers, this accumulates by chunk size over the different downloads: https://github.com/Ouranosinc/birdhouse-deploy/blob/master/birdhouse/config/twitcher/twitcher.ini.template#L88-L89 Each process then keeps hold of that memory for reuse the next time it needs it for another download, instead of requesting more (not a twitcher thing, just generic OS memory handling). I don't think there is any pythonic method to explicitly release that memory other than killing the processes at some point.
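If killing processes at some point is the only practical way to hand that memory back, gunicorn's built-in worker recycling could do it automatically. A sketch, assuming a standalone gunicorn.conf.py (the same max_requests settings can also go in the [server:main] section of twitcher.ini when gunicorn is started through PasteDeploy; the values below are examples, not recommendations):

```python
# Hypothetical gunicorn.conf.py sketch for periodic worker recycling.
workers = 10               # matches the workers x threads = 40 figure above
threads = 4
max_requests = 1000        # restart a worker after ~1000 requests, releasing its memory
max_requests_jitter = 100  # stagger restarts so workers are not all recycled at once
```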
In ESGF they want to use a security proxy with nginx configuration: https://github.com/cedadev/django-auth-service
The other parts of the service are implemented with Python/Django.
Twitcher can do the same and is more convenient to use.
Maybe we need a new implementation that works for both (ESGF, Birdhouse ...). With Nginx, FastAPI, Flask, ... ?
Platforms that use Twitcher and Magpie in our case already sit behind an nginx proxy. That proxy is used only for redirecting to the various services, and it is already pretty big by itself without any security rules. We must also remember that Magpie doesn't only do allow/deny session handling at the top level, but per sub-resource access control for each service, over a large set of permissions.
From experience, I found that doing very specific rules/path/query filtering with nginx can be very complicated and produce a lot of unexpected behaviour. I would avoid this as much as possible.
Describe the bug
Twitcher is constantly using a lot of memory.
Note below that Twitcher is using 23G of memory while the front proxy only uses 30M, despite having similar "NET I/O".
We have had a lot of Thredds activity lately, so it's normal for Thredds' CPU and memory consumption to increase since it has a caching feature.
But Twitcher saw a proportional increase, which is puzzling to me, and it stays like this during idle time.
Related bug on Magpie https://github.com/Ouranosinc/Magpie/issues/505
To Reproduce
PAVICS deployed at this commit from our production fork of birdhouse-deploy, https://github.com/Ouranosinc/birdhouse-deploy/commit/76dd3c86e57d6a96ca84d2124f3b019fe278645a, which tripled Thredds memory.
Lots of Thredds OpenDAP access on large .ncml files.
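A rough load-generation sketch to reproduce this kind of traffic (the BASE host and DATASET path below are placeholders, not real endpoints; point them at a large .ncml dataset behind Twitcher):

```python
# Hypothetical reproducer: many concurrent OpenDAP downloads of a large
# .ncml dataset streamed through Twitcher, to mimic the Thredds activity.
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

BASE = "https://pavics.example.org/twitcher/ows/proxy/thredds"        # placeholder
DATASET = "/dodsC/datasets/simulations/some_large_dataset.ncml.dods"  # placeholder


def download_once():
    """Stream one full .dods response, returning the number of bytes read."""
    total = 0
    with requests.get(BASE + DATASET, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=64 * 1024):
            total += len(chunk)
    return total


if __name__ == "__main__":
    # Keep several downloads in flight so multiple Twitcher workers/threads
    # stream at the same time, then watch `docker stats twitcher` after the
    # load stops to see whether memory ever comes back down.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(download_once) for _ in range(50)]
        for future in as_completed(futures):
            print(f"downloaded {future.result():,} bytes")
```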
Expected behavior
Twitcher CPU and memory would increase during Thredds transfer/activity but should go down during idle time, just like the proxy container.
Context
[{"name": "Twitcher", "version": "0.6.2", "type": "application"}, {"name": "magpie.adapter.MagpieAdapter", "version": "3.19.1", "type": "adapter"}]