IdentityPython / pyFF

SAML metadata aggregator
https://pyff.io/

High Memory Usage #283

Open mic4ael opened 2 weeks ago

mic4ael commented 2 weeks ago

Code Version

2.1.2

Expected Behavior

The memory used by pyFF is freed after a request finishes.

Current Behavior

Each request that ends in an HTTP 500 error increases pyFF's memory usage by roughly 300 MB, and that memory is never released.

Possible Solution

To alleviate the issue, the parsed tree needs to be cleared explicitly, as shown in the diff below.

diff --git i/src/pyff/api.py w/src/pyff/api.py
index 1050efb..2f17438 100644
--- i/src/pyff/api.py
+++ w/src/pyff/api.py
@@ -4,6 +4,7 @@ from datetime import datetime, timedelta
 from json import dumps
 from typing import Any, Dict, Generator, Iterable, List, Mapping, Optional, Tuple

+import lxml.etree
 import pkg_resources
 import pyramid.httpexceptions as exc
 import pytz
@@ -297,12 +298,18 @@ def process_handler(request: Request) -> Response:
     except ResourceException as ex:
         import traceback

+        if isinstance(r, (lxml.etree._Element, lxml.etree._ElementTree)):
+            r.clear()
+
         log.debug(traceback.format_exc())
         log.warning(f'Exception from processing pipeline: {ex}')
         raise exc.exception_response(409)
     except BaseException as ex:
         import traceback

+        if isinstance(r, (lxml.etree._Element, lxml.etree._ElementTree)):
+            r.clear()
+
         log.debug(traceback.format_exc())
         log.error(f'Exception from processing pipeline: {ex}')
         raise exc.exception_response(500)
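The guard in the diff above can be illustrated in isolation. The sketch below uses the stdlib xml.etree.ElementTree for portability (pyFF itself holds lxml.etree._Element / _ElementTree objects, whose clear() drops a parsed subtree in the same way); the helper name clear_parsed_tree is made up for this example:

```python
import xml.etree.ElementTree as ET

def clear_parsed_tree(r):
    """Release a parsed XML tree's contents if r is one.

    Mirrors the isinstance guard in the proposed fix, shown here with
    the stdlib ElementTree instead of lxml.
    """
    if isinstance(r, ET.Element):
        r.clear()  # drops children, attributes, text and tail
        return True
    return False

root = ET.fromstring("<md><entity id='a'/><entity id='b'/></md>")
assert len(root) == 2
clear_parsed_tree(root)
assert len(root) == 0  # subtree detached; lxml can now reclaim it
```

The point of clearing inside the exception handlers is that the traceback machinery and the local variable r keep the large tree reachable after the request fails, so it is never garbage-collected on its own.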

Steps to Reproduce

The XML files stored under tmp/dynamic total about 50MB in our case. pyFF parses them into an in-memory representation using lxml, so each request increases memory usage by roughly 300MB, and that memory is not freed afterwards.

To reproduce the issue use the following pipeline file:

- when update:
  - load:
      - tmp/dynamic
      - tmp/static
- when request:
  - select:
  - pipe:
      - when accept application/samlmetadata+xml application/xml:
          - first
          - finalize:
              cacheDuration: PT12H
              validUntil: P10D
          - sign:
              key: tmp/default.key
              cert: tmp/default.crt
          - emit application/samlmetadata+xml
          - break
      - when accept application/json:
          - discojson
          - emit application/json
          - break

Run pyff with caching disabled:

PYFF_CACHING_ENABLED=False pyffd -f --frequency=1200 --loglevel=INFO -H 0.0.0.0 -P 8080 --pid_file $PWD/tmp/pyff.pid --dir=$PWD/tmp/ $PWD/tmp/mdx.fd

And run the following (the pipeline above has no branch for Accept: text/plain, so each request takes the error path):

for i in `seq 1 20`;
do
http --print hH 0.0.0.0:8080 'Accept: text/plain'
done

The high memory consumption is most likely caused by lxml trees that remain referenced after a failed request and therefore never have their memory released.
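One way to confirm the per-request growth from outside the pipeline is to watch the process's resident set size. A minimal sketch using the stdlib resource module (ru_maxrss is a high-water mark reported in KiB on Linux and in bytes on macOS, so this is only a coarse indicator, but a ~300 MB jump per request stands out):

```python
import gc
import resource

def rss_mib():
    # High-water-mark RSS of the current process, in MiB
    # (assumes Linux, where ru_maxrss is in KiB).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss // 1024

before = rss_mib()
blob = ["x" * 1024 for _ in range(50_000)]  # allocate ~50 MB to stand in for a parsed tree
after = rss_mib()
del blob
gc.collect()  # ru_maxrss will not drop back: it records the peak
print(f"RSS grew by about {after - before} MiB")
```

The same check can be applied to the running pyffd process (e.g. via ps on the PID from the pid file) between iterations of the request loop above.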