digitalmethodsinitiative / dmi-tcat

Digital Methods Initiative - Twitter Capture and Analysis Toolset
Apache License 2.0

Memory consumption urlexpand script #298

Closed by limogin 6 years ago

limogin commented 6 years ago

I see excessive memory consumption by the urlexpand script, up to 99% of available memory. My server has 32 GB of RAM, and at the moment I have only a single search bin, "test", set up.

python helpers/urlexpand.py

As I understand it, this process shouldn't consume so much memory. The installed version is the latest one currently available.

dentoir commented 6 years ago

Hi @limogin

I've seen issues (#82) with this script hanging, but never with memory consumption. We already have an open bug for this: the URL resolving needs a more robust implementation. You could try running the script in the foreground (instead of in the background, from cron) to see whether it hangs or produces relevant error output. You'll see lots of output (mostly about the SSL certificates of sites), but there may be something interesting in it. You can paste the last hundred lines here.

Use the following commands:

cd /var/www/dmi-tcat/helpers
python /var/www/dmi-tcat/helpers/urlexpand.py

limogin commented 6 years ago

Here is some of the output:

/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/gevent-1.2.2-py2.7-linux-x86_64.egg/gevent/greenlet.py", line 536, in run
    result = self._run(*self.args, **self.kwargs)
  File "helpers/urlexpand.py", line 123, in job
    resp = requests.get(url, headers=request_headers, timeout=socket_timeout, verify=False)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 640, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 218, in resolve_redirects
    **adapter_kwargs
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 658, in send
    r.content
  File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 823, in content
    self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
  File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 745, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/usr/local/lib/python2.7/dist-packages/urllib3/response.py", line 432, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/usr/local/lib/python2.7/dist-packages/urllib3/response.py", line 598, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python2.7/dist-packages/urllib3/response.py", line 540, in _update_chunk_length
    line = self._fp.fp.readline()
AttributeError: 'NoneType' object has no attribute 'readline'
Tue Jan 23 10:56:53 2018 <Greenlet at 0x7fdb6c6ff190: job('http://ht.ly/Fj4N30hTzVL', 'test_urls')> failed with AttributeError

/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
dentoir commented 6 years ago

I've now seen this behavior too on one of our own servers. The script was still updating URLs but consuming a lot of memory. Looks like a typical memory leak.
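
One way to confirm growth like this is to log the process's peak resident set size between batches of work. The following is a minimal sketch using only the standard library; `report_growth`, `peak_rss` and the `process_batch` callback are hypothetical names for illustration and are not part of urlexpand.py. It should run on both Python 2.7 and Python 3.

```python
import resource


def peak_rss():
    """Peak resident set size so far (kilobytes on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def report_growth(batches, process_batch):
    """Run process_batch on each batch and print how much the peak grew."""
    growth = []
    for index, batch in enumerate(batches):
        before = peak_rss()
        process_batch(batch)
        # ru_maxrss is a high-water mark, so the difference is never negative.
        delta = peak_rss() - before
        growth.append(delta)
        print("batch %d: peak RSS grew by %d" % (index, delta))
    return growth
```

A peak that keeps rising by a similar amount on every batch would be consistent with a leak, although it does not show where the memory is being held.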

limogin commented 6 years ago

If I can help in any way, let me know.

limogin commented 6 years ago

I will try to limit the amount of RAM and the priority of the process until we can fix this issue:

0 * * * * su -l mywebuser -c '(cd /var/www/myapppath/; ulimit -m 1000000 && nice -n 19 python helpers/urlexpand.py)'
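
Many systems do not honor the resident-set limit that `ulimit -m` sets, so the cap above may have no effect. A rough alternative, assuming Linux, is to cap the address space instead and lower the priority in the same step. The wrapper name `run_limited` and the 1 GiB figure in the comment are assumptions for illustration, not part of DMI-TCAT.

```python
import os
import resource
import subprocess
import sys


def run_limited(cmd, max_bytes, niceness=19):
    """Run cmd with a capped address space and lowered CPU priority.

    Returns the exit status of the child process.
    """

    def apply_limits():
        # Runs in the child just before exec, so the parent is unaffected.
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))
        os.nice(niceness)

    return subprocess.call(cmd, preexec_fn=apply_limits)


# Hypothetical usage from a small cron wrapper (cap of roughly 1 GiB):
#   run_limited([sys.executable, "helpers/urlexpand.py"], 1024 ** 3)
```

With a cap like this, allocations beyond the limit fail inside the script (typically as a MemoryError) rather than exhausting the server. The right ceiling is a guess and would need tuning.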

dentoir commented 6 years ago

See issue #82 for the suggested fix (replacing the script with a PHP script).

dentoir commented 6 years ago

Ping @limogin