datamade / django-councilmatic

Django app providing core functions for *.councilmatic.org
http://councilmatic.org
MIT License

Memory consumption: Steps forward #182

Closed reginafcompton closed 6 years ago

reginafcompton commented 6 years ago

The Councilmatic server had significant memory issues, beginning around midnight on February 28. The /var/log/syslog shows that the kernel's OOM killer started to kill processes (python invoked oom-killer) around 12:50, shortly after the LA Metro cron (40 after) and the Chicago cron (45 after) had started.

Mar  1 06:40:01 ip-10-0-0-124 CRON[849]: (datamade) CMD (/usr/bin/flock -n /tmp/lametro_dataload.lock -c 'cd $APPDIR && $PYTHONDIR manage.py import_data >> /tmp/lametro-loaddata.log 2>&1 && $PYTHONDIR manage.py compile_pdfs >> /tmp/lametro-compilepdfs.log 2>&1 && $PYTHONDIR manage.py update_index >> /tmp/lametro-updateindex.log 2>&1 && $PYTHONDIR manage.py data_integrity >> /tmp/lametro-integrity.log')

...

Mar  1 06:45:01 ip-10-0-0-124 CRON[2042]: (datamade) CMD (/usr/bin/flock -n /tmp/chicago_dataload.lock -c 'cd $APPDIR && $PYTHONDIR manage.py import_data >> /tmp/chicago-loaddata.log 2>&1 && $PYTHONDIR manage.py update_index --batch-size=50 --age=1 >> /tmp/chicago_updateindex.log 2>&1 && $PYTHONDIR manage.py send_notifications >> /tmp/chicago_sendnotifications.log 2>&1')

...

Mar  1 06:50:01 ip-10-0-0-124 kernel: [6173922.727963] python invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
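The correlation described above (cron jobs starting shortly before the oom-killer fires) can be checked with a short script. This is only a sketch: the function names, the 15-minute window, and the year passed to the parser are assumptions (syslog lines omit the year), and the cron commands in the sample are abbreviated from the excerpt above.

```python
import re
from datetime import datetime, timedelta

# Matches "Mar  1 06:40:01 host source: message" at the start of a syslog line.
SYSLOG_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<source>[^:]+): (?P<message>.*)$"
)


def parse_line(line, year):
    """Return (timestamp, source, message), or None if the line does not match."""
    match = SYSLOG_LINE.match(line)
    if match is None:
        return None
    # syslog omits the year, so the caller has to supply it.
    stamp = datetime.strptime(f"{year} {match['stamp']}", "%Y %b %d %H:%M:%S")
    return stamp, match["source"], match["message"]


def cron_jobs_before_oom(lines, year, window=timedelta(minutes=15)):
    """For each oom-killer event, list the cron commands started within `window` before it."""
    cron_starts = []
    events = []
    for line in lines:
        parsed = parse_line(line, year)
        if parsed is None:
            continue
        stamp, source, message = parsed
        if source.startswith("CRON"):
            cron_starts.append((stamp, message))
        elif "invoked oom-killer" in message:
            recent = [
                msg for started, msg in cron_starts
                if timedelta(0) <= stamp - started <= window
            ]
            events.append((stamp, recent))
    return events


# Lines from the excerpt above, with the long cron commands abbreviated to '...'.
SAMPLE = [
    "Mar  1 06:40:01 ip-10-0-0-124 CRON[849]: (datamade) CMD (/usr/bin/flock -n /tmp/lametro_dataload.lock -c '...')",
    "Mar  1 06:45:01 ip-10-0-0-124 CRON[2042]: (datamade) CMD (/usr/bin/flock -n /tmp/chicago_dataload.lock -c '...')",
    "Mar  1 06:50:01 ip-10-0-0-124 kernel: [6173922.727963] python invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0",
]

# The year is a placeholder; it is not recorded in the log itself.
events = cron_jobs_before_oom(SAMPLE, year=2018)
for stamp, jobs in events:
    print(stamp, "oom-killer preceded by", len(jobs), "cron job(s)")
```

On a real server the same function could be fed the lines of /var/log/syslog instead of `SAMPLE`.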

@evz and I rebooted the server and then watched memory consumption as the cron tasks executed. We noticed that the LA Metro update_index process required considerable memory: the process (Jetty) consumed about 15% of memory, which doubled to around 30% while inserting the data into the Solr index (Java). Such memory use could be hazardous if it overlaps with other indexing processes (i.e., NYC and Chicago).

We identified several next steps:


Additionally, we noticed that the RTF conversion script for NYC sometimes requires longer than 15 minutes to complete (which delays NYC data imports). Let's replace the RTF-to-HTML conversion with the actual PDFs. It should be possible via this PR.

reginafcompton commented 6 years ago

What is age?

The SearchIndex class provides structured data to the search engine. (Note: the search engine is document-based: each record becomes a single text blob that gets tokenized, analyzed, and indexed, much like a key-value store.) A SearchIndex subclass can define a get_updated_field method, which tells Haystack which model field holds an "updated" timestamp. For Councilmatic, the bill model has an updated_at field, and we tell Haystack all about it. Hence, we can use the --age argument, which limits update_index to records updated within the given number of hours.
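To make the mechanism concrete, here is a plain-Python sketch of the kind of filtering that --age relies on. It does not import Haystack: `Bill`, `BillIndex`, and `records_to_reindex` are hypothetical stand-ins, and only the `get_updated_field` name and the `updated_at` field come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Bill:
    """Hypothetical stand-in for the Councilmatic bill model."""
    identifier: str
    updated_at: datetime


class BillIndex:
    """Stand-in for a Haystack SearchIndex subclass (not the real class)."""

    def get_updated_field(self):
        # Name of the model field that holds the "last updated" timestamp.
        return "updated_at"


def records_to_reindex(index, records, age_hours, now):
    """Return the records whose updated field falls within the last `age_hours` hours."""
    field = index.get_updated_field()
    start = now - timedelta(hours=age_hours)
    return [record for record in records if getattr(record, field) >= start]


# Made-up records and clock, purely for illustration.
now = datetime(2018, 3, 1, 7, 0, 0)
bills = [
    Bill("bill-a", datetime(2018, 3, 1, 6, 30, 0)),   # updated 30 minutes ago
    Bill("bill-b", datetime(2018, 3, 1, 5, 0, 0)),    # updated 2 hours ago
    Bill("bill-c", datetime(2018, 2, 20, 12, 0, 0)),  # updated days ago
]

recent = records_to_reindex(BillIndex(), bills, age_hours=1, now=now)
print([bill.identifier for bill in recent])
```

With an age of 1, only records touched in the last hour are sent to the index, which is why the indexing run stays small on an ordinary day.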

It looks like Chicago had a big data import day, resulting in some unusually large bill counts. I queried our Councilmatic database for bills updated in the last hour: the count was 1704. This number aligns with what I saw in the update_index log (also 1704).
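A cross-check like the one above boils down to counting rows with a recent timestamp. The sketch below uses an in-memory SQLite database so it runs anywhere; the `bill` table, its columns, and the sample rows are hypothetical and do not reflect the actual Councilmatic schema or database engine.

```python
import sqlite3
from datetime import datetime, timedelta


def count_recently_updated(conn, now, hours=1):
    """Count rows in the (hypothetical) bill table updated within the last `hours` hours."""
    cutoff = (now - timedelta(hours=hours)).isoformat(sep=" ")
    # Timestamps are stored as "YYYY-MM-DD HH:MM:SS" text, so string comparison
    # orders them chronologically.
    row = conn.execute(
        "SELECT COUNT(*) FROM bill WHERE updated_at >= ?", (cutoff,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bill (identifier TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO bill VALUES (?, ?)",
    [
        ("bill-a", "2018-03-01 06:30:00"),
        ("bill-b", "2018-03-01 06:55:00"),
        ("bill-c", "2018-02-20 12:00:00"),
    ],
)

now = datetime(2018, 3, 1, 7, 0, 0)  # made-up "current time" for the example
recent_count = count_recently_updated(conn, now, hours=1)
print(recent_count)
```

If the count from the database matches the number of documents reported in the update_index log, the --age filter is selecting the records it should.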

In short, the --age argument works as expected and should be implemented in LA Metro (and other Councilmatics, including staging sites, that do not use it).

reginafcompton commented 6 years ago

Closing - I moved the last bullet point to issue #184