datamade / django-councilmatic

:heartpulse: Django app providing core functions for *.councilmatic.org
http://councilmatic.org
MIT License

Reinventing Haystack #171

Closed reginafcompton closed 1 year ago

reginafcompton commented 6 years ago

The Councilmatic search functionality currently depends on Haystack. (It causes a lot of headaches, particularly due to its seemingly limitless hunger for memory.)

This document unravels some of the mysteries of Haystack, outlines why we might consider replacing it, and suggests how to make this happen incrementally.

@evz and I will collaborate on this project.

jeancochrane commented 6 years ago

I ran a quick memory profile of Haystack's update_index command, mostly cribbed from the tracemalloc docs:

import linecache
import os
import tracemalloc

from django.core.management.base import BaseCommand
from django.core.management import call_command

class Command(BaseCommand):
    help = 'Profiler'

    def display_top(self, snapshot, key_type='lineno', limit=10):
        snapshot = snapshot.filter_traces((
            tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
            tracemalloc.Filter(False, "<unknown>"),
        ))
        top_stats = snapshot.statistics(key_type)

        print("Top %s lines" % limit)
        for index, stat in enumerate(top_stats[:limit], 1):
            frame = stat.traceback[0]
            # replace "/path/to/module/file.py" with "module/file.py"
            filename = os.sep.join(frame.filename.split(os.sep)[-2:])
            print("#%s: %s:%s: %.1f KiB"
                % (index, filename, frame.lineno, stat.size / 1024))
            line = linecache.getline(frame.filename, frame.lineno).strip()
            if line:
                print('    %s' % line)

        other = top_stats[limit:]
        if other:
            size = sum(stat.size for stat in other)
            print("%s other: %.1f KiB" % (len(other), size / 1024))
        total = sum(stat.size for stat in top_stats)
        print("Total allocated size: %.1f KiB" % (total / 1024))

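    def handle(self, *args, **options):
        # Start tracing Python-level allocations before indexing begins.
        tracemalloc.start()

        # Run Haystack's indexing command in this process.
        call_command('update_index')

        # The snapshot covers only traced blocks still allocated at this point.
        snapshot = tracemalloc.take_snapshot()
        self.display_top(snapshot)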

Here are the results:

Indexing 15107 nyc bills
Top 10 lines
#1: <frozen importlib._bootstrap_external>:476: 213.7 KiB
#2: fields/related.py:594: 76.3 KiB
    return tuple(rhs_field for lhs_field, rhs_field in self.related_fields if rhs_field)
#3: python3.5/stringprep.py:187: 24.1 KiB
    0x1d7a8:'\u03c9', 0x1d7bb:'\u03c3', }
#4: python3.5/inspect.py:2165: 23.4 KiB
    sigcls=sigcls)
#5: fields/related.py:626: 15.2 KiB
    return tuple((lhs_field.column, rhs_field.column) for lhs_field, rhs_field in source)
#6: utils/functional.py:33: 13.2 KiB
    res = instance.__dict__[self.name] = self.func(instance)
#7: fields/related.py:590: 12.3 KiB
    return tuple(lhs_field for lhs_field, rhs_field in self.related_fields)
#8: db/utils.py:102: 11.5 KiB
    return func(*args, **kwargs)
#9: python3.5/stringprep.py:262: 10.8 KiB
    c9_set = set([917505] + list(range(917536,917632)))
#10: sql/query.py:1444: 7.7 KiB
    targets = tuple(r[0] for r in info.join_field.related_fields if r[1].column in cur_targets)
1214 other: 637.0 KiB
Total allocated size: 1045.4 KiB

That's not a lot of memory, but then again, I could be profiling this wrong. What makes us think that Haystack updating the index is causing the server to crash, again?
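
One possible reason the total is so small (a guess, not something verified against this deployment): `tracemalloc.take_snapshot()` only reports blocks that are still allocated when the snapshot is taken, so anything allocated and freed during indexing won't show up. `tracemalloc.get_traced_memory()` also returns the peak since tracing started, which may be closer to what we care about. Here's a minimal sketch of the difference; `measure` and `workload` are made-up names, and `workload` is just a stand-in for `call_command('update_index')`. Note that neither number would include memory used by Solr itself, since it runs in a separate process.

```python
import tracemalloc


def measure(func):
    """Run func() under tracemalloc and return (current_bytes, peak_bytes)."""
    tracemalloc.start()
    try:
        func()
        # current: traced memory still allocated now
        # peak: highest traced total since tracemalloc.start()
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return current, peak


def workload():
    # Hypothetical stand-in for call_command('update_index'):
    # allocate roughly 2 MiB, then let it be freed when the function returns.
    data = [bytes(1024) for _ in range(2000)]
    return len(data)


if __name__ == "__main__":
    current, peak = measure(workload)
    print("Still allocated: %.1f KiB" % (current / 1024))
    print("Peak allocated:  %.1f KiB" % (peak / 1024))
```

In the management command above, the equivalent change would be to call `tracemalloc.get_traced_memory()` right after `call_command('update_index')` and print the peak alongside the snapshot.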

reginafcompton commented 6 years ago

Here's an example of a MemoryError that occurred while running update_index (when it was baked into import_data): https://sentry.io/datamade/nyc-council-councilmatic/issues/430697575/

jeancochrane commented 6 years ago

We did a little bit of memory debugging this morning to try to figure out what's going on. Some notes:

Two next steps:

  1. Do some research to figure out why Solr hangs on to so much memory during and after indexing
  2. Profile rebuild_index and update_index using a tool like mprof to try to get a more accurate picture of Python's contribution to the memory usage over time
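
For step 2, a rough stdlib-only stand-in for an over-time profile might look like the sketch below. It is not how mprof works: mprof samples the memory of the whole process, while this only samples Python allocations that tracemalloc tracks. `sample_traced_memory` is a hypothetical helper name, and the interval is arbitrary.

```python
import threading
import time
import tracemalloc


def sample_traced_memory(func, interval=0.05):
    """Run func() while a background thread records (elapsed_seconds, current_bytes)."""
    samples = []
    stop = threading.Event()

    def sampler():
        start = time.monotonic()
        while not stop.is_set():
            current, _peak = tracemalloc.get_traced_memory()
            samples.append((time.monotonic() - start, current))
            # Sleep for the interval, waking early if func() has finished.
            stop.wait(interval)

    tracemalloc.start()
    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    try:
        func()
    finally:
        stop.set()
        thread.join()
        tracemalloc.stop()
    return samples


# Inside a Django management command, usage might look like:
#   samples = sample_traced_memory(lambda: call_command('update_index'))
#   for elapsed, current in samples:
#       print("%.2fs: %.1f KiB" % (elapsed, current / 1024))
```

If the timeline stays flat while the server's memory climbs, that would point toward Solr rather than Python as the main contributor.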

reginafcompton commented 6 years ago

@jeancochrane - how do you feel about using the above Google doc for notes? It might be easier to consolidate and topically organize our findings into a memo-like document.

jeancochrane commented 6 years ago

@reginafcompton that makes sense! I missed the link for lack of reading closely 😅 Just updated the doc!