Metro-Records / la-metro-councilmatic

An instance of councilmatic for LA Metro

Solr process eventually runs out of memory #538

Closed: hancush closed this issue 8 months ago

hancush commented 4 years ago

Offshoot of #534, related to #535.

After running without issue for about seven months, the production Solr process ran out of the memory it needs to accept new updates. Restarting the process freed up enough memory to resolve the issue; however, this is only a temporary solution.

By default, Solr caps memory use ("heap size") at about half a gig (512 MB). The docs suggest that this will be insufficient for most production setups. We probably don't need the recommended 10 GB, but a middle ground may be more appropriate, especially given the size of our documents.

This thread contains some guidance on getting a handle on the memory consumption of Solr processes. This may help us determine a saner value.
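For anyone else poking at this, a quick way to spot-check the current heap ceiling and usage is Solr's system info endpoint. Rough sketch only; the host and port are placeholders, and it assumes the JSON response writer is available:

```python
# Quick spot check of the heap ceiling and current usage via /admin/info/system.
# Placeholder host/port; adjust to wherever production Solr is listening.
import json
from urllib.request import urlopen

info = json.load(urlopen("http://localhost:8983/solr/admin/info/system?wt=json"))
mem = info["jvm"]["memory"]["raw"]  # bytes: free, total, max, used
print(f"max heap: {mem['max'] / 1024 ** 2:.0f} MB, used: {mem['used'] / 1024 ** 2:.0f} MB")
```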

This post also looks like a good resource on why heap size grows and ways of addressing it.

hancush commented 4 years ago

This happened again. I'd like to escalate the priority of this issue.

hancush commented 4 years ago

Potentially related: https://github.com/datamade/django-councilmatic/issues/205

hancush commented 4 years ago

This article suggests that frequent updates require a bigger heap size. The staging Solr index is updated once per day. The production Solr index is updated every 15 minutes, or 96 times per day, fully a quarter of which reindex every bill in the database. That could be one reason we're seeing this on production, but not staging.

I monitored the production Solr instance while a full reindex was taking place. Heap use hovered between 40 and 60% of the allocated memory (half a gig). That doesn't seem high enough to run out of memory, so I wonder if there's a leak somewhere that gradually increases heap use. In that case, increasing heap size may only be a band-aid. I've increased heap size on a branch, but I'd like to hold off on merging and check on this once a week for a few weeks to get a handle on whether heap use is creeping up, or whether our errors come from more of a shock to the system.
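The weekly check I have in mind is roughly the following: grab a heap sample from the system info endpoint and append it to a CSV, so a slow creep shows up over time. Sketch only; the host and output path are placeholders:

```python
# Append a heap usage sample to a CSV so we can compare readings week over week.
# Placeholder host and output path; assumes the JSON system info endpoint.
import csv
import json
import time
from urllib.request import urlopen

SOLR = "http://localhost:8983/solr"  # placeholder

info = json.load(urlopen(f"{SOLR}/admin/info/system?wt=json"))
raw = info["jvm"]["memory"]["raw"]   # bytes: free, total, max, used

pct = 100 * raw["used"] / raw["max"]
with open("solr_heap_log.csv", "a", newline="") as f:
    csv.writer(f).writerow([
        time.strftime("%Y-%m-%d %H:%M:%S"),
        raw["used"],
        raw["max"],
        round(pct, 1),
    ])

print(f"{raw['used'] / 1024 ** 2:.0f} of {raw['max'] / 1024 ** 2:.0f} MB in use ({pct:.0f}%)")
```

Run from cron once a week (or more often during a full reindex), that should make it obvious whether we're looking at a slow leak or a one-off spike.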

fgregg commented 4 years ago

this does sound like a memory leak. the first thing I would try in this case is to upgrade solr.

fgregg commented 4 years ago

i think your monitoring plan is also good.

hancush commented 4 years ago

This happened again after three weeks.

hancush commented 4 years ago

Yikes, this happened again on the new server. I'd like to escalate this issue in the next month or two.

hancush commented 4 years ago

Woah! This blog post is very, very helpful in tuning memory needs for Solr. In particular, it offers an explanation for how Solr uses memory. Most notably:

As you can see, a large portion of heap memory is used by multiple caches... The maximum memory a cache uses is controlled by individual cache size configured in solrconfig.xml.

So, a compelling explanation for why production index updates eventually fail is that Solr's various caches grow large enough that there is no longer sufficient heap space to apply updates. This would also explain why restarting Solr frees up space. And since the staging site is nowhere near as heavily used as the production site, it would explain why we don't see this on staging.

I think solving this will take a combination of limiting the max size of the caches and perhaps giving the production Solr instance a bit more memory to work with. Will continue reading and update this thread.
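As a first step, something like this could confirm which cache ceilings we're actually running with before we change anything. The path to solrconfig.xml is a placeholder:

```python
# Report the cache sizes configured in solrconfig.xml (the <query> section holds
# filterCache, queryResultCache, and documentCache).
# Placeholder path; point it at the solrconfig.xml for our core.
import xml.etree.ElementTree as ET

SOLRCONFIG = "solr_configs/conf/solrconfig.xml"  # placeholder path

query = ET.parse(SOLRCONFIG).getroot().find("query")
for name in ("filterCache", "queryResultCache", "documentCache"):
    node = query.find(name) if query is not None else None
    if node is None:
        print(f"{name}: not set (Solr's built-in default applies)")
    else:
        print(f"{name}: size={node.get('size')}, autowarmCount={node.get('autowarmCount')}")
```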

hancush commented 4 years ago

You can view stats on the various caches in the Solr admin by selecting your core in the lefthand menu, then navigating to Plugins / Stats > Cache.

The big one for us is the document cache, which is at its max size of 512 entries. With 2514 docs totaling 652.8 MB, we can estimate that each of our documents weighs about 0.25 MB. That means our document cache is around 128 MB, or a quarter of our available heap space.
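Spelling the estimate out:

```python
# Back-of-envelope estimate of the document cache's footprint, using the figures
# reported in the admin UI above.
index_size_mb = 652.8      # total size of the index
num_docs = 2514
cache_max_entries = 512    # documentCache max size from Plugins / Stats > Cache

avg_doc_mb = index_size_mb / num_docs                # ~0.26 MB per document
cache_estimate_mb = avg_doc_mb * cache_max_entries   # ~130 MB, roughly a quarter of a 512 MB heap
print(f"~{avg_doc_mb:.2f} MB per doc, documentCache ≈ {cache_estimate_mb:.0f} MB when full")
```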

There are some items in the query and filter caches as well, but neither is close to full. According to this article, those are the ones that can potentially get quite big. I could spend a lot of time spelunking further here, but I think we'd see diminishing returns on the time spent relative to the added precision.
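For the record, the same cache stats can also be pulled from the mbeans endpoint instead of clicking through the admin UI. Sketch with a placeholder core URL; the exact stat key names vary between Solr versions:

```python
# Fetch live cache stats (size, evictions, hit ratio) from the core's mbeans handler.
# Placeholder core URL; adjust to the production core.
import json
from urllib.request import urlopen

CORE = "http://localhost:8983/solr/lametro"  # placeholder core name

url = f"{CORE}/admin/mbeans?stats=true&cat=CACHE&wt=json"
payload = json.load(urlopen(url))

# solr-mbeans is a flat list alternating category names and their entries.
mbeans = payload["solr-mbeans"]
caches = mbeans[mbeans.index("CACHE") + 1]

for name, entry in caches.items():
    stats = entry.get("stats", {})
    interesting = {k: v for k, v in stats.items()
                   if any(s in k.lower() for s in ("size", "evictions", "hitratio"))}
    print(name, interesting)
```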

I'm going to bump the production Solr heap size up to 1 GB (double the current allocation), keep monitoring, and update this thread.

jeancochrane commented 4 years ago

Excellent research so far! Three questions:

  1. Do you have a sense of what causes the cache to expand? Is the cache populated by index updates, or only by direct queries?
  2. Is there a way for us to automatically expire the cache on a schedule, as we do with Django?
  3. Are there any opportunities for us to add automated monitoring and alerting for heap usage? It seems like Solr offers metrics reporting and a logging API; is there a way we could perhaps hook these into Sentry?

hancush commented 4 years ago

Thank you for these excellent prompts, @jeancochrane! I've increased Solr's memory in production, and I'll keep an eye on this issue. If we wind up needing further work, I'll start with these questions.

antidipyramid commented 8 months ago

Closing since we're on Elasticsearch now 🙂