freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com
Other

Revamp the Solr commit strategy #581

Closed mlissner closed 8 years ago

mlissner commented 8 years ago

Currently we do commits as follows:

This is bad because every 10 minutes we have a delay, especially if we're indexing a lot of content, even though we're doing automatic commits every 15 seconds.

(I haven't been able to pin down why this is happening, though. I'd expect the automatic hard commits to mean that the cron job's hard commits do almost nothing, but I've definitely noticed issues at 10-minute increments past each hour. Update: This happens because commits with openSearcher=true reload caches. Reloading a cache reloads the external file field, and that takes time.)

A much better way to do this will be:

This will make a lot of things faster since we'll be doing fewer commits total (every 30s instead of every 15s), will make results available on average in 15 seconds, instead of 5 minutes, and will simplify our commit strategy by stopping scripts from making commits. It should also remove the delays that are happening every ten minutes.
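The numbers above can be checked with a little arithmetic. This is a minimal sketch, not part of the actual change: it assumes documents arrive uniformly at random between the commits that open a new searcher, so the expected wait is half the commit interval. It also assumes the "5 minutes" figure comes from the 10-minute cron commits. The function names are made up for illustration.

```python
def mean_visibility_delay(commit_interval_s: float) -> float:
    """Expected seconds until a new document is searchable.

    Assumes documents arrive uniformly at random between
    searcher-opening commits, so the mean wait is half the interval.
    """
    return commit_interval_s / 2


def commits_per_hour(commit_interval_s: float) -> float:
    """Number of commits issued per hour at a fixed interval."""
    return 3600 / commit_interval_s


if __name__ == "__main__":
    # Old setup (assumed): results become visible at the 10-minute cron commit.
    print(mean_visibility_delay(600))  # 300.0 seconds, i.e. 5 minutes
    # New setup: a searcher-opening commit every 30 seconds.
    print(mean_visibility_delay(30))   # 15.0 seconds
    # Commit volume: every 15s before, every 30s after.
    print(commits_per_hour(15), commits_per_hour(30))  # 240.0 120.0
```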

Should have researched this sooner, but, well, Solr is hard to get right, and this took quite a while to figure out.

mlissner commented 8 years ago

This is hands down the best documentation

mlissner commented 8 years ago

Fixed in 9f250337e2a3559792bf5d5032400ddb5c070342

mlissner commented 8 years ago

So...I had some misunderstandings and problems in my original plan:

  1. Soft commits can't be so frequent, because they clear caches, reload external file fields (which we have for pagerank), and open new searchers. I'm tweaking this value up to 1 minute for a start.
  2. Reloading the external file field is a big deal, and we need to do that in the background (or as infrequently as possible). The best way to do this is to make sure that it is reloaded during autowarming. That can be accomplished in two ways. First, you can set autowarmCount for the queryResultCache and filterCache to higher values. The default is currently 0, so if we set these to 16, that will run the top 16 queries in the cache as part of autowarming before making the new searcher live. That will probably hit the external file field and force it to reload, depending on the queries.

    The second (and more complex, but reliable) solution is to explicitly set an autowarming query in the newSearcher listener that uses the external file field. Doing this forces it to reload during autowarming, and makes life good again.

I'm going to do all of the above, but the immediate problem (too many commits) needs to be fixed urgently, so I'm going to start there.
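For reference, here is a rough sketch of what the three settings discussed in this comment might look like in solrconfig.xml, wrapped in a small Python check that the fragment is well-formed and carries the intended values. Everything specific in it is an assumption for illustration, not the configuration actually committed: the cache classes and sizes, the field name `pagerank`, and the warming query that sorts on `field(pagerank)` so that the external file field is touched during autowarming.

```python
import xml.etree.ElementTree as ET

# Hypothetical solrconfig.xml fragment. Cache classes, sizes, and the
# "pagerank" field name are assumptions, not the project's real config.
SOLRCONFIG_FRAGMENT = """
<config>
  <updateHandler class="solr.DirectUpdateHandler2">
    <!-- Soft commit (opens a new searcher) at most once a minute. -->
    <autoSoftCommit>
      <maxTime>60000</maxTime>
    </autoSoftCommit>
  </updateHandler>
  <query>
    <!-- Approach 1: replay the top 16 cache entries while warming. -->
    <filterCache class="solr.FastLRUCache" size="512"
                 initialSize="512" autowarmCount="16"/>
    <queryResultCache class="solr.LRUCache" size="512"
                      initialSize="512" autowarmCount="16"/>
    <!-- Approach 2: an explicit warming query that uses the external
         file field, so it is loaded before the searcher goes live. -->
    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">*:*</str>
          <str name="sort">field(pagerank) desc</str>
        </lst>
      </arr>
    </listener>
  </query>
</config>
"""


def summarize(fragment: str):
    """Parse the fragment and return (soft commit ms, autowarm counts, warming queries)."""
    root = ET.fromstring(fragment.strip())
    soft_commit_ms = int(root.findtext("updateHandler/autoSoftCommit/maxTime"))
    # Collect autowarmCount for every cache element that declares one.
    autowarm = {
        child.tag: int(child.get("autowarmCount"))
        for child in root.find("query")
        if child.get("autowarmCount") is not None
    }
    listener = root.find("query/listener[@event='newSearcher']")
    warming = [
        {s.get("name"): s.text for s in lst.findall("str")}
        for lst in listener.findall("arr[@name='queries']/lst")
    ]
    return soft_commit_ms, autowarm, warming


if __name__ == "__main__":
    print(summarize(SOLRCONFIG_FRAGMENT))
```

The explicit listener is the part that does not depend on which queries happen to be cached, which is why it is the more reliable of the two approaches.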