Solr: Paths forward

hancush commented 4 years ago

Background

We currently recommend Solr for advanced search implementations. There's a lot to like about it: It's powerful, flexible, and infinitely configurable.

But there's also a catch. Collectively, we understand relatively little about Java and Solr's internals, which makes debugging difficult and, sometimes, quite scary. As we transition from EC2 to Heroku, there is the added downside that the most basic Solr instance costs $20/month, with production instances starting at $60/month. This is a huge increase from hosting our own instances on EC2.

Proposal

I'd like to learn more about how Solr works and expand our documentation to include key concepts and advice for troubleshooting and tuning Solr instances. I'd especially like to focus on settings that will allow us to operate within the constraints of Websolr. For example, the production Councilmatic Solr index uses almost 8GB of storage because we store huge text fields in the index -- the equivalent Websolr instance would cost $299/month.

Deliverables

Use Metro councilmatic to experiment with stored/indexed fields, with the goal of reducing the index size to < 1 GB (the limit for a small Websolr index).
Revise Solr setup documentation to impart key concepts and emphasize DIY, rather than rote copy/pasting.
Issue strong guidance on Haystack index setup, e.g., when to index/store a value.
Add section on tuning and troubleshooting, especially Java-specific issues, like heap size.

Timeline

1-2 days

jeancochrane commented 4 years ago

Another question I’m curious to explore is: how feasible would it be to manage our own Solr instance via a dedicated dyno in a Heroku app? That approach would be more similar to what we’re currently doing, and while we would obviously lose some nice features of Websolr like the dashboard and the secondary hot spare instance, it might be a lower-cost alternative for apps that require really large indexes.

This is a reasonably big question and isn't directly related to the goal of learning more about how Solr works, so we may want to open up a separate issue to explore it.

hancush commented 4 years ago

Also: How to enable SSL.

hancush commented 4 years ago

FWIW, I misinterpreted the Solr admin. The Metro index is only about 350 MB, or 0.3 GB, in size. Achievement unlocked, I guess!

hancush commented 4 years ago

FWIW, that's eligible for the $59/month plan: https://elements.heroku.com/addons/websolr

hancush commented 4 years ago

Chicago is about twice that size, with 540k docs, weighing in around 670 MB, or 0.6 GB. Still eligible for the $59/month tier, though closer to the upper limits (954 MB, 1m documents).
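For anyone repeating this check on another index, here is a minimal sketch of the arithmetic. The limits are the ones quoted above (954 MB, 1m documents); the function names are made up for illustration, and the sizes are the rough figures from this thread, not fresh measurements.

```python
# Limits quoted in this thread for the $59/month tier.
MAX_INDEX_MB = 954
MAX_DOCS = 1_000_000


def tier_usage(index_mb, num_docs=None):
    """Return the fraction of each known limit that an index uses."""
    usage = {"storage": index_mb / MAX_INDEX_MB}
    if num_docs is not None:
        usage["documents"] = num_docs / MAX_DOCS
    return usage


def fits_tier(index_mb, num_docs=None):
    """True when every measurement we have is within its limit."""
    return all(fraction <= 1 for fraction in tier_usage(index_mb, num_docs).values())


metro = tier_usage(350)             # document count not given in the thread
chicago = tier_usage(670, 540_000)

print(f"Metro storage: {metro['storage']:.0%}")
print(f"Chicago storage: {chicago['storage']:.0%}, documents: {chicago['documents']:.0%}")
print("8 GB index fits:", fits_tier(8000))
```

By this measure Chicago sits at roughly 70% of the storage limit and 54% of the document limit, which is the "closer to the upper limits" caveat in numbers.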

jeancochrane commented 4 years ago

Just want to call out that there's some excellent learning in this thread on how Solr uses memory: https://github.com/datamade/la-metro-councilmatic/issues/538

hancush commented 3 years ago

Been running into a lot of issues with handling special characters, so I'm going to spend some time this week on that (which may bleed a bit into Haystack best practices, as well).

hancush commented 3 years ago

The specific special character I had trouble with was the single quote, '.

I considered using the PatternReplaceCharFilter factory to replace or remove single quotes and handle the entire class of errors, as suggested in this thread, but I actually don't want that – I want the particular word "it's" to be protected during parsing, but other contractions to be parsed as normal. It also did not address my issue, because the result, "its", was stemmed to "it" by a subsequent filter.

So what I ended up doing was adding "it's" to protwords.txt, as suggested in this thread.
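To make the idea concrete, here is a toy model of the two approaches in Python. This is not Solr's analysis chain: `analyze` and `toy_stem` are hypothetical stand-ins, and the stemmer is deliberately crude. It only illustrates why stripping quotes still ends in "it", and what a protected-words list is meant to do (in Solr, that list is typically read by a keyword marker filter placed before the stemmer).

```python
def toy_stem(token):
    """Crude stand-in for a stemmer: drop a possessive 's or a trailing s."""
    if token.endswith("'s"):
        return token[:-2]
    if token.endswith("s") and len(token) > 2:
        return token[:-1]
    return token


def analyze(text, protected=frozenset(), strip_quotes=False):
    """Toy chain: optional char filter, whitespace tokenizer, then stemmer."""
    if strip_quotes:
        # Stand-in for a char filter that removes single quotes up front.
        text = text.replace("'", "")
    tokens = text.lower().split()
    # Protected tokens skip the stemmer; everything else is stemmed.
    return [t if t in protected else toy_stem(t) for t in tokens]


# Removing quotes first: "it's" becomes "its", which the stemmer cuts to "it".
print(analyze("it's", strip_quotes=True))

# Protecting the word keeps it intact, while other tokens are handled as usual.
print(analyze("it's metro's don't", protected={"it's"}))
```

A real analyzer has more moving parts (tokenizer behavior, filter order, query-time versus index-time analysis), so treat this as a picture of the intent rather than a prediction of how Solr will behave.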

Will report back on performance...

hancush commented 3 years ago

^ This didn't really work as I wanted it to, which is too bad. Back to the drawing board on single quotes, will update.

In the interim, I did learn a cool trick when investigating how to force LA Metro Councilmatic search results with identifiers matching the query to appear first: You can use the edismax parser to boost results where an arbitrary field matches an arbitrary value.

Solr provides a lot of ways to boost results, actually, but this one was the most precise. More discussion here: https://medium.com/@pablocastelnovo/if-they-match-i-want-them-to-be-always-first-boosting-documents-in-apache-solr-with-the-boost-362abd36476c

See the change set here: https://github.com/datamade/la-metro-councilmatic/pull/667

In the process of testing this, I also learned how to tell Solr to include the score in results viewed via the Solr admin interface: add *,score to the fl field.
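A rough sketch of what such a request could look like, built with only the Python standard library. Assumptions: the core URL, the `identifier` field name, the example value, and the boost factor are all placeholders, and `bq` (boost query) is just one of the edismax boosting parameters; this does not claim to mirror the linked change set.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, boost_field, boost_value, boost=10):
    """Build a Solr /select URL that boosts exact matches on one field."""
    # Escape backslashes and double quotes so the value stays a single phrase.
    escaped = boost_value.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": query,
        "defType": "edismax",                        # use the edismax parser
        "bq": f'{boost_field}:"{escaped}"^{boost}',  # boost docs matching this clause
        "fl": "*,score",                             # all stored fields plus the score
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core name, field name, and identifier value.
url = build_select_url(
    "http://localhost:8983/solr/my_core",
    query="2020-0123",
    boost_field="identifier",
    boost_value="2020-0123",
)
print(url)
```

Requesting `*,score` here is the same trick as typing it into the fl field in the admin interface: it makes it easy to confirm that the boosted documents really do score higher.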

hancush commented 3 years ago

Ran into a major blocker for the Websolr Heroku addon:

"As far as the version you are currently using, if having Solr 6.x is mission critical then you would have to be on a Business plan or higher. Our lower tier plans are on Solr 4.10.x."

So, if we need functionality that is not available in Solr 4.10.x (a big one is date ranges, which rules this option out for SFM), at minimum, we'd have to use Business Small, which is $549/month. IMO, that's cost prohibitive.

Also, configuration must be done manually, and AFAICT, there is not a way to access Solr logs directly. I think we should look into another hosted Solr solution, or roll our own. It's a shame, because that means we're losing the ability to provision a Solr instance for new Heroku apps automatically. If we roll our own, I'd definitely be interested in a way to trigger the creation and configuration of a Solr instance with a Heroku deployment – as well as direct access to logs!!!

For me, though, this is a big strike against implementing Solr for new apps, especially given better supported add-ons (and Haystack compatibility) for Elasticsearch and our relative lack of expertise in Solr across the team.

derekeder commented 1 year ago

with #194 done and moving on with documentation for elasticsearch in #301, I think we can close this as we won't be using Solr with any future projects.

sound right @hancush @fgregg?

datamade / how-to

Solr: Paths forward #70
