hancush closed this issue 1 year ago
Another question I’m curious to explore is: how feasible would it be to manage our own Solr instance via a dedicated dyno in a Heroku app? That approach would be more similar to what we’re currently doing, and while we would obviously lose some nice features of Websolr like the dashboard and the secondary hot spare instance, it might be a lower-cost alternative for apps that require really large indexes.
This is a reasonably big question and isn't directly related to the goal of learning more about how Solr works, so we may want to open up a separate issue to explore it.
Also: How to enable SSL.
FWIW, I misinterpreted the Solr admin. The Metro index is only about 350 MB, or 0.3 GB, in size. Achievement unlocked, I guess!
FWIW, that's eligible for the $59/month plan: https://elements.heroku.com/addons/websolr
Chicago is about twice that size, with 540k docs, weighing in around 670 MB, or 0.6 GB. Still eligible for the $59/month tier, though closer to the upper limits (954 MB, 1m documents).
Just want to call out that there's some excellent learning in this thread on how Solr uses memory: https://github.com/datamade/la-metro-councilmatic/issues/538
Been running into a lot of issues with handling special characters, so I'm going to spend some time this week on that (which may bleed a bit into Haystack best practices, as well).
The specific special character I had trouble with was the single quote, `'`.
I considered using the `PatternReplaceCharFilterFactory` to replace or remove single quotes and handle the entire class of errors, as suggested in this thread, but I actually don't want that – I want the particular word, `it's`, to be protected during parsing, while other contractions are parsed as normal. It also did not address my issue, because the result, `its`, was stemmed to `it` by a subsequent filter.
So what I ended up doing was adding `it's` to `protwords.txt`, as suggested in this thread.
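For anyone following along, `protwords.txt` is a plain list of one term per line, and it takes effect via a `KeywordMarkerFilterFactory` placed before the stemmer in the analyzer chain. A sketch (the surrounding filters are illustrative):

```xml
<!-- schema.xml analyzer chain: tokens listed in protwords.txt are marked
     as keywords, so the stemmer that follows leaves them untouched -->
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
```

With `it's` as a line in `protwords.txt`, the stemmer should pass that token through unchanged.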
Will report back on performance...
^ This didn't really work as I wanted it to, which is too bad. Back to the drawing board on single quotes, will update.
In the interim, I did learn a cool trick when investigating how to force LA Metro Councilmatic search results with identifiers matching the query to appear first: you can use the `edismax` parser to boost results where an arbitrary field matches an arbitrary value.
Solr provides a lot of ways to boost results, actually, but this one was the most precise. More discussion here: https://medium.com/@pablocastelnovo/if-they-match-i-want-them-to-be-always-first-boosting-documents-in-apache-solr-with-the-boost-362abd36476c
See the change set here: https://github.com/datamade/la-metro-councilmatic/pull/667
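As a sketch of the technique (the field name, query, and boost value here are illustrative, not the actual change set), the boost query parameter `bq` adds a large score bonus to documents whose field exactly matches the value:

```
defType=edismax
q=2017-0643
qf=text
bq=identifier:"2017-0643"^100
```

Documents matching the `bq` clause get their score boosted, so exact identifier matches float to the top while everything else still ranks normally.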
In the process of testing this, I also learned how to tell Solr to include the relevance score in results viewed via the Solr admin interface: add `*,score` to the `fl` (field list) parameter.
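For example (the host and core name here are made up), a select query like this returns every stored field plus each document's score:

```
http://localhost:8983/solr/my_core/select?q=metro&fl=*,score
```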
Ran into a major blocker for the Websolr Heroku addon:
> As far as the version you are currently using, if having Solr 6.x is mission critical then you would have to be on a Business plan or higher. Our lower tier plans are on Solr 4.10.x.
So, if we need functionality that is not available in Solr 4.10.x (a big one is date ranges, which rules this option out for SFM), at minimum, we'd have to use Business Small, which is $549/month. IMO, that's cost prohibitive.
Also, configuration must be done manually, and AFAICT, there is not a way to access Solr logs directly. I think we should look into another hosted Solr solution, or roll our own. It's a shame, because that means we're losing the ability to provision a Solr instance for new Heroku apps automatically. If we roll our own, I'd definitely be interested in a way to trigger the creation and configuration of a Solr instance with a Heroku deployment – as well as direct access to logs!!!
For me, though, this is a big strike against implementing Solr for new apps, especially given better supported add-ons (and Haystack compatibility) for Elasticsearch and our relative lack of expertise in Solr across the team.
With #194 done, and documentation for Elasticsearch moving forward in #301, I think we can close this, as we won't be using Solr in any future projects.
Sound right, @hancush @fgregg?
Background
We currently recommend Solr for advanced search implementations. There's a lot to like about it: It's powerful, flexible, and infinitely configurable.
But there's also a catch. Collectively, we understand relatively little about Java and Solr's internals, which makes debugging difficult and, sometimes, quite scary. As we transition from EC2 to Heroku, there is the added downside that the most basic Websolr instance costs $20/month, with production instances starting at $60/month. That's a huge increase over hosting our own instances on EC2.
Proposal
I'd like to learn more about how Solr works and expand our documentation to include key concepts and advice for troubleshooting and tuning Solr instances. I'd especially like to focus on settings that will allow us to operate within the constraints of Websolr. For example, the production Councilmatic Solr index uses almost 8 GB of storage because we store huge text fields in the index -- the equivalent Websolr instance would cost $299/month.
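One example of the kind of tuning I have in mind (a sketch; the field name is illustrative): large text fields that we only need for matching, not for retrieval, can be indexed but not stored, which keeps their full contents out of the stored data and can shrink the index substantially.

```xml
<!-- schema.xml: make the field searchable, but don't store its full
     contents for retrieval in results -->
<field name="full_text" type="text_en" indexed="true" stored="false"/>
```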
Deliverables
Timeline
1-2 days