Investigate solr performance, diagnose bottlenecks and develop optimizations

landreev commented 1 year ago

This primarily concerns busy instances with large amounts of indexed metadata, such as the IQSS prod. installation. There's some evidence of Solr becoming the performance bottleneck during high service load. Facet queries are especially suspect. But this has the potential to benefit any busy instance, hence the issue in the main repo.

@kcondon got the ball rolling on this and has already found and applied some useful info. He has started a google doc documenting his work. We may link it here, and we will try to document the most useful things we find here (and in the guides, eventually). Kevin's out today, but we had a conversation and decided I'll go ahead and open the issue to make sure this work is scheduled and tracked; striking while the iron is hot, etc.

landreev commented 1 year ago

Current plan of investigation (much simplified; per slack discussion):

Adjust heap (following the explanation in the guide that solr relies heavily on system memory outside its own allocated heap; in progress: the heap has been reduced by 1/3 from what it was in prod., will keep experimenting).
Slow query logging has been enabled; need to study the collected data carefully (specifically around the times of service restarts, correlation with request rate, etc.)
Look specifically into facet queries and confirm whether they make up the bulk of the "slow" list; look for patterns in performance degradation (some queries simply being expensive vs. queries that are not problematic during normal load becoming slow during peak load).
Experiment with adjustments/optimizations (currently at the top of the list: facet threads settings; facet json api for performance(?); circuit breaker setting; defensive code on the collection page for facets)
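As a rough illustration of the slow-query analysis step, here is a minimal sketch that splits slow requests into facet and non-facet ones. It assumes log lines in the usual Solr request-log shape (`params={...} ... QTime=N`). The function name, the 1000 ms threshold and the sample lines are made up for illustration and are not taken from the prod. logs. It only recognizes classic `facet=true` requests, not JSON Facet API ones.

```python
import re
from collections import Counter
from urllib.parse import parse_qs

# Assumed line shape: "... params={q=*&facet=true} hits=10 status=0 QTime=1234".
# A "}" inside a parameter value would cut the params group short; good enough for a sketch.
LINE_RE = re.compile(r"params=\{(?P<params>.*?)\}.*?\bQTime=(?P<qtime>\d+)")


def summarize_slow_queries(lines, threshold_ms=1000):
    """Count requests at or above threshold_ms, split by whether faceting was on."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a request log line
        if int(match.group("qtime")) < threshold_ms:
            continue  # fast enough, ignore
        params = parse_qs(match.group("params"))
        facet_flag = params.get("facet", ["false"])[-1].lower()
        is_facet = facet_flag in ("true", "on")
        counts["facet" if is_facet else "non_facet"] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[collection1] webapp=/solr path=/select params={q=*&facet=true} hits=9 status=0 QTime=2500",
        "[collection1] webapp=/solr path=/select params={q=title:foo} hits=3 status=0 QTime=1800",
    ]
    print(summarize_slow_queries(sample))
```

Running the same summary separately over peak and off-peak time windows would help tell apart the queries that are always expensive from the ones that only degrade under load.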

(more to be added; work in progress)

cmbz commented 1 year ago

Spike issue to investigate Solr performance issues and identify specific actions to take. An issue will be created for each action that is identified. (As per 2023/07/30 prioritization meeting).
Sized as 10 during conversation with @landreev.
Moved to Sprint Ready

luddaniel commented 1 year ago

Related issue #8941 :)

landreev commented 1 year ago

There's another low-hanging-fruit improvement that can be handled as a compact, self-contained issue: inside SearchIncludeFragment (the class that runs the search for the collection page, i.e., where most searches issued by Dataverse originate) every search is literally run twice, in order to obtain the result counts for the object types NOT currently selected by the user. While the second search is necessary in order to obtain these count(s), it can be run in a more economical way, which could reduce the amount of work for solr to perform by up to 50%. The details are spelled out in a #dv-tech thread and can be copy-and-pasted into an issue once we open it.
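One way the cheaper second search could look, sketched as plain Solr request parameters rather than the actual SearchIncludeFragment code. The field name `dvObjectType`, the three type values and the helper name are assumptions made for illustration; the #dv-tech thread may describe a different approach. The idea is to keep the main query and the non-type filters, return no documents (`rows=0`) and facet on the type field only.

```python
TYPE_FIELD = "dvObjectType"  # assumed name of the indexed object-type field
ALL_TYPES = ("dataverses", "datasets", "files")  # assumed type values


def build_count_only_params(primary_params, selected_types):
    """Build params for the type-count query, or return None if it can be skipped."""
    unselected = [t for t in ALL_TYPES if t not in selected_types]
    if not unselected:
        return None  # every type is checked: the primary search already has all counts

    # Keep the user's other filters, but replace the type filter with the unselected types.
    type_prefix = TYPE_FIELD + ":"
    filters = [fq for fq in primary_params.get("fq", []) if not fq.startswith(type_prefix)]
    filters.append(TYPE_FIELD + ":(" + " OR ".join(unselected) + ")")

    return {
        "q": primary_params["q"],
        "fq": filters,
        "rows": 0,  # counts only, no documents, no highlighting
        "facet": "true",
        "facet.field": [TYPE_FIELD],  # the only facet this query needs
        "facet.mincount": 1,
    }
```

The expensive metadata facets from the primary request are deliberately not copied over, since only one or two totals are needed from this query.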

landreev commented 1 year ago

@luddaniel I just looked at #8941. If it can be addressed by simply indexing fileCount as another numeric field during indexing, that does sound very straightforward, and I don't know why we haven't implemented it yet/have overlooked it. Generally, we absolutely need to reduce the number of SQL lookups that are required when we process lists of search results as they come from Solr. Hopefully we can address it soon.

landreev commented 11 months ago

We had a series of prod. auto-restarts again yesterday. It is now confirmed that these are caused by solr ceasing to respond. A band-aid, but hopefully effective, temp. workaround was applied on the solr side: setting up the "circuit breaker" config options. This makes solr start dropping new connections with 503 when a certain threshold of memory and/or cpu utilization is reached, rather than hanging. (The observed search page freezes were accompanied by cpu utilization approaching 800% on dvn-cloud-solr, which is an 8-cpu node.) Thanks/kudos to @kcondon for researching/figuring out the above.

kcondon commented 11 months ago

Also noticed that clicking on basic search with no parameters spikes solr cpu. Clicking multiple times raises cpu use rapidly to where I can simulate the 700+% cpu load on solr. Refreshing the homepage or clicking on facets does not do this beyond an initial spike, then levels off. I was able to trigger the cpu circuit breaker, I think, using this method.

landreev commented 11 months ago

I'm implementing the following application-side improvements, and hoping to make a PR soon. (More may be added to the list.)

Adding app-side support for the Solr-side circuit breakers. Enabling those makes Solr start dropping requests with 503s (instead of dying) under heavy load. As currently implemented however, the page does not detect that condition and doesn't handle it in any special way, and simply shows zero search results/empty collection page. It needs to show something intelligent, along the lines of "search engine is experiencing heavy load, please try again later".
Optimizing the second search that the page repeats solely for the purpose of populating the type counts facets for the unchecked types. There is absolutely no need to repeat the same exact search, with all its super expensive parts, like all the other facets. There's no need to search for anything other than these one or two total counts; or to even run this second search, if all 3 type facets happen to be checked.
Adding nofollow to all the facet links and such, that we don't want crawlers/bots to touch. This will obviously only help with the nicer bots that take hints and follow rules, but better than nothing.
An off switch for facets support on the dataverse page. Again, facets are super expensive; it should be far better to be able to display the page with just the search cards and "facets are temporarily unavailable because of load issues" in the left column, than to have an empty page/have a "solr down" error page/have the page load take an hour.
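To make the first and last items concrete, here is a minimal sketch of the page-side logic, written in Python for brevity even though the real page code is Java. Everything named here (`SearchPageState`, `load_search_page`, the `query_solr` callable and the response keys `docs` and `facets`) is hypothetical. The only assumptions taken from the discussion above are that the circuit breaker surfaces as an HTTP 503 and that facets can be switched off per request.

```python
from dataclasses import dataclass, field
from urllib.error import HTTPError

OVERLOAD_MESSAGE = "The search engine is experiencing heavy load, please try again later."
FACETS_OFF_MESSAGE = "Facets are temporarily unavailable because of load issues."


@dataclass
class SearchPageState:
    results: list = field(default_factory=list)
    facets: dict = field(default_factory=dict)
    message: str = ""


def load_search_page(query_solr, params, facets_enabled=True):
    """Run one search and translate overload conditions into a displayable state.

    query_solr is a hypothetical callable that sends params to Solr and returns
    a dict, raising urllib.error.HTTPError on a non-2xx response.
    """
    request = dict(params)
    request["facet"] = "true" if facets_enabled else "false"  # the off switch
    try:
        response = query_solr(request)
    except HTTPError as err:
        if err.code == 503:
            # Circuit breaker tripped: say so instead of showing an empty page.
            return SearchPageState(message=OVERLOAD_MESSAGE)
        raise  # anything else is a real error and should not be masked

    state = SearchPageState(results=response.get("docs", []))
    if facets_enabled:
        state.facets = response.get("facets", {})
    else:
        state.message = FACETS_OFF_MESSAGE
    return state
```

The point of returning a distinct state is that "zero results" and "solr refused the request" stop looking identical to the user.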

IQSS / dataverse

Investigate solr performance, diagnose bottlenecks and develop optimizations #9635