Closed by landreev 9 months ago
Current plan of investigation (much simplified; per slack discussion):
(more to be added; work in progress)
Related issue #8941 :)
There's another low-hanging-fruit improvement that can be handled as a compact, self-contained issue: inside SearchIncludeFragment (the class that runs the search for the collection page, i.e., where most searches issued by Dataverse originate) every search is literally run twice, in order to obtain the result counts for the object types NOT currently selected by the user. While the second search is necessary in order to obtain these counts, it can be run in a more economical way that can reduce the amount of work Solr has to perform by up to 50%. The details are spelled out in a #dv-tech thread and can be copy-and-pasted into an issue once we open it.
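The #dv-tech thread isn't reproduced here, so the following is only a rough sketch of one way a "counts only" companion query could be made cheaper. It is an assumption, not necessarily what the thread proposes: ask Solr for zero rows and facet only on the object type field. `rows`, `facet` and `facet.field` are standard Solr request parameters; the class and helper names are hypothetical, and the type field name is passed in rather than assumed.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch; not the actual SearchIncludeFragment code.
final class TypeCountQuerySketch {

    // Builds parameters for a query that returns no documents (rows=0)
    // and only facet counts for the given object type field.
    static Map<String, String> countOnlyParams(String userQuery, String typeField) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", userQuery);
        params.put("rows", "0");              // counts only, no result documents
        params.put("facet", "true");
        params.put("facet.field", typeField); // assumed: one facet, on the type field
        return params;
    }

    // Encodes the parameters as a URL query string, preserving insertion order.
    static String toQueryString(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }
}
```

For example, `TypeCountQuerySketch.toQueryString(TypeCountQuerySketch.countOnlyParams("*:*", "dvObjectType"))` builds the parameter string for such a request; whether `dvObjectType` is the right field name, and which filter queries need to carry over from the primary search, would have to be checked against the real code.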
@luddaniel I just looked at #8941. If it can be addressed by simply indexing fileCount as another numeric field during indexing, that does sound very straightforward, and I don't know why we haven't implemented it yet/have overlooked it. Generally, we absolutely need to reduce the number of SQL lookups that are required when we process lists of search results as they come from Solr. Hopefully we can address it soon.
We had a series of prod. auto-restarts again yesterday. It is now confirmed that this is caused by Solr becoming unresponsive. A band-aid, but hopefully an effective temp. workaround, was applied on the Solr side: setting up the "circuit breaker" config options. This makes Solr start dropping new connections with a 503 when a certain threshold of memory and/or CPU utilization is reached, rather than hanging. (The observed search page freezes were accompanied by CPU utilization approaching 800% on the 8-CPU node that is dvn-cloud-solr.) Thanks/kudos to @kcondon for researching/figuring out the above.
Also noticed that clicking on basic search with no parameters spikes solr cpu. Clicking multiple times raises cpu use rapidly to where I can simulate the 700+% cpu load on solr. Refreshing the homepage or clicking on facets does not do this beyond an initial spike, then levels off. I was able to trigger the cpu circuit breaker, I think, using this method.
I'm implementing the following application-side improvements, and hoping to make a PR soon (more may be added to the list):
Adding `nofollow` to all the facet links and such that we don't want crawlers/bots to touch. This will obviously only help with the nicer bots that take hints and follow rules, but it's better than nothing.
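To illustrate the kind of markup change involved, here is a minimal sketch of a link rendered with `rel="nofollow"`. The class and method names are hypothetical, and the real facet links are generated by the page templates rather than by string concatenation like this; the only point is the `rel` attribute on the anchor.

```java
// Hypothetical sketch; the real facet links live in the page templates.
final class NoFollowLinkSketch {

    // Minimal HTML escaping for attribute values and text content.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Renders an anchor that asks well-behaved crawlers not to follow it.
    static String facetLink(String href, String label) {
        return "<a href=\"" + escapeHtml(href) + "\" rel=\"nofollow\">"
                + escapeHtml(label) + "</a>";
    }
}
```

Since `nofollow` is only a hint, bots that ignore it would still need to be handled by other means.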
This primarily concerns busy instances with large amounts of indexed metadata, such as the IQSS prod. installation. There's some evidence of Solr becoming the performance bottleneck under high service load, and facet queries are especially suspect. But this has the potential to benefit any busy instance, hence the issue in the main repo.
@kcondon got the ball rolling on this and has already found and applied some useful info. He has started a Google doc documenting his work. We may link it here, and we will try to document the most useful things we find here (and in the guides, eventually). Kevin's out today, but we had a conversation and decided I'll go ahead and open the issue to make sure this work is scheduled and tracked; striking while the iron is hot, etc.