IQSS / dataverse

Open source research data repository software
http://dataverse.org

Investigate solr performance, diagnose bottlenecks and develop optimizations #9635

Closed landreev closed 9 months ago

landreev commented 1 year ago

This primarily concerns busy instances with large amounts of indexed metadata, such as the IQSS production installation. There is some evidence that Solr becomes the performance bottleneck under high service load; facet queries are especially suspect. Addressing this has the potential to benefit any busy instance, hence the issue in the main repo.

@kcondon got the ball rolling on this and has already found and applied some useful information. He has started a Google doc documenting his work. We may link it here, and we will try to document the most useful things we find here (and, eventually, in the guides). Kevin's out today, but we had a conversation and decided that I'll go ahead and open the issue to make sure this work is scheduled and tracked, striking while the iron is hot, etc.

landreev commented 1 year ago

Current plan of investigation (much simplified; per Slack discussion):

(more to be added; work in progress)

cmbz commented 1 year ago
luddaniel commented 1 year ago

Related issue #8941 :)

landreev commented 1 year ago

There's another low-hanging-fruit improvement that can be handled as a compact, self-contained issue. Inside SearchIncludeFragment (the class that runs the search for the collection page, i.e., where most searches issued by Dataverse originate), every search is literally run twice, in order to obtain the result counts for the object types NOT currently selected by the user. While the second search is necessary to obtain these counts, it can be run in a more economical way that can reduce the amount of work Solr has to perform by up to 50%. The details are spelled out in a #dv-tech thread and can be copied and pasted into an issue once we open it.
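
The #dv-tech thread with the actual details is not reproduced here, so the following is only a minimal sketch of one plausible reading: since the second search exists only to produce counts, it could ask Solr for zero rows, skip highlighting, and facet on the object type field alone. The function names and the `dvObjectType` field name are assumptions made for illustration, not the code in SearchIncludeFragment; `q`, `fq`, `rows`, `facet`, `facet.field` and `hl` are standard Solr query parameters.

```python
from urllib.parse import urlencode


def type_filter(object_types):
    """Build a filter query restricting results to the given object types.

    The field name dvObjectType is an assumption for this sketch.
    """
    return "dvObjectType:(" + " OR ".join(object_types) + ")"


def main_search_params(query, selected_types, rows=10):
    """Parameters for the search whose results are actually displayed."""
    return {
        "q": query,
        "fq": type_filter(selected_types),
        "rows": rows,
        "facet": "true",
        "hl": "true",
    }


def count_only_params(query, unselected_types):
    """Parameters for the second search, which only needs per-type counts."""
    return {
        "q": query,
        "fq": type_filter(unselected_types),
        "rows": 0,                       # return no documents, only counts
        "facet": "true",
        "facet.field": "dvObjectType",   # facet on the type field only
        "hl": "false",                   # highlighting is not needed for counts
    }


# Example: the user has dataverses and datasets selected, so the second
# search only has to count files.
print(urlencode(count_only_params("*", ["files"])))
```

Whether this gets close to the 50% figure mentioned above depends on how expensive the skipped work (document retrieval, highlighting, the remaining facets) is on a given index.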

landreev commented 1 year ago

@luddaniel I just looked at #8941. If it can be addressed by simply adding fileCount as another numeric field during indexing, that does sound very straightforward, and I don't know why we haven't implemented it yet or how we overlooked it. More generally, we absolutely need to reduce the number of SQL lookups required when we process lists of search results as they come back from Solr. Hopefully we can address it soon.
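
To illustrate the idea (not the actual Dataverse indexing code), here is a minimal sketch in which the file count is computed once at index time and later read straight from the search result. Apart from `fileCount`, every name here (`build_dataset_doc`, `file_count_for_result`, the `id` format, the `db_lookup` callback) is hypothetical.

```python
def build_dataset_doc(dataset_id, title, file_ids):
    """Build the Solr document for a dataset at index time."""
    return {
        "id": f"dataset_{dataset_id}",   # hypothetical id format
        "title": title,
        "fileCount": len(file_ids),      # numeric field, computed once when indexing
    }


def file_count_for_result(solr_doc, db_lookup):
    """Return the file count for one search result.

    The indexed value is preferred; the database is consulted only for
    documents that were indexed before the field existed.
    """
    if "fileCount" in solr_doc:
        return solr_doc["fileCount"]
    return db_lookup(solr_doc["id"])
```

With the field present in every document, rendering a page of results would need no per-result SQL lookup for the count.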

landreev commented 11 months ago

We had a series of prod. auto-restarts again yesterday. It is now confirmed that they were caused by Solr ceasing to respond. A band-aid, but hopefully an effective temporary workaround, was applied on the Solr side: setting up the "circuit breaker" config options. This makes Solr start dropping new connections with a 503 when a certain threshold of memory and/or CPU utilization is reached, rather than hanging. (The observed search page freezes were accompanied by CPU utilization approaching 800% on dvn-cloud-solr, which is an 8-CPU node, i.e., every core was close to saturated.) Thanks and kudos to @kcondon for researching and figuring out the above.
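
Once Solr answers with a 503 instead of hanging, a client can fail fast rather than wait. The following is a minimal sketch of that client-side handling, assuming a plain HTTP call to Solr; it is not how Dataverse talks to Solr, and `query_solr`, `SearchUnavailable` and the `opener` parameter are names invented for the example.

```python
import socket
import urllib.error
import urllib.request


class SearchUnavailable(Exception):
    """Raised when Solr sheds load (HTTP 503) or does not answer in time."""


def query_solr(url, timeout_seconds=5.0, opener=urllib.request.urlopen):
    """Fetch a Solr response body, translating overload into one exception.

    The opener argument exists only so the function can be exercised
    without a running Solr.
    """
    try:
        with opener(url, timeout=timeout_seconds) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # HTTPError must be handled before URLError, its parent class.
        if err.code == 503:
            raise SearchUnavailable("Solr is shedding load (503)") from err
        raise
    except (urllib.error.URLError, socket.timeout) as err:
        raise SearchUnavailable("Solr did not respond") from err
```

A caller could catch `SearchUnavailable` and show a "search is temporarily unavailable" message, so that a busy Solr no longer ties up application threads.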

kcondon commented 11 months ago

I also noticed that clicking on basic search with no parameters spikes Solr CPU. Clicking multiple times raises CPU use rapidly, to the point where I can simulate the 700+% CPU load on Solr. Refreshing the homepage or clicking on facets does not do this: beyond an initial spike, the load levels off. I think I was able to trigger the CPU circuit breaker using this method.
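
The repeated clicking could be scripted roughly as follows. This is a sketch for use against a test instance only; the helper names and the URL in the usage comment are placeholders, not real endpoints.

```python
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_status(url, timeout_seconds=30.0):
    """Issue one GET and return (HTTP status, elapsed seconds)."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # e.g. 503 once a circuit breaker trips
    return status, time.monotonic() - started


def repeat_search(url, clicks, concurrency, fetch=fetch_status):
    """Send the same search request `clicks` times, `concurrency` at a time."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(fetch, [url] * clicks))


# Usage (placeholder URL, test instance only):
# results = repeat_search("https://dataverse-test.example.edu/<basic search URL>", 20, 5)
# print(sorted(status for status, _ in results))
```

Watching the returned statuses alongside Solr's CPU use would show whether, and when, responses switch from 200 to 503.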

landreev commented 11 months ago

I'm implementing the following application-side improvements, and I hope to make a PR soon. (More may be added to the list.)