GSA / data.gov

Main repository for the data.gov service
https://data.gov

O+M Solr Search Response Time #4608

Open gujral-rei opened 5 months ago

gujral-rei commented 5 months ago

As part of day-to-day operation of Data.gov, there are many Operation and Maintenance (O&M) responsibilities. Instead of having the entire team watching notifications and risking some notifications slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role which rotates each sprint. This is not meant to be a 24/7 responsibility, only East Coast business hours. If you are unavailable, please note when you will be unavailable in Slack and ask for someone to take on the role for that time.

Check the O&M Rotation Schedule for future planning.

Acceptance criteria

You are responsible for all O&M responsibilities this week. We've highlighted a few so they're not forgotten. You can copy each checklist into your daily report.

Daily Checklist

Weekly Checklist

Monthly Checklist

Ad-hoc Checklist

Reference

gujral-rei commented 5 months ago

Per New Relic, the SOLR search takes around 6 seconds on average. The number of requests stays constant at around two requests per minute. I'm not sure if it's typical for SOLR to take this long, irrespective of the type of transaction.

https://one.newrelic.com/nr1-core/apm-features/transactions/MTYwMTM2N3xBUE18QVBQTElDQVRJT058MTAwMzAxMDk1Ng?account=1601367&duration=259200000&state=ecca532c-8046-2ce7-562f-35b935e08ac7

Image

gujral-rei commented 5 months ago

I re-checked with Fuhu, and the average response time of 6 to 8 seconds is not normal. The image and link below give more details.

Below is the link: https://onenr.io/02R5zLdl0jb

Image

gujral-rei commented 5 months ago

On average, the SOLR servers process each request within 200 to 300 milliseconds. However, considering the analysis above and the analysis below, the time taken to parse the response from the SOLR server is likely the bottleneck. From a transaction standpoint (total time, including the trip to the SOLR server), the SOLR transaction takes 5 seconds on average. From the SOLR DB/server standpoint, it takes around 200 to 300 ms. So, where did the other 4.5 seconds go?

Average: SELECT average(apm.service.datastore.operation.duration * 1000) FROM Metric WHERE (entity.guid = 'MTYwMTM2N3xBUE18QVBQTElDQVRJT058MTAwMzAxMDk1Ng') AND ((db.system = 'Solr')) FACET db.system LIMIT 5 SINCE 1 week AGO TIMESERIES

Max: SELECT max(apm.service.datastore.operation.duration * 1000) FROM Metric WHERE (entity.guid = 'MTYwMTM2N3xBUE18QVBQTElDQVRJT058MTAwMzAxMDk1Ng') AND ((db.system = 'Solr')) FACET db.system LIMIT 5 SINCE 1 week AGO TIMESERIES

Image

Image

gujral-rei commented 5 months ago

I kept looking into why the transactions were slow and narrowed it down to the fact that we call the SOLR DB up to 20 times in one web request. While one SOLR request may take between 200 and 300 milliseconds (this can be further improved), calling it 20 times in one request adds up to around 5 seconds on average. We have to look at the code block and see if there is a for loop that can be optimized.
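As a rough sanity check on the arithmetic above, the sketch below models a web request that issues its SOLR calls one after another. The function name and the 250 ms midpoint are illustrative assumptions, not measurements taken from New Relic.

```python
def sequential_request_ms(calls: int, per_call_ms: float) -> float:
    """Total time when SOLR calls run back to back with no overlap."""
    return calls * per_call_ms


# Assumed midpoint of the 200 to 300 ms range reported above.
PER_CALL_MS = 250.0

twenty_calls = sequential_request_ms(20, PER_CALL_MS)
one_call = sequential_request_ms(1, PER_CALL_MS)
print(f"20 calls: {twenty_calls / 1000:.1f} s, 1 call: {one_call / 1000:.2f} s")
```

Under these assumptions, 20 sequential calls land at about 5 seconds, which matches the observed transaction average and is consistent with (though not proof of) the repeated-call explanation.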

https://onenr.io/0oR8G5dv3wG

Image

FuhuXia commented 5 months ago

That is a good finding. 20+ Solr transactions in one request makes sense, since CKAN lists 20 datasets per page. These 20 datasets could be loaded with one Solr transaction instead of 20.
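To illustrate what "one Solr transaction instead of 20" could look like, here is a minimal sketch that builds a single `/select` URL covering all ids on a page. The base URL, the `id` field name, and the helper name are assumptions made for illustration; this is not how CKAN actually builds its queries.

```python
from typing import List
from urllib.parse import urlencode


def batched_select_url(base_url: str, dataset_ids: List[str]) -> str:
    """Build one Solr /select URL that asks for all given ids at once."""
    if not dataset_ids:
        raise ValueError("need at least one dataset id")
    # Quote each id so hyphens are treated literally. Ids that themselves
    # contain double quotes would need escaping, which this sketch skips.
    id_clause = " OR ".join(f'"{i}"' for i in dataset_ids)
    params = {
        "q": f"id:({id_clause})",  # assumes the unique key field is `id`
        "rows": len(dataset_ids),  # return every match in one response
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core URL and ids, for illustration only.
url = batched_select_url("http://solr.example:8983/solr/ckan", ["a-1", "b-2"])
print(url)
```

The point of the sketch is only that the round-trip cost is paid once per page rather than once per dataset.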

jbrown-xentity commented 5 months ago

That is a crazy number. I would be surprised if we're getting the datasets one at a time; I think it's more likely related to the filters on the left side. Each filter needs to scan the entire catalog and summarize the counts for each value (21,378 datasets are in the Local Government Topic, for instance). These 9 categories, computed each time the catalog search loads, probably contribute to the slowdown...

FuhuXia commented 5 months ago

@jbrown-xentity the evidence that the slowness is due to 20 Solr calls for the 20 datasets on one page, not the facet queries, is that pages listing fewer datasets, such as the last page or a search that returns only a few datasets, load much faster. The number of facet queries is the same, but there are fewer datasets. I think all the facet queries are done within one call.
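One way to make this argument quantitative is to fit a line through (datasets on page, load time) pairs. A slope close to the per-call Solr latency would support the one-call-per-dataset explanation, while a slope near zero would point at fixed costs such as facets. The sketch below is a plain least-squares fit; the sample points are synthetic, generated from an assumed model (0.5 s fixed cost plus 0.25 s per dataset), and are not real measurements.

```python
def fit_line(points):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic points from an assumed model, NOT measured data.
samples = [(n, 0.5 + 0.25 * n) for n in (1, 5, 10, 20)]
intercept, slope = fit_line(samples)
print(f"fixed cost ~{intercept:.2f} s, per-dataset cost ~{slope:.2f} s")
```

With real page timings in place of `samples`, the intercept estimates the cost shared by every page (including facets) and the slope estimates the cost added by each listed dataset.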

btylerburton commented 5 months ago

Great findings here @gujral-rei. If pagination is controlled by a config we could change the number and confirm.

@FuhuXia do you think there's low-hanging fruit here to reduce those calls to a single transaction? If it were that easy, I'd think CKAN core would've been optimized already.

FuhuXia commented 5 months ago

We can confirm the findings and then file an issue with CKAN core. It's not a problem for instances with Solr sitting right next to them. It is a problem for us, where Solr sits across many infrastructure spaghetti cables.

btylerburton commented 4 months ago

@FuhuXia did the upgrade to 2.10.4 solve this issue?

FuhuXia commented 4 months ago

No. But during testing I confirmed on my local instance that there is only one Solr call for each page listing, not 20. So we need to look further into the finding of 20 Solr calls per request.
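For anyone who wants to repeat this kind of check, one possible approach is to wrap the search method of whatever Solr client object the application uses and count how often it fires during a single page render. The sketch below uses a stub client and a hypothetical `render_page` function so it runs on its own; neither is CKAN code.

```python
import functools


def count_calls(obj, method_name):
    """Wrap obj.method_name so each call is counted; return the counter."""
    original = getattr(obj, method_name)
    counter = {"calls": 0}

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        counter["calls"] += 1
        return original(*args, **kwargs)

    setattr(obj, method_name, wrapper)
    return counter


class StubSolr:
    """Stand-in for a real Solr client, only here to make the sketch runnable."""

    def search(self, q, **params):
        return []


def render_page(client, dataset_ids):
    """Hypothetical page render that issues a single listing query."""
    return client.search("*:*", rows=len(dataset_ids))


client = StubSolr()
counter = count_calls(client, "search")
render_page(client, ["a", "b", "c"])
print(counter["calls"])
```

If the counter reads 1 locally while New Relic shows around 20 datastore operations per transaction in production, the difference itself becomes the thing to investigate.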