GSA / data.gov

Main repository for the data.gov service
https://data.gov

Dissect Solr Performance through New Relic #3956

Open nickumia-reisys opened 2 years ago

nickumia-reisys commented 2 years ago

User Story

To gain insight into why Solr has stability issues, the Data.gov Solr team wants to integrate New Relic (NR) into our Solr deployment and investigate performance metrics to isolate problem areas.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

Background

There have been numerous Solr issues where we could not identify the cause of the problem and were developing blindly. This issue would give us insight into which function calls or Solr operations are causing problems and help us identify which parameters should be tuned in each case.

Historical Issues:

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

Reference: https://tech.olx.com/improving-solr-performance-f4202d28b72d

nickumia-reisys commented 12 months ago

Through an interactive discussion with NR Support, we determined that there are Solr optimizations we can make:

  1. Speed Optimization
    1. Number of Solr Calls
      • For a homepage load, there are 21 calls to Solr. This should not need to be more than 2. In other words, the CKAN core code is inefficient. Optimizing this without breaking a CKAN feature would be a daring endeavor, and the effort would need serious thought.
    2. Speed of Solr Calls
      • Each Solr call takes around 1s on average, which means our Solr deployment is pretty inefficient. For context, a ("similar"?) DB call is made 140 times but takes less than 100ms for all of those calls cumulatively. We probably won't be able to match the DB speed, and it is unclear whether the performance-to-cost ratio justifies increasing the size of the Solr instance. We currently give Solr 4 vCPUs, and AWS supports up to 16 vCPUs.
  2. Better Performance Monitoring
    1. See https://github.com/GSA/data.gov/issues/4473 for more details.
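
As a rough back-of-the-envelope reading of the figures above, the sketch below multiplies them out. It assumes the Solr calls run one after another and that every call costs the quoted 1s average; neither is confirmed in this thread, so treat the output as an order-of-magnitude estimate only. The variable names are illustrative.

```python
# Figures quoted in the comment above; everything else is an assumption.
SOLR_CALLS_PER_HOMEPAGE = 21   # observed calls per homepage load
SOLR_CALLS_TARGET = 2          # suggested upper bound
SOLR_SECONDS_PER_CALL = 1.0    # "around 1s on average"
DB_CALLS = 140                 # observed DB calls
DB_TOTAL_SECONDS = 0.100       # "less than 100ms" cumulatively (upper bound)


def total_seconds(calls: int, seconds_per_call: float) -> float:
    """Total time if the calls run sequentially (an assumption)."""
    return calls * seconds_per_call


current = total_seconds(SOLR_CALLS_PER_HOMEPAGE, SOLR_SECONDS_PER_CALL)
target = total_seconds(SOLR_CALLS_TARGET, SOLR_SECONDS_PER_CALL)
db_ms_per_call = DB_TOTAL_SECONDS / DB_CALLS * 1000
slowdown = SOLR_SECONDS_PER_CALL * 1000 / db_ms_per_call

print(f"Solr time per homepage load today: ~{current:.0f} s")
print(f"Solr time with {SOLR_CALLS_TARGET} calls: ~{target:.0f} s")
print(f"DB time per call: < {db_ms_per_call:.2f} ms")
print(f"One Solr call vs. one DB call: > {slowdown:.0f}x slower")
```

Under these assumptions, cutting the number of calls helps roughly linearly, while the per-call gap to the DB is about three orders of magnitude.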

These points do not yet address why there is a (memory leak?) in Solr or what we can do to resolve it. The second point will allow us to do more debugging.
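
For local debugging before fuller monitoring is in place, a minimal sketch of counting and timing calls could look like the following. This is not how New Relic or CKAN instruments Solr; `timed`, `call_log`, `summarize`, and `fake_solr_search` are hypothetical names, and the search function only sleeps in place of a real Solr request.

```python
import time
from collections import defaultdict
from functools import wraps

# name -> list of elapsed seconds, one entry per call
call_log = defaultdict(list)


def timed(name):
    """Record how often a function is called and how long each call takes."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Logged even if the wrapped call raises.
                call_log[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@timed("solr_search")
def fake_solr_search(query):
    # Hypothetical stand-in for a real Solr request; sleeps instead of querying.
    time.sleep(0.01)
    return {"query": query, "results": []}


def summarize(log):
    """Return {name: (call_count, total_seconds, mean_seconds)}."""
    return {
        name: (len(times), sum(times), sum(times) / len(times))
        for name, times in log.items()
    }


for q in ["climate", "health", "energy"]:
    fake_solr_search(q)

for name, (count, total, mean) in summarize(call_log).items():
    print(f"{name}: {count} calls, {total:.3f} s total, {mean:.3f} s mean")
```

Wrapping the real search entry point the same way would show, per page load, both the number of Solr calls and their cumulative time, which are the two figures discussed above.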

btylerburton commented 9 months ago

Depends upon this ticket: https://github.com/GSA/data.gov/issues/3956