GSA / data.gov

Main repository for the data.gov service
https://data.gov

SolrCloud warm standby #3650

Open jbrown-xentity opened 2 years ago

jbrown-xentity commented 2 years ago

User Story

To be able to recover from a failed SolrCloud cluster, data.gov admins want a warm SolrCloud standby that we can switch to and keep the site up.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

Background

The Solr sync process on a full production instance takes 4 days. We're not sure how long restoring from a backup takes, but it is probably some fraction of that. Even at 10% of the sync time, we're talking about roughly 10 hours to recover. Since an outage of that length isn't acceptable, we want a SolrCloud instance ready to take over if Solr crashes. Ideally we would use Solr's native backup process on the live cluster and regularly restore those backups into the standby (this uses S3; see documentation here).
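For illustration, the backup/restore cycle maps onto the Solr Collections API's `BACKUP` and `RESTORE` actions, which accept a `repository` parameter naming a backup repository (for example an S3 one) configured in `solr.xml`. The sketch below only builds the request URLs. The host names, collection name (`ckan`), repository name (`s3`) and location are hypothetical placeholders, not data.gov's actual configuration.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical placeholders: none of these values come from the issue.
LIVE_SOLR = "http://solr-live.example:8983"
STANDBY_SOLR = "http://solr-standby.example:8983"
COLLECTION = "ckan"   # assumed collection name
REPOSITORY = "s3"     # assumed name of an S3 backup repository declared in solr.xml
LOCATION = "/"        # assumed path inside that repository


def collections_api_url(base: str, action: str, backup_name: str,
                        async_id: Optional[str] = None) -> str:
    """Build a Solr Collections API URL for a BACKUP or RESTORE request."""
    params = {
        "action": action,
        "name": backup_name,
        "collection": COLLECTION,
        "repository": REPOSITORY,
        "location": LOCATION,
    }
    if async_id is not None:
        # Long-running requests can run asynchronously; progress is then
        # polled with action=REQUESTSTATUS&requestid=<async_id>.
        params["async"] = async_id
    return f"{base}/solr/admin/collections?{urlencode(params)}"


def backup_url(backup_name: str) -> str:
    # The backup is taken from the live cluster.
    return collections_api_url(LIVE_SOLR, "BACKUP", backup_name)


def restore_url(backup_name: str) -> str:
    # The same backup is restored into the standby, which must be able
    # to read from the same S3 repository.
    return collections_api_url(STANDBY_SOLR, "RESTORE", backup_name)
```

A scheduled job could issue `backup_url(name)` against the live cluster and then `restore_url(name)` against the standby. Whether `RESTORE` can overwrite an existing collection depends on the Solr version, so check the reference guide for the version in use before relying on it for repeated refreshes.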

We are also considering a different approach: a separate Solr cluster that is spun up and re-synced every 1-4 days, ready to take over if necessary.
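With a scheduled re-sync, the standby lags the live cluster by up to one sync interval, so it may help to check how stale it is before switching traffic to it. The minimal sketch below assumes the standby records the timestamp of its last completed sync (hypothetical bookkeeping that neither approach provides by default) and uses the 4-day upper bound of the cadence described above.

```python
from datetime import datetime, timedelta


def standby_lag(last_sync: datetime, now: datetime) -> timedelta:
    """How far behind the live cluster the standby is at `now`."""
    return now - last_sync


def standby_is_fresh(last_sync: datetime, now: datetime,
                     max_interval_days: int = 4) -> bool:
    """True if the standby was re-synced within the agreed interval.

    max_interval_days defaults to the upper end of the 1-4 day cadence;
    a stale standby would silently serve outdated search results.
    """
    return standby_lag(last_sync, now) <= timedelta(days=max_interval_days)
```

A monitoring job could alert when `standby_is_fresh` returns `False`, which would mean the scheduled sync has stopped running.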

Security Considerations (required)

None; data does not leave the secure environment.

Sketch

Alternate Sketch

jbrown-xentity commented 2 years ago

This may not be necessary if #3745 works out; iceboxing for now.