DataONEorg / dataone-indexer

DataONE Indexer subsystem
Apache License 2.0

Solr Schema Upgrades - how to handle in Indexer? #52

Open artntek opened 1 year ago

artntek commented 1 year ago

We need to decide on, and implement, how we handle Solr schema upgrades and their associated reindex actions.

mbjones commented 12 months ago

In the past, we have often upgraded a Solr schema.xml file without reindexing the entire corpus, especially for DataONE, which has millions of versions of documents. That said, Solr really recommends against this, particularly for major version upgrades. They discuss this here, specifically dealing with schema changes, solrconfig changes, and version upgrades: https://solr.apache.org/guide/solr/latest/indexing-guide/reindexing.html

One mechanism to avoid downtime is to reindex to a new Solr collection, so that the old index continues to exist and serve queries. Once the new collection is available, we can use a collection alias to atomically switch the service from the old index content to the new content.
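A minimal sketch of the alias switch, assuming the Solr Collections API `CREATEALIAS` action (which, per the Solr docs, replaces an existing alias of the same name). The base URL, the alias name `search`, and the collection name `search_v2` are hypothetical placeholders, not names used by the indexer:

```python
from urllib.parse import urlencode


def create_alias_url(solr_base, alias, collection):
    """Build a Collections API CREATEALIAS request URL.

    Issuing this request with an alias name that already exists
    re-points the alias to `collection`.
    """
    query = urlencode(
        {"action": "CREATEALIAS", "name": alias, "collections": collection}
    )
    return f"{solr_base.rstrip('/')}/admin/collections?{query}"


if __name__ == "__main__":
    # Hypothetical host, alias, and collection names.
    print(create_alias_url("http://localhost:8983/solr", "search", "search_v2"))
```

Clients would query the alias rather than a concrete collection, so rolling back is the same call with the old collection name.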

When thinking about a SolrCloud update, it would be good to contemplate a rolling update strategy where 1) existing Solr pods continue operating and serving an existing collection index, probably in read-only mode; 2) a new version of the service is rolled out, and the new pods start up and begin reindexing the content into a new collection; 3) when the reindex is complete, the old pods are brought down, and the new pods begin serving requests, possibly after switching over with a collection alias. Extra bonus points for tying this seamlessly into Kubernetes rolling updates in a way that permits rollback to the original pods and indexed collection if for some reason the upgrade is not successful.
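The control flow of those three steps, including the rollback path, could look roughly like this. It is only a sketch: `reindex` and `switch_alias` are hypothetical callables standing in for whatever the indexer and the deployment tooling would actually provide, and nothing here talks to Solr or Kubernetes:

```python
def run_upgrade(reindex, switch_alias, old, new):
    """Reindex into `new`, then switch the alias; on failure keep serving `old`.

    reindex(new) -> bool       hypothetical; True when the reindex completed
    switch_alias(name) -> None hypothetical; points the serving alias at `name`
    Returns the name of the collection being served afterwards.
    """
    try:
        ok = reindex(new)  # step 2: old collection keeps serving meanwhile
    except Exception:
        ok = False
    if not ok:
        # Rollback is trivial: the alias was never moved off `old`.
        return old
    switch_alias(new)  # step 3: cut over only after a complete reindex
    return new
```

Because the alias moves only after a successful reindex, a failed upgrade leaves the original collection untouched and still serving.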

Of course, this is an ideal world -- we can manually manage this transition as well if such a set of features would significantly delay release.

mbjones commented 12 months ago

Related background (but with elasticsearch): https://developers.soundcloud.com/blog/how-to-reindex-1-billion-documents-in-1-hour-at-soundcloud

artntek commented 11 months ago

Notes from related discussion in ESS-DIVE meeting: could use a Sidecar container - checks periodically for new index file, then triggers reindex if changes detected
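A rough sketch of what such a sidecar check might do, assuming "new index file" means a changed schema file on a mounted volume. The function names, the polling interval, and the `trigger_reindex` callback are all hypothetical:

```python
import hashlib
import time
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of the file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def check_once(schema_path, last_digest, trigger_reindex):
    """Call trigger_reindex() if the file changed since last_digest.

    Returns the current digest so the caller can pass it to the next check.
    A last_digest of None (first run) only records the baseline.
    """
    current = file_digest(schema_path)
    if last_digest is not None and current != last_digest:
        trigger_reindex()
    return current


def watch(schema_path, trigger_reindex, interval_seconds=60):
    """Poll forever; a real sidecar would persist the digest across restarts."""
    last = None
    while True:
        last = check_once(schema_path, last, trigger_reindex)
        time.sleep(interval_seconds)
```

Comparing content hashes rather than modification times avoids spurious reindexes when a file is rewritten with identical content.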

artntek commented 11 months ago

This is important for 3.0.0 release, since that release involves a schema upgrade

artntek commented 9 months ago

Example

For the 3.0.0 release, this will be manual, and we can choose not to reindex for huge corpora. There is no end-user impact other than not being able to access the new info in MetacatUI (e.g. the new license field).

artntek commented 9 months ago

Manage manually for 3.0; automate for 3.1