GSA-TTS / datagov-brokerpak-solr

An OSBAPI brokerpak that supplies services needed by the datagov team

fix: change solr locktype to none #53

Closed FuhuXia closed 2 years ago

FuhuXia commented 2 years ago

Related to https://github.com/GSA/data.gov/issues/3920

During a task restart, the stopped task might take a while to release the Solr core on EFS before the new task can use it.

We often see this error on the newly started Solr task after we stop the old one. When this error happens, the new Solr is not usable until we stop it again. Changing the lockType to none should prevent this error from happening. It is simple and safe in our setup. https://cwiki.apache.org/confluence/display/lucene/AvailableLockFactories#

SolrCore Initialization Failures

```
ckan: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Index dir '/var/solr/data/ckan/data/index/' of core 'ckan' is already locked. The most likely cause is another Solr server (or another solr core in this server) also configured to use this directory; other possible causes may be specific to lockType: native
Please check your logs for more information
```
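The fix is a one-line change in the core's `solrconfig.xml`. A sketch of what that looks like (the `<indexConfig>`/`<lockType>` elements are standard Solr; the comment values and exact file location in this brokerpak are assumptions):

```xml
<!-- solrconfig.xml for the ckan core (location in this setup is an assumption) -->
<indexConfig>
  <!-- "native" (the default) takes an OS-level lock on write.lock in the
       index dir; "none" disables index locking entirely, relying on ECS
       to ensure only one writer task uses the EFS directory at a time -->
  <lockType>none</lockType>
</indexConfig>
```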
jbrown-xentity commented 2 years ago

I'm a little concerned about this change @FuhuXia. Since the leader will send updates to all the followers, if two followers exist accessing the same EFS directory won't the data be duplicated? Wouldn't it be safer to make this a non-rolling restart; aka turn off a single follower instance then turn back on, and iterate through all the followers? We should have enough instances in the short term to handle the load...

FuhuXia commented 2 years ago

> I'm a little concerned about this change @FuhuXia. Since the leader will send updates to all the followers, if two followers exist accessing the same EFS directory won't the data be duplicated? Wouldn't it be safer to make this a non-rolling restart; aka turn off a single follower instance then turn back on, and iterate through all the followers? We should have enough instances in the short term to handle the load...

The only command we can issue is to stop a task. We have no control over when the new task starts; the ECS service handles that.

The risk of the stopped Solr still writing to the index is low.

Other than using the none lockType, we can discuss other options, such as:

  1. Adding sleep time before the `solr start` command.
  2. Finding a command-line utility to check the core lock, and starting Solr only after verifying no lock is held.
  3. Changing the healthcheck to include a core check, not just whether the Solr service is up.
  4. Starting the new task with replication disabled to avoid index corruption, then calling the API to start replication.
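Option 4 could lean on Solr's ReplicationHandler, which accepts `disablepoll`/`enablepoll` commands on a follower. A minimal sketch of building those calls (host, port, and core name are illustrative, not from this PR):

```shell
#!/bin/sh
# Sketch of option 4: pause follower polling around a restart, then resume.
# SOLR_URL and CORE defaults are assumptions for illustration.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
CORE="${CORE:-ckan}"

replication_cmd() {
  # Builds a ReplicationHandler URL; $1 is disablepoll or enablepoll
  echo "${SOLR_URL}/${CORE}/replication?command=$1"
}

# Before stopping the old task:
#   curl -s "$(replication_cmd disablepoll)"
# After the new core reports healthy:
#   curl -s "$(replication_cmd enablepoll)"
```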
FuhuXia commented 2 years ago

It seems that if we use SimpleFSLockFactory, then we can do no. 2 mentioned above by simply checking for the lock file's existence. Let me run some local tests to verify.

> 2. find a command line utility to check core lock, start solr only after no lock verified.

> The drawback of this implementation is that it might happen, when the JVM holding the lock crashes, that you have to manually remove the stale lock file.
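A minimal sketch of that check, assuming SimpleFSLockFactory, whose lock is a plain file named `write.lock` in the index directory (the timeout value and the default path are assumptions):

```shell
#!/bin/sh
# Sketch: delay `solr start` until the previous task has released the core,
# i.e. until SimpleFSLockFactory's write.lock file disappears.
INDEX_DIR="${INDEX_DIR:-/var/solr/data/ckan/data/index}"
TIMEOUT="${TIMEOUT:-120}"   # assumed grace period for the stopped task

wait_for_unlock() {
  elapsed=0
  while [ -f "$INDEX_DIR/write.lock" ]; do
    if [ "$elapsed" -ge "$TIMEOUT" ]; then
      echo "lock still present after ${TIMEOUT}s, giving up" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

wait_for_unlock && echo "no lock found, safe to run: solr start"
```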

nickumia-reisys commented 2 years ago

After much deliberation, it's been decided that the none lockType carries the minimum risk with the least effort to move forward 🌊

FuhuXia commented 2 years ago

After putting the none lockType into production, we noticed Solr index corruption when followers are restarted while replication is in progress. We will change to the simple lockType with some retry and file-timestamp checking to make it bulletproof. New PR coming.
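The retry-plus-timestamp idea might look roughly like the sketch below: treat a `write.lock` that has not been touched for a while as left behind by a dead task. The file name comes from SimpleFSLockFactory; the staleness threshold and path are assumed values, not from the follow-up PR.

```shell
#!/bin/sh
# Sketch: with the simple lockType, remove a lock file only when its
# modification time suggests the owning task is gone.
INDEX_DIR="${INDEX_DIR:-/var/solr/data/ckan/data/index}"
STALE_AFTER="${STALE_AFTER:-60}"   # assumed threshold in seconds

lock_is_stale() {
  lock="$INDEX_DIR/write.lock"
  [ -f "$lock" ] || return 1
  now=$(date +%s)
  # GNU stat first, BSD stat as a fallback
  mtime=$(stat -c %Y "$lock" 2>/dev/null || stat -f %m "$lock")
  [ $((now - mtime)) -gt "$STALE_AFTER" ]
}

if lock_is_stale; then
  # a lock untouched for STALE_AFTER seconds is presumed abandoned
  rm -f "$INDEX_DIR/write.lock"
fi
```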