Closed FuhuXia closed 2 years ago
I'm a little concerned about this change @FuhuXia. Since the leader will send updates to all the followers, if two followers accessing the same EFS directory exist, won't the data be duplicated? Wouldn't it be safer to make this a non-rolling restart, i.e. turn off a single follower instance, turn it back on, and iterate through all the followers? We should have enough instances in the short term to handle the load...
The only command we can issue is to stop a task. We have no control over when the new task starts; the ECS service handles that. The risk of the stopped Solr still writing to the index is low.
Other than using the `none` locktype, we can discuss other options, such as:

2. find a command-line utility to check for the core lock, and start Solr only after verifying no lock exists.

It seems that if we use SimpleFSLockFactory, then we can do the no. 2 mentioned above by simply checking for the lock file's existence. Let me run some local tests to verify.
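Option 2 above can be sketched as a small shell gate that runs before starting Solr. This is only an illustration: `CORE_DIR` is a hypothetical path, and `write.lock` is the default lock file name Lucene's file-based lock factories create inside the index directory.

```shell
#!/bin/sh
# Wait until no core lock file exists, then report it is safe to start Solr.
# CORE_DIR is an assumed example path; adjust to the real EFS mount.
CORE_DIR="${CORE_DIR:-/var/solr/data/ckan/data/index}"
RETRIES="${RETRIES:-30}"
SLEEP_SECS="${SLEEP_SECS:-2}"

wait_for_unlock() {
  lock_file="$CORE_DIR/write.lock"
  tries="$RETRIES"
  while [ -f "$lock_file" ] && [ "$tries" -gt 0 ]; do
    echo "lock file still present, waiting..."
    sleep "$SLEEP_SECS"
    tries=$((tries - 1))
  done
  # Succeed only if the lock is gone after the retries.
  [ ! -f "$lock_file" ]
}

if wait_for_unlock; then
  echo "no lock found, safe to start solr"
fi
```

In an ECS setup this kind of check would have to run as part of the container entrypoint, before the `solr` process itself is launched.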
From the SimpleFSLockFactory documentation: "The drawback of this implementation is that it might happen when the JVM holding the lock crashes that you have to manually remove the stale lock file."
After much deliberation, it's been decided that the `none` locktype has the minimum risk with the least effort to move forward 🌊
After putting the `none` locktype into production, we noticed Solr index corruption when followers are restarted while replication is in progress. We will change to the `simple` locktype with some retry and file-timestamp checking to make it bulletproof. New PR coming.
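The planned "retry and file timestamp checking" for the `simple` locktype could look roughly like the sketch below: a leftover `write.lock` whose mtime is older than a threshold is treated as stale (a crashed JVM, per the SimpleFSLockFactory caveat quoted earlier) and removed, while a fresh lock means another Solr is still active. The path and threshold are assumptions, not the actual PR.

```shell
#!/bin/sh
# Hypothetical stale-lock cleanup for the `simple` locktype.
LOCK_FILE="${LOCK_FILE:-/var/solr/data/ckan/data/index/write.lock}"
MAX_AGE="${MAX_AGE:-300}"   # assumed: seconds before a lock counts as stale

lock_is_stale() {
  now=$(date +%s)
  # GNU stat first, BSD stat as a fallback.
  mtime=$(stat -c %Y "$LOCK_FILE" 2>/dev/null || stat -f %m "$LOCK_FILE")
  [ $((now - mtime)) -gt "$MAX_AGE" ]
}

clear_stale_lock() {
  # Remove the lock only when it exists AND is older than MAX_AGE;
  # a recent lock is assumed to belong to a live Solr and is left alone.
  if [ -f "$LOCK_FILE" ] && lock_is_stale; then
    echo "removing stale lock $LOCK_FILE"
    rm -f "$LOCK_FILE"
  fi
}
```

A wrapper like the earlier wait loop could call `clear_stale_lock` between retries, so a crashed task's lock eventually clears while a live task's lock still blocks startup.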
Related to https://github.com/GSA/data.gov/issues/3920
During task restart, the stopped task might take a while to release the Solr core on EFS before the new task can use it.
We often see this error on the newly started Solr when we stop a Solr task. When this error happens, the new Solr is not usable until we stop it again. Changing the locktype to `none` should prevent this error from happening. It is simple and safe in our setup. https://cwiki.apache.org/confluence/display/lucene/AvailableLockFactories#