Open keith-turner opened 5 years ago
Looks like I already looked into this, #1139. However, I cannot remember whether the master logs a user-friendly error message stating that the old processes are preventing the upgrade.
No activity in over 3 years so closing, can be re-opened if still relevant.
Seems like this could be checked / added as a follow on to #3098
I think we are protected, but maybe we could do more. I ran a test upgrading from 2.1.1-SNAPSHOT to 3.0.0-SNAPSHOT using uno with a single tserver.
What happened:
So without a "new" tserver hosting the metadata table, the upgrade only partially succeeded and the system is now in an inconsistent state. If there had been 3.0 tservers, the upgrade would likely have succeeded, and the old tserver would not have been able to talk to the manager.
Unsure about the gc.
It may be possible to read the table_locks in ZooKeeper and then try to get status from the tservers that are registered there BEFORE starting the upgrade. If that works, then we could a) check that the manager can talk to at least one tserver, b) fail the upgrade if any tserver with a registered lock fails a status check, or c) use the min tserver count property to abort if fewer tservers than specified respond.
There are potential issues with both the a) and b) approaches. Starting a cluster with one tserver, or without a substantial portion of the tservers, might not be the best way to proceed. If ALL tservers are required, then a transient ZooKeeper error could abort the upgrade unnecessarily.
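The combined check described above could be sketched as a pure decision function. This is only an illustration of the policy, not the actual Accumulo API: the class name, the status-check callback, and the way registered tservers are discovered are all assumptions.

```java
import java.util.List;
import java.util.function.Predicate;

/**
 * Hypothetical pre-upgrade safety check combining the three options above:
 * a) at least one registered tserver must answer a status request,
 * b) any registered lock holder failing the status check aborts the upgrade,
 * c) fewer responders than the configured minimum also aborts.
 */
public class PreUpgradeCheck {

  public static boolean mayProceed(List<String> registeredTservers,
      Predicate<String> statusCheck, int minTservers) {
    long responding = 0;
    for (String tserver : registeredTservers) {
      if (statusCheck.test(tserver)) {
        responding++;
      } else {
        // option b: a tserver with a registered lock failed its status check
        return false;
      }
    }
    // options a and c: need at least one responder, and at least minTservers
    return responding >= Math.max(1, minTservers);
  }
}
```

Note that this policy inherits the concern above: a transient failure in the status check for a single tserver would veto the whole upgrade, so in practice the check might need retries or a timeout.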
When running a 2.1 manager, killing the tserver, and then starting a 3.0 tserver, the "new" tserver fails to communicate with the manager.
The manager log repeatedly reports that it cannot get the status of that tserver and cannot get it to halt, so it is obvious there is an issue with it.
With an upgraded 3.0 manager and tserver, and the gc still running from the previous 2.1 instance, the gc fails to run: it cannot scan the metadata (`Failed to locate tablet for table : +r row : ~del`).
With the proposed ServiceLockData abstraction that @dlmarion was working on, we could also serialize the data version into the lock information. Going forward, that would let the manager identify that no tservers are running the upgraded version.
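A minimal sketch of what carrying a data version in the lock contents could look like. The `VersionedLockData` class and the `version;host:port` layout are illustrative assumptions; the real ServiceLockData abstraction may well serialize differently (e.g. as JSON).

```java
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical lock payload that pairs a data version with the server address,
 * so a manager reading the lock node can tell which version the holder runs.
 */
public class VersionedLockData {
  final int dataVersion;
  final String address;

  VersionedLockData(int dataVersion, String address) {
    this.dataVersion = dataVersion;
    this.address = address;
  }

  /** Serialize as "version;host:port" for storage in the ZooKeeper lock node. */
  byte[] serialize() {
    return (dataVersion + ";" + address).getBytes(StandardCharsets.UTF_8);
  }

  /** Parse lock data written by serialize(). */
  static VersionedLockData deserialize(byte[] data) {
    String s = new String(data, StandardCharsets.UTF_8);
    int sep = s.indexOf(';');
    return new VersionedLockData(Integer.parseInt(s.substring(0, sep)),
        s.substring(sep + 1));
  }
}
```

With something like this in place, the manager could scan the registered locks and refuse to start serving until at least one lock advertises the expected data version.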
The main concern I have from the above investigation is the inconsistent state of having upgraded ZooKeeper but not being able to complete the rest of the upgrade. We should be able to resume and finish the upgrade once a newer tserver is online and hosting the metadata.
The restart did work, and the system recovered normally once it had a tserver running the correct version.
Currently it looks like when the master comes up, there are no tservers registered with ZooKeeper in ../table_locks. It may be sufficient on upgrade to drop any locks that are present and allow the tservers to perform reassignment when commanded by the master. More thorough would be to reach out and see if they respond to a status request. Exploring options now.
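The simpler "drop any locks that are present" option could look roughly like the sketch below. Everything here is a placeholder: the parent path, the class name, and the deleter callback (in a real implementation the deleter would issue the actual ZooKeeper delete for each node, which is why it is injected here rather than hard-coded).

```java
import java.util.List;
import java.util.function.Consumer;

/**
 * Illustrative only: on upgrade, drop any stale lock nodes still present
 * under the table_locks parent before the master proceeds.
 */
public class TableLockCleanup {

  /**
   * Builds the full path of each child lock node and hands it to the
   * deleter; returns how many nodes were dropped.
   */
  static int dropStaleLocks(String parentPath, List<String> children,
      Consumer<String> deleter) {
    for (String child : children) {
      deleter.accept(parentPath + "/" + child);
    }
    return children.size();
  }
}
```

The more thorough variant discussed above would first attempt a status request against each lock holder and only drop the locks of tservers that do not respond.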
I think the expectation when someone does an Accumulo upgrade is that all Accumulo processes are killed across the cluster before starting the new version of Accumulo. However, what, if anything, is done to handle a situation like the following?
What will happen in this situation? I think ideally the 2.0.0 master process would log an error message about the 1.9.3 tservers, take no upgrade actions, and terminate itself.