Look into how upgrade handles running processes

keith-turner commented 5 years ago

I think the expectation when someone does an Accumulo upgrade is that all Accumulo processes are killed across the cluster before starting the new version of Accumulo. However, what, if anything, is done to handle a situation like the following?

- Some 1.9.3 tablet servers are running and have metadata tablets assigned to them.
- A 2.0.0 master process is started and it starts working on upgrade.

What will happen in this situation? I think ideally the 2.0.0 master process would log an error message about the 1.9.3 tservers, take no upgrade actions, and terminate itself.

keith-turner commented 5 years ago

Looks like I already looked into this in #1139. However, I cannot remember whether the master logs a user-friendly error message stating that the old processes are preventing the upgrade.

cshannon commented 1 year ago

No activity in over 3 years so closing, can be re-opened if still relevant.

EdColeman commented 1 year ago

Seems like this could be checked / added as a follow-on to #3098.

EdColeman commented 1 year ago

I think we are protected, but maybe we could do more? I ran a test going from 2.1.1-SNAPSHOT to 3.0.0-SNAPSHOT, using uno and a single tserver.

- set up 2.1 and 3.0
- started 2.1
- killed the 2.1 manager, all other services running
- started the 3.0 manager

What happened:

- The upgrade failed because it could not talk to the one tserver and had no host for the metadata table. It did upgrade ZooKeeper and root but failed at upgrading metadata (trace below).
- The 2.1 tserver remained running, but the manager could not talk to it (AccumuloSecurityException: Error BAD_CREDENTIALS for user !SYSTEM - Username or Password is Invalid). It also kept its lock in ZooKeeper.
- The gc tried to run, but failed when it could not find replication, which was removed as part of the upgrade.

So without a "new" tserver and the metadata table, the upgrade only partially succeeded and the system is now in an inconsistent state. If there had been 3.0 tservers, the upgrade would likely have succeeded and the old tserver would not have been able to talk with the manager.

Unsure about the gc.

It may be possible to read the table_locks in ZooKeeper and then try to get status from the tservers that are registered there BEFORE starting the upgrade. If that works, then we could a) check that the manager can talk to at least one tserver, b) fail the upgrade if any tserver with a registered lock fails a status check, or c) use the min tserver count property to abort if there are fewer tservers than specified.

There are potential issues with both the a) and b) approaches. Starting a cluster with only one tserver, or without a substantial portion of the tservers, might not be the best way to proceed. If ALL tservers are required, then a transient ZooKeeper error could abort the upgrade unnecessarily.
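To make the three options concrete, here is a minimal sketch of the decision logic only. It assumes the status checks have already been made and collected into a map; the class, enum, and method names are hypothetical and are not Accumulo APIs, and the real check would need to read the locks from ZooKeeper and call each tserver.

```java
import java.util.Map;

// Hypothetical sketch: none of these names exist in Accumulo.
final class PreUpgradeTserverCheck {

  // The three options discussed above: a), b) and c).
  enum Policy { AT_LEAST_ONE, ALL_RESPOND, MIN_COUNT }

  // statusOk maps each tserver holding a lock in ZooKeeper to whether
  // a status request to that tserver succeeded.
  static boolean mayProceed(Map<String, Boolean> statusOk, Policy policy, int minCount) {
    long responding = statusOk.values().stream().filter(Boolean::booleanValue).count();
    switch (policy) {
      case AT_LEAST_ONE:
        // a) the manager can talk to at least one tserver
        return responding >= 1;
      case ALL_RESPOND:
        // b) fail if any registered tserver fails the status check
        // (an empty map passes vacuously, which may not be desirable)
        return responding == statusOk.size();
      case MIN_COUNT:
        // c) abort if fewer tservers respond than the configured minimum
        return responding >= minCount;
      default:
        throw new IllegalArgumentException("unknown policy " + policy);
    }
  }
}
```

In the test scenario above (one registered 2.1 tserver that does not answer the 3.0 manager), all three policies would refuse to start the upgrade, provided the minimum count for c) is at least 1.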

3.0 manager upgrade log

```
2023-02-09T15:39:25,298 [manager.EventCoordinator] INFO : State changed from INITIAL to HAVE_LOCK
2023-02-09T15:39:25,303 [upgrade.PreUpgradeValidation] INFO : Starting validation on ZooKeeper ACLs
2023-02-09T15:39:25,368 [upgrade.PreUpgradeValidation] INFO : Successfully completed validation on ZooKeeper ACLs
2023-02-09T15:39:25,375 [upgrade.UpgradeCoordinator] INFO : Upgrading Zookeeper - current version 10 as step towards target version 11
2023-02-09T15:39:25,375 [upgrade.Upgrader10to11] INFO : upgrade of ZooKeeper entries
2023-02-09T15:39:25,415 [manager.EventCoordinator] INFO : Upgrade status changed from INITIAL to UPGRADED_ZOOKEEPER
2023-02-09T15:39:25,415 [metrics.MetricsUtil] INFO : initializing metrics, enabled:false, class:
2023-02-09T15:39:25,427 [metrics.MetricsUtil] INFO : Metric producer FateMetrics initialize
2023-02-09T15:39:25,427 [metrics.ManagerMetrics] INFO : Registered FATE metrics module
2023-02-09T15:39:25,437 [manager.EventCoordinator] INFO : State changed from HAVE_LOCK to NORMAL
2023-02-09T15:39:25,438 [upgrade.UpgradeCoordinator] INFO : Upgrading Root - current version 10 as step towards target version 11
2023-02-09T15:39:25,438 [upgrade.Upgrader10to11] INFO : upgrade root - skipping, nothing to do
2023-02-09T15:39:25,439 [manager.EventCoordinator] INFO : Upgrade status changed from UPGRADED_ZOOKEEPER to UPGRADED_ROOT
2023-02-09T15:39:25,439 [upgrade.UpgradeCoordinator] INFO : Upgrading Metadata - current version 10 as step towards target version 11
2023-02-09T15:39:25,439 [upgrade.Upgrader10to11] INFO : upgrade metadata entries
2023-02-09T15:39:25,446 [manager.Manager] INFO : New servers: [localhost:9997[100002204fb0003]]
2023-02-09T15:39:25,447 [manager.EventCoordinator] INFO : There are now 1 tablet servers
2023-02-09T15:39:25,447 [manager.Manager] INFO : tserver availability check disabled, continuing with-1 servers. To enable, set manager.startup.tserver.avail.min.count
2023-02-09T15:39:25,465 [manager.Manager] ERROR: unable to get tablet server status localhost:9997[100002204fb0003] org.apache.thrift.transport.TTransportException: Socket is closed by peer.
2023-02-09T15:39:25,465 [manager.Manager] DEBUG: unable to get tablet server status localhost:9997[100002204fb0003]
org.apache.thrift.transport.TTransportException: Socket is closed by peer.
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:176) ~[libthrift-0.17.0.jar:0.17.0]
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:100) ~[libthrift-0.17.0.jar:0.17.0]
    at org.apache.thrift.transport.layered.TFramedTransport.readFrame(TFramedTransport.java:132) ~[libthrift-0.17.0.jar:0.17.0]
    at org.apache.thrift.transport.layered.TFramedTransport.read(TFramedTransport.java:100) ~[libthrift-0.17.0.jar:0.17.0]
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:100) ~[libthrift-0.17.0.jar:0.17.0]
    at org.apache.thrift.protocol.TCompactProtocol.readByte(TCompactProtocol.java:622) ~[libthrift-0.17.0.jar:0.17.0]
    at org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:479) ~[libthrift-0.17.0.jar:0.17.0]
    at org.apache.thrift.protocol.TProtocolDecorator.readMessageBegin(TProtocolDecorator.java:156) ~[libthrift-0.17.0.jar:0.17.0]
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) ~[libthrift-0.17.0.jar:0.17.0]
    at org.apache.accumulo.core.tabletserver.thrift.TabletServerClientService$Client.recv_getTabletServerStatus(TabletServerClientService.java:186) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.core.tabletserver.thrift.TabletServerClientService$Client.getTabletServerStatus(TabletServerClientService.java:172) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.server.manager.LiveTServerSet$TServerConnection.getTableMap(LiveTServerSet.java:144) ~[accumulo-server-base-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.manager.Manager.lambda$gatherTableInformation$3(Manager.java:981) ~[accumulo-manager-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at java.lang.Thread.run(Thread.java:829) ~[?:?]
...
2023-02-09T15:39:25,555 [upgrade.UpgradeCoordinator] ERROR: FATAL: Error performing upgrade
java.lang.RuntimeException: org.apache.accumulo.core.client.AccumuloSecurityException: Error BAD_CREDENTIALS for user !SYSTEM - Username or Password is Invalid
    at org.apache.accumulo.core.clientImpl.ScannerIterator.getNextBatch(ScannerIterator.java:183) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.core.clientImpl.ScannerIterator.hasNext(ScannerIterator.java:107) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.manager.upgrade.Upgrader10to11.readReplFilesFromMetadata(Upgrader10to11.java:121) ~[accumulo-manager-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.manager.upgrade.Upgrader10to11.upgradeMetadata(Upgrader10to11.java:111) ~[accumulo-manager-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.manager.upgrade.UpgradeCoordinator.lambda$upgradeMetadata$0(UpgradeCoordinator.java:208) ~[accumulo-manager-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.core.trace.TraceWrappedCallable.call(TraceWrappedCallable.java:53) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
    at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: org.apache.accumulo.core.client.AccumuloSecurityException: Error BAD_CREDENTIALS for user !SYSTEM - Username or Password is Invalid
    at org.apache.accumulo.core.clientImpl.ThriftScanner.getBatchFromServer(ThriftScanner.java:166) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.core.metadata.MetadataLocationObtainer.lookupTablet(MetadataLocationObtainer.java:108) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.core.clientImpl.TabletLocatorImpl.lookupTabletLocation(TabletLocatorImpl.java:537) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.core.clientImpl.TabletLocatorImpl.lookupTabletLocationAndCheckLock(TabletLocatorImpl.java:716) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.core.clientImpl.TabletLocatorImpl._locateTablet(TabletLocatorImpl.java:701) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.core.clientImpl.TabletLocatorImpl.locateTablet(TabletLocatorImpl.java:505) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.core.clientImpl.ThriftScanner.scan(ThriftScanner.java:316) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.core.clientImpl.ScannerIterator.readBatch(ScannerIterator.java:154) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.core.clientImpl.ScannerIterator.getNextBatch(ScannerIterator.java:172) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    ... 11 more
Caused by: org.apache.accumulo.core.clientImpl.thrift.ThriftSecurityException
    at org.apache.accumulo.core.tabletscan.thrift.TabletScanClientService$startScan_result$startScan_resultStandardScheme.read(TabletScanClientService.java:4583) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.core.tabletscan.thrift.TabletScanClientService$startScan_result$startScan_resultStandardScheme.read(TabletScanClientService.java:4559) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.core.tabletscan.thrift.TabletScanClientService$startScan_result.read(TabletScanClientService.java:4465) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:93) ~[libthrift-0.17.0.jar:0.17.0]
    at org.apache.accumulo.core.tabletscan.thrift.TabletScanClientService$Client.recv_startScan(TabletScanClientService.java:121) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.core.tabletscan.thrift.TabletScanClientService$Client.startScan(TabletScanClientService.java:92) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.core.clientImpl.ThriftScanner.getBatchFromServer(ThriftScanner.java:137) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.core.metadata.MetadataLocationObtainer.lookupTablet(MetadataLocationObtainer.java:108) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.core.clientImpl.TabletLocatorImpl.lookupTabletLocation(TabletLocatorImpl.java:537) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.core.clientImpl.TabletLocatorImpl.lookupTabletLocationAndCheckLock(TabletLocatorImpl.java:716) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.core.clientImpl.TabletLocatorImpl._locateTablet(TabletLocatorImpl.java:701) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.core.clientImpl.TabletLocatorImpl.locateTablet(TabletLocatorImpl.java:505) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.core.clientImpl.ThriftScanner.scan(ThriftScanner.java:316) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.core.clientImpl.ScannerIterator.readBatch(ScannerIterator.java:154) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    at org.apache.accumulo.core.clientImpl.ScannerIterator.getNextBatch(ScannerIterator.java:172) ~[accumulo-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
    ... 11 more
```

EdColeman commented 1 year ago

When running a 2.1 manager, killing the tserver, and then starting a 3.0 tserver, the "new" tserver fails to communicate with the manager.

The manager log keeps repeating that it cannot get the tserver's status and cannot get it to halt, so it's obvious there is an issue with that tserver.

The gc cannot scan the metadata (Failed to locate tablet for table : +r row : ~del)

2.1 manager log

```
2023-02-09T17:14:07,477 [manager.Manager] DEBUG: unable to get tablet server status ip-x:9997[100007a41b6000d]
org.apache.thrift.TApplicationException: Invalid method name: 'getTabletServerStatus'
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:81) ~[libthrift-0.17.0.jar:0.17.0]
    at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_getTabletServerStatus(TabletClientService.java:596) ~[accumulo-core-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
    at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.getTabletServerStatus(TabletClientService.java:582) ~[accumulo-core-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
    at org.apache.accumulo.server.manager.LiveTServerSet$TServerConnection.getTableMap(LiveTServerSet.java:142) ~[accumulo-server-base-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
    at org.apache.accumulo.manager.Manager.lambda$gatherTableInformation$3(Manager.java:983) ~[accumulo-manager-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
    at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-core-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-core-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
    at java.lang.Thread.run(Thread.java:829) ~[?:?]
2023-02-09T17:14:07,477 [manager.Manager] WARN : attempting to stop ip-x:9997[100007a41b6000d]
2023-02-09T17:14:07,478 [manager.Manager] INFO : error talking to troublesome tablet server
org.apache.thrift.TApplicationException: Invalid method name: 'halt'
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:81) ~[libthrift-0.17.0.jar:0.17.0]
    at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_halt(TabletClientService.java:682) ~[accumulo-core-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
    at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.halt(TabletClientService.java:667) ~[accumulo-core-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
    at org.apache.accumulo.server.manager.LiveTServerSet$TServerConnection.halt(LiveTServerSet.java:154) ~[accumulo-server-base-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
    at org.apache.accumulo.manager.Manager.lambda$gatherTableInformation$3(Manager.java:1005) ~[accumulo-manager-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
    at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-core-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-core-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
    at java.lang.Thread.run(Thread.java:829) ~[?:?]
```

EdColeman commented 1 year ago

With an upgraded 3.0 manager and tserver, and the gc still running from the previous 2.1 instance, the gc fails to run.

2.1 gc log

```
2023-02-09T17:24:47,363 [gc.SimpleGarbageCollector] WARN : Error BAD_CREDENTIALS for user !SYSTEM - Username or Password is Invalid
org.apache.accumulo.core.client.AccumuloSecurityException: Error BAD_CREDENTIALS for user !SYSTEM - Username or Password is Invalid
    at org.apache.accumulo.core.clientImpl.TableOperationsImpl._flush(TableOperationsImpl.java:989) ~[accumulo-core-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
    at org.apache.accumulo.core.clientImpl.TableOperationsImpl.flush(TableOperationsImpl.java:831) ~[accumulo-core-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
    at org.apache.accumulo.gc.SimpleGarbageCollector.run(SimpleGarbageCollector.java:294) ~[accumulo-gc-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
    at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-core-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
    at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: org.apache.accumulo.core.clientImpl.thrift.ThriftSecurityException
    at org.apache.accumulo.core.manager.thrift.ManagerClientService$initiateFlush_result$initiateFlush_resultStandardScheme.read(ManagerClientService.java:5244) ~[accumulo-core-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
    at org.apache.accumulo.core.manager.thrift.ManagerClientService$initiateFlush_result$initiateFlush_resultStandardScheme.read(ManagerClientService.java:5221) ~[accumulo-core-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
    at org.apache.accumulo.core.manager.thrift.ManagerClientService$initiateFlush_result.read(ManagerClientService.java:5148) ~[accumulo-core-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:93) ~[libthrift-0.17.0.jar:0.17.0]
    at org.apache.accumulo.core.manager.thrift.ManagerClientService$Client.recv_initiateFlush(ManagerClientService.java:163) ~[accumulo-core-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
    at org.apache.accumulo.core.manager.thrift.ManagerClientService$Client.initiateFlush(ManagerClientService.java:148) ~[accumulo-core-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
    at org.apache.accumulo.core.clientImpl.TableOperationsImpl._flush(TableOperationsImpl.java:950) ~[accumulo-core-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
    ... 4 more
```

ctubbsii commented 1 year ago

With the proposed ServiceLockData abstraction that @dlmarion was working on, we could also serialize the data version into the lock information. It could help going forward, to allow the manager to identify that there are no running tservers on the upgraded version.
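As a rough illustration of that idea, the sketch below assumes each tserver lock carries a data version alongside its address. The `LockInfo` type and both methods are hypothetical and do not reflect the actual ServiceLockData design; the versions 10 and 11 in the usage note are the ones shown in the upgrade log above.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: LockInfo is not the real ServiceLockData class.
final class LockVersionCheck {

  // Illustrative lock payload: the tserver address plus the data version it runs with.
  static final class LockInfo {
    final String address;
    final int dataVersion;

    LockInfo(String address, int dataVersion) {
      this.address = address;
      this.dataVersion = dataVersion;
    }
  }

  // True if at least one registered tserver already runs the target data version.
  static boolean hasTserverAtVersion(List<LockInfo> locks, int targetVersion) {
    return locks.stream().anyMatch(l -> l.dataVersion == targetVersion);
  }

  // Tservers still registered with a different data version than the target.
  static List<LockInfo> staleTservers(List<LockInfo> locks, int targetVersion) {
    return locks.stream()
        .filter(l -> l.dataVersion != targetVersion)
        .collect(Collectors.toList());
  }
}
```

With a single lock at version 10 and a target of 11, `hasTserverAtVersion` returns false and `staleTservers` returns that one entry, which is the information the manager would need in order to log a clear message before attempting the metadata upgrade.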

The main concern I have from the above investigation is the inconsistent state of upgrading ZooKeeper, but not being able to complete the rest of the upgrade. We should be able to resume and finish the upgrade once a newer tserver is online and hosting the metadata.

EdColeman commented 1 year ago

The restart did work and recovered normally when it had a tserver running the correct version.

Currently it looks like when the master comes up, there are no tservers registered with ZooKeeper in ../table_locks. It may be sufficient on upgrade to drop any locks that are present and allow the tservers to perform reassignment when commanded by the master. A more thorough approach would be to reach out and see whether they respond to a status request. Exploring options now.

apache / accumulo

Look into how upgrade handles running processes #1300