
Apache Accumulo
https://accumulo.apache.org
Apache License 2.0

tserver could not complete shutdown #3368

Closed ivakegg closed 1 year ago

ivakegg commented 1 year ago

Doing a cluster shutdown, there was one tserver that would not complete. Its log was spamming the following:

- [tablet.Tablet] DEBUG: Waiting to completeClose for !0;.... , 0 writes 21 scans
- [clientImpl.ThriftScanner] DEBUG: Error getting transport to node:9997 : null

This was simply repeating, and the <node> in the error message was itself!

Accumulo 2.1.1-SNAPSHOT on CentOS 7.3

ctubbsii commented 1 year ago

@ivakegg If using a snapshot, can you provide the commit used? If not sure, the built jars probably have the commit in their manifest.

cshannon commented 1 year ago

Looks like this is the loop it's stuck in, waiting for the scans to finish. And the log message being printed in ThriftScanner unfortunately isn't logging the full stack trace.

@ivakegg - It seems like the message portion of the exception is null. Since you are using a SNAPSHOT build (and if you can reproduce the issue), I would update the logging here to log the full stack trace and try to log the TException being caught.
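
For illustration, a minimal sketch of the kind of change being suggested (hypothetical code, not the actual ThriftScanner; the class, method, and variable names are made up). The point is to pass the caught exception to the logger so SLF4J emits the full stack trace instead of only its (possibly null) message:

    import org.apache.thrift.TException;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    class TransportLoggingSketch {
      private static final Logger log = LoggerFactory.getLogger(TransportLoggingSketch.class);

      // Hypothetical helper standing in for the catch block in ThriftScanner.
      void reportTransportError(String server, TException e) {
        // before (loses the stack trace): log.debug("Error getting transport to " + server + " : " + e.getMessage());
        // after: the exception passed as the last argument is logged with its stack trace
        log.debug("Error getting transport to {}", server, e);
      }
    }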

ctubbsii commented 1 year ago

@cshannon The logging of the InterruptedException at https://github.com/apache/accumulo/blob/b41427dc18c9fa36c9e619ebf928a02aebeb22fb/server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Tablet.java#LL987C33-L987C33 is also poorly formatted. It should be log.error("{}", e, e); instead of log.error(e.toString());

Regardless of what is discovered from troubleshooting this, that logging should be fixed.

There may be cases where we omit the full stack trace for some rational reason, but those cases should have comments in the code explaining the justification.
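
For reference, a minimal sketch of the pattern difference (illustrative code only, not the actual Tablet.java). Logging e.toString() records just the exception class and message, while passing the exception as the final argument also records the stack trace:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    class CloseWaitLoggingSketch {
      private static final Logger log = LoggerFactory.getLogger(CloseWaitLoggingSketch.class);

      // Hypothetical stand-in for the wait inside the tablet close loop.
      void waitForClose(Object lock) {
        try {
          synchronized (lock) {
            lock.wait(50);
          }
        } catch (InterruptedException e) {
          // before: log.error(e.toString());
          log.error("{}", e, e); // message is e.toString(); the stack trace is preserved
        }
      }
    }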

ctubbsii commented 1 year ago

I'm marking this as a blocker to fix the logging for 2.1.1 at a minimum.

cshannon commented 1 year ago

Good catch on the logging issue; I will at least fix that now.

ivakegg commented 1 year ago

With the updated logging, I believe we can close this issue for now, pending reproducing this error. I have only seen it once thus far, and we have bounced the system several times since.

ivakegg commented 1 year ago

We appear to be able to reproduce this fairly easily on one of our test systems. Reopening and will add details.

EdColeman commented 1 year ago

Can you confirm that the same host name / fqdn is used consistently in all places?

dtspence commented 1 year ago

@EdColeman Yes, we will check.

EDIT: The nodes are using a short-hostname. There is evidence in the cluster that FQDN hostnames have been used on t-servers in the past two weeks - however, all currently are using the short hostnames.

dtspence commented 1 year ago

When attempting to stop a single t-server (i.e. bin/accumulo-cluster stop-here) on a test system (Cluster-1), the command-line stop hung and the t-server continued to receive assignments. The monitor showed the t-server's assignment count cycling up and down. From the logs, the manager appeared to assign tablets back to the t-server being stopped. We attempted to reproduce the issue on a second test cluster (Cluster-2), and there the t-server shut down as expected. However, we were not sure whether the assignment on Cluster-2, which occurs after the t-server shutdown, is expected or unexpected (expanded below).

For the purposes of the logs below, Cluster-1 refers to the system with the t-server shutdown issue and Cluster-2 refers to the other system.

We have noticed two configuration differences between the systems:

The following was observed on Cluster-1 (t-server hanging during shutdown):

- Seeding FATE[...] Shutdown tserver <shutdown-host:port> [...]
- Tablet Server shutdown requested for <shutdown-host:port> [...]
- tablet <tid;begin;end> was unloaded from <shutdown-host:port> [...]
- tablet ... was unloaded on <shutdown-host:port> [...]
- tablet ... was unloaded on <shutdown-host:port> [...]
- tablet ... was unloaded on <shutdown-host:port> [...]
- ...
- Sending 1 tablets to balancer for table accumulo.metadata for assignment within t-servers [..., <shutdown-host:port>, ...]
- Assigning 1 tablets
- Assigned !0,~del... to <shutdown-host:port> [....]
- tablet ... was unloaded on <shutdown-host:port> [...]
- tablet ... was unloaded on <shutdown-host:port> [...]
- tablet ... was unloaded on <shutdown-host:port> [...]
- tablet <!0,~del> was loaded on <shutdown-host:port> [...]
- tablet ... was unloaded on <shutdown-host:port> [...]
- Sending 14 tablets to balancer for table <application-table> for assignment within t-servers [..., <shutdown-host:port>, ...]
- ...
- Assigning XXXX tablets
- ...

The Cluster-2 logs reflect the following:

- tablet ... was unloaded on <shutdown-host:port> [...]
- ...
- tablet ... was unloaded on <shutdown-host:port> [...]
- tablet server hosts no tablets <shutdown-host:port> [...]
- unable to get tablet server status <shutdown-host:port> [...]
- IOException: No connection to <shutdown-host:port> [...]
- not balancing because the balance information is out of date
- not balancing just yet, as collection of live tservers is in flux
- Sending X tablets to balancer for table <user-table> for assignment within t-servers  [..., <shutdown-host:port>, ...]
- Assigned <user-tablet-1> to  <shutdown-host:port>
- ...
- Assigned <user-tablet-n> to  <shutdown-host:port> [...]
- Could not connect to server <shutdown-host:port> [...]
- ....
- Could not connect to server <shutdown-host:port> [...]
- [Normal Tablets]: 10237 tablets are SUSPENDED
- Detected change in current tserver set, re-running state machine.
- not balancing because there are unhosted tablets: 10237
- 9597 assigned to dead servers [<user-tablet-1@(Location [server=<shutdown-host:port>, type=FUTURE],null,Location[server=<shutdown-host:port>,type=LAST]),.....
- Suspended <user-tablet-1> to <shutdown-host:port> at <> with 1 walogs
- Sending X tablets to balancer for <user-table> for assignment within tservers [...]   
- balancer assigned <user-tablet-> to <host-a:port> which is not the suggested location of <shutdown-host:port>

The Cluster-2 logs above show the t-server shutting down after the tablets unload. However, we were not sure whether the assignment after the unload is expected or unexpected (it would be similar to Cluster-1). The assigned tablets are sent to the shut-down t-server (and the list of candidate t-servers contains the shut-down t-server). The manager then observes that the server is unreachable, and after tablet suspension, the candidate t-server list for the next assignment no longer includes the shut-down t-server.

dtspence commented 1 year ago

Attempted to replicate with an integration test using a MiniAccumuloCluster. Additionally reviewed the source related to tablet assignment after a t-server shutdown is initiated. There may be a change between 1.10 and 2.1 in the TabletGroupWatcher that impacted the t-servers sent to the balancer for assignment.

The list of servers being shut down is used by the TabletGroupWatcher to create a list of destinations: https://github.com/apache/accumulo/blob/2.1/server/manager/src/main/java/org/apache/accumulo/manager/TabletGroupWatcher.java#L159

The flushChanges() method attempts to get assignments from the balancer. The t-server set passed in is the current list (which includes the t-servers being shut down). https://github.com/apache/accumulo/blob/2.1/server/manager/src/main/java/org/apache/accumulo/manager/TabletGroupWatcher.java#L910

In previous versions (i.e. 1.10), the list passed to flushChanges() appears to have been the destinations, which may have excluded the t-servers being shut down. https://github.com/apache/accumulo/blob/1.10/server/master/src/main/java/org/apache/accumulo/master/TabletGroupWatcher.java#L214 https://github.com/apache/accumulo/blob/1.10/server/master/src/main/java/org/apache/accumulo/master/TabletGroupWatcher.java#L361 https://github.com/apache/accumulo/blob/1.10/server/master/src/main/java/org/apache/accumulo/master/TabletGroupWatcher.java#L875
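
To make the difference concrete, here is a minimal sketch under the assumption described above (the names currentTServers, serversToShutdown, and the methods are illustrative, not the actual TabletGroupWatcher fields or API):

    import java.util.HashSet;
    import java.util.Set;

    class AssignmentCandidatesSketch {
      // Hypothetical state, for illustration only.
      Set<String> currentTServers = new HashSet<>();   // all live t-servers
      Set<String> serversToShutdown = new HashSet<>(); // t-servers with a shutdown requested

      // 1.10-style behavior (as described above): the balancer only sees destinations,
      // i.e. live servers minus those being shut down.
      Set<String> destinationsForBalancer() {
        Set<String> destinations = new HashSet<>(currentTServers);
        destinations.removeAll(serversToShutdown);
        return destinations;
      }

      // 2.1-style behavior (as described above): the balancer sees the full current set,
      // so it can assign tablets back to a t-server that is shutting down.
      Set<String> currentSetForBalancer() {
        return currentTServers;
      }
    }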

Continuing to investigate.

dlmarion commented 1 year ago

> EDIT: The nodes are using a short-hostname. There is evidence in the cluster that FQDN hostnames have been used on t-servers in the past two weeks - however, all currently are using the short hostnames.

Did you also look in zookeeper?

dtspence commented 1 year ago

> EDIT: The nodes are using a short-hostname. There is evidence in the cluster that FQDN hostnames have been used on t-servers in the past two weeks - however, all currently are using the short hostnames.
>
> Did you also look in zookeeper?

I had not previously looked in zookeeper. I checked and zookeeper reports short-hostnames, with one exception. The replication/tservers location references FQDN names.

dlmarion commented 1 year ago

So, it sounds like the Manager is re-assigning the metadata table tablets immediately, even to a TabletServer that is shutting down. What is manager.metadata.suspendable set to?
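
One way to check the current value (a sketch using the public client API; the "client.properties" path is a placeholder, and reading the system configuration may require elevated permissions):

    import java.util.Map;
    import org.apache.accumulo.core.client.Accumulo;
    import org.apache.accumulo.core.client.AccumuloClient;

    public class CheckSuspendable {
      public static void main(String[] args) throws Exception {
        // "client.properties" stands in for your Accumulo client configuration file.
        try (AccumuloClient client = Accumulo.newClient().from("client.properties").build()) {
          Map<String, String> sysConfig = client.instanceOperations().getSystemConfiguration();
          System.out.println("manager.metadata.suspendable = "
              + sysConfig.get("manager.metadata.suspendable"));
        }
      }
    }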

ctubbsii commented 1 year ago

I'm going to bump this to 2.1.2, in anticipation of a 2.1.1 soon. I don't think this needs to be a blocker anymore. But if we get a change ready to address this before 2.1.1 release candidates are created, we can include it in 2.1.1 instead of pushing off until 2.1.2.

dtspence commented 1 year ago

> So, it sounds like the Manager is re-assigning the metadata table tablets immediately, even to a TabletServer that is shutting down. What is manager.metadata.suspendable set to?

The value is set to "false" on systems we are using for testing.

ivakegg commented 1 year ago

We have seen this issue again. We don't know if it has the same cause. These are tservers that exclusively handle accumulo.metadata tablets. They were stuck spamming the following messages:

- SessionManager INFO : Closing idle session from user=!SYSTEM, client=<manager node>
- Tablet DEBUG: Waiting to completeClose for ..., 0 writes 1 scans

This was with Accumulo 2.1.1-RC2.

ivakegg commented 1 year ago

We have jstacks and logs captured for anybody who wants to come by and look (test system, manager node).

ctubbsii commented 1 year ago

A change was already released to address this in 2.1.1 (what was 2.1.1-RC2). Rather than introduce a bunch of confusion over the target/fix versions, I'm (re-)closing this and creating a new issue: #3512