Adds new accumulo command #5073

Open kevinrr888 opened 1 week ago

kevinrr888 commented 1 week ago

closes #5034

I did a deep dive into the code executed by accumulo check-server-config to figure out what is being checked, whether the need for an instance can be avoided, and what could be checked without one. Here is the root code for reference: https://github.com/apache/accumulo/blob/ed4161ea21dd9b7293be0ca72bc3651d65513440/server/base/src/main/java/org/apache/accumulo/server/conf/CheckServerConfig.java#L30-L34

Here is a summary of my findings:

With this info, I wrote a new check, CheckServerConfigNoInstance, which, to the best of my knowledge, performs all the validation that the current CheckServerConfig can do without a running instance. So, this PR adds a new accumulo check-server-config-no-inst command (in addition to the existing accumulo check-server-config) which performs the checks that accumulo check-server-config does, minus the checks that require a running Accumulo instance. This is useful for (at least mostly) validating the accumulo.properties file without an instance.
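
For a rough idea, here is a minimal sketch of this kind of no-instance validation (not the actual CheckServerConfigNoInstance code from this PR; the class name here is only illustrative):

    // Minimal sketch of no-instance validation; not the actual
    // CheckServerConfigNoInstance implementation from this PR.
    import org.apache.accumulo.core.conf.SiteConfiguration;
    import org.apache.accumulo.server.ServerDirs;
    import org.apache.accumulo.server.fs.VolumeManagerImpl;
    import org.apache.hadoop.conf.Configuration;

    public class NoInstanceCheckSketch {
      public static void main(String[] args) throws Exception {
        // Parses and validates accumulo.properties found on the classpath
        var siteConfig = SiteConfiguration.auto();
        var hadoopConfig = new Configuration();
        // Builds the volume manager from the configured volumes, without
        // contacting a running instance
        try (var fs = VolumeManagerImpl.get(siteConfig, hadoopConfig)) {
          // Validates the instance volume/directory related settings
          new ServerDirs(siteConfig, hadoopConfig);
        }
      }
    }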

I'm not sure which branch this should target. Targeting 2.1 for now since the change could be made as early as 2.1 and it may be a desirable command there, but I'm not sure if we would want to add a new command in the 2.1 line.

Any input/suggestions on this PR would be appreciated.

dlmarion commented 1 week ago

@kevinrr888 - wondering if we could just run the no-instance checks when the error indicating that an instance doesn't exist is thrown, so that we can re-use the command? Thoughts?

kevinrr888 commented 1 week ago

> @kevinrr888 - wondering if we could just run the no-instance checks when the error indicating that an instance doesn't exist is thrown, so that we can re-use the command? Thoughts?

I considered that, but thought it might hide some issues. For example, if a user expects that they have a running instance and runs accumulo check-server-config without one, it would run fewer checks than expected and would not surface any no-instance errors. To me, it seemed safest to split these into two separate commands and keep the no-instance errors in the original command.

dlmarion commented 1 week ago

I was thinking a stern warning could be printed to the screen that no instance is running, so we are doing a reduced set of checks. I assume someone is going to run this interactively and not via a script. If we don't want to make that assumption, we could return a non-zero exit code.
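
Something along these lines (a sketch of the idea, not code from the PR):

    // Sketch: warn that only a reduced set of checks ran, and exit non-zero
    // so scripted callers can detect this case.
    static void warnReducedChecks() {
      System.err.println("WARN: no Accumulo instance is running;"
          + " only a reduced set of configuration checks was performed.");
      System.exit(1);
    }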

kevinrr888 commented 1 week ago

> I was thinking a stern warning could be printed to the screen that no instance is running, so we are doing a reduced set of checks. I assume someone is going to run this interactively and not via a script. If we don't want to make that assumption, we could return a non-zero exit code.

Yeah, I think a warning like that would be a way to replace the extra command. However, another thing I'm thinking of is that it might be tricky to identify what counts as a no-instance exception. A certain exception currently occurs in that case, but that exception may change.

dlmarion commented 1 week ago

We could just raise our own exception, NoInstanceException, for example.
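
For example (a sketch of that suggestion, not code from the PR):

    // Hypothetical exception the check could throw when no instance is found,
    // so callers don't have to guess which deeper exception means "no instance".
    public class NoInstanceException extends Exception {
      private static final long serialVersionUID = 1L;

      public NoInstanceException(String message, Throwable cause) {
        super(message, cause);
      }
    }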

kevinrr888 commented 1 week ago

I removed the new check and now check whether an instance exists before the full set of checks runs in CheckServerConfig. The instance is checked with VolumeManager.getInstanceIDFromHdfs(instanceIdPath, hadoopConfig). Coincidentally, the setup needed for this call performs all* the checks that can be performed without a running instance: SiteConfiguration.auto(), VolumeManagerImpl.get(siteConfig, hadoopConfig), and new ServerDirs(siteConfig, hadoopConfig).
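
Roughly, the flow looks like this (a sketch based on my reading of the 2.1 APIs, not the exact diff; in particular, deriving instanceIdPath via getFirst() and getInstanceIdLocation() is my assumption):

    // Sketch: run the no-instance validation, then verify an instance exists
    // before the full checks run. The instanceIdPath derivation below
    // (getFirst()/getInstanceIdLocation()) is an assumption, not the PR code.
    var siteConfig = SiteConfiguration.auto();
    var hadoopConfig = new Configuration();
    try (var fs = VolumeManagerImpl.get(siteConfig, hadoopConfig)) {
      var serverDirs = new ServerDirs(siteConfig, hadoopConfig);
      Path instanceIdPath = serverDirs.getInstanceIdLocation(fs.getFirst());
      // Throws if HDFS is unreachable or the instance id cannot be read
      VolumeManager.getInstanceIDFromHdfs(instanceIdPath, hadoopConfig);
      // An instance exists, so the full set of checks can safely run
    }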

I wanted to avoid just surrounding the existing code:

    try (var context = new ServerContext(SiteConfiguration.auto())) {
      context.getConfiguration();
    }

with a try-catch, since the exception that occurs is a few layers deep, may be changed in the future, and it would not be clear that changing those exceptions would affect this command. Also, with a try-catch we would have to distinguish a no-instance exception from many other potential exceptions. Calling the method directly avoids these issues.

*to the best of my knowledge

kevinrr888 commented 1 week ago

One issue with the current implementation is that an exception stack trace (this time a ConnectException) is still printed even though the calling code handles all exceptions. My guess is that the exception is occurring in another thread. So, with no instance running, the output of accumulo check-server-config looks like this:

2024-11-19T12:16:25,916 [conf.SiteConfiguration] INFO : Found Accumulo configuration on classpath at /home/krathbun/Desktop/github/fluo-uno/install/accumulo-2.1.4-SNAPSHOT/conf/accumulo.properties
2024-11-19T12:16:26,300 [fs.VolumeManager] ERROR: Problem reading instance id out of hdfs at hdfs://localhost:8020/accumulo/instance_id
java.net.ConnectException: Call From groot/10.0.0.182 to localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:?]
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77) ~[?:?]
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:?]
    at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:500) ~[?:?]
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:481) ~[?:?]
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:930) ~[hadoop-client-api-3.3.6.jar:?]
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:845) ~[hadoop-client-api-3.3.6.jar:?]
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1571) ~[hadoop-client-api-3.3.6.jar:?]
    at org.apache.hadoop.ipc.Client.call(Client.java:1513) ~[hadoop-client-api-3.3.6.jar:?]
    at org.apache.hadoop.ipc.Client.call(Client.java:1410) ~[hadoop-client-api-3.3.6.jar:?]
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258) ~[hadoop-client-api-3.3.6.jar:?]
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139) ~[hadoop-client-api-3.3.6.jar:?]
    at jdk.proxy2/jdk.proxy2.$Proxy30.getListing(Unknown Source) ~[?:?]
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:689) ~[hadoop-client-api-3.3.6.jar:?]
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) ~[?:?]
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
    at java.base/java.lang.reflect.Method.invoke(Method.java:569) ~[?:?]
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433) ~[hadoop-client-api-3.3.6.jar:?]
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166) ~[hadoop-client-api-3.3.6.jar:?]
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158) ~[hadoop-client-api-3.3.6.jar:?]
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96) ~[hadoop-client-api-3.3.6.jar:?]
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362) ~[hadoop-client-api-3.3.6.jar:?]
    at jdk.proxy2/jdk.proxy2.$Proxy31.getListing(Unknown Source) ~[?:?]
    at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1702) ~[hadoop-client-api-3.3.6.jar:?]
    at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1686) ~[hadoop-client-api-3.3.6.jar:?]
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1113) ~[hadoop-client-api-3.3.6.jar:?]
    at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:149) ~[hadoop-client-api-3.3.6.jar:?]
    at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1188) ~[hadoop-client-api-3.3.6.jar:?]
    at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1185) ~[hadoop-client-api-3.3.6.jar:?]
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) ~[hadoop-client-api-3.3.6.jar:?]
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1195) ~[hadoop-client-api-3.3.6.jar:?]
    at org.apache.accumulo.server.fs.VolumeManager.getInstanceIDFromHdfs(VolumeManager.java:214) ~[accumulo-server-base-2.1.4-SNAPSHOT.jar:2.1.4-SNAPSHOT]
    at org.apache.accumulo.server.conf.CheckServerConfig.main(CheckServerConfig.java:54) ~[accumulo-server-base-2.1.4-SNAPSHOT.jar:2.1.4-SNAPSHOT]
    at org.apache.accumulo.server.conf.CheckServerConfig.execute(CheckServerConfig.java:78) ~[accumulo-server-base-2.1.4-SNAPSHOT.jar:2.1.4-SNAPSHOT]
    at org.apache.accumulo.start.Main.lambda$execKeyword$0(Main.java:122) ~[accumulo-start-2.1.4-SNAPSHOT.jar:2.1.4-SNAPSHOT]
    at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
Caused by: java.net.ConnectException: Connection refused
    at java.base/sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
    at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672) ~[?:?]
    at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946) ~[?:?]
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:205) ~[hadoop-client-api-3.3.6.jar:?]
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:600) ~[hadoop-client-api-3.3.6.jar:?]
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:652) ~[hadoop-client-api-3.3.6.jar:?]
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:773) ~[hadoop-client-api-3.3.6.jar:?]
    at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:347) ~[hadoop-client-api-3.3.6.jar:?]
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1632) ~[hadoop-client-api-3.3.6.jar:?]
    at org.apache.hadoop.ipc.Client.call(Client.java:1457) ~[hadoop-client-api-3.3.6.jar:?]
    ... 28 more
2024-11-19T12:16:26,300 [conf.CheckServerConfig] WARN : Performed only a subset of checks (which passed). There is a problem relating to the instance:
Can't tell if Accumulo is initialized; can't read instance id at hdfs://localhost:8020/accumulo/instance_id
and no further checks could be done. If this is unexpected, make sure an instance is running and re-run the command.

The subset of checks still ran and passed (as noted in the output), but the exception unfortunately makes that info less visible; it might be better if the exception were not shown here.

dlmarion commented 1 week ago

Regarding the exception being printed: that might be due to the logging configuration for the client (log4j2.properties). If that is set to direct output to a file, then you likely would not see it. For that reason, any output that we have from the command may want to go to System.out or System.err.
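
i.e., something like this (a sketch, not the PR code):

    // Sketch: print user-facing output directly instead of via the logger, so
    // it is visible even if log4j2.properties routes client logs to a file.
    static void reportReducedChecks(Exception cause) {
      System.out.println("Performed only a subset of checks (which passed)."
          + " There is a problem relating to the instance: " + cause.getMessage());
    }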

kevinrr888 commented 1 week ago

Gotcha. Changed to print the warning. That logged error from my run was my only concern with this, but if it's fine with you then I have no problem with it.