Open kevinrr888 opened 1 week ago
@kevinrr888 - wondering if we could just run the no-instance checks if the error is thrown when an instance doesn't exist so that we can re-use the command? Thoughts?
I considered that, but thought it might hide some issues. For example, if a user expects a running instance and runs `accumulo check-server-config` without one, it would perform fewer checks than expected and would not surface any no-instance errors. To me, it seemed safest to split these into two separate commands and keep the no-instance errors of the original command.
I was thinking a stern warning could be printed to the screen that no instance is running, so we are doing a reduced set of checks. I assume someone is going to run this interactively and not via a script. If we don't want to make that assumption, we could return a non-zero exit code.
Yeah, I think a warning like that would be a way to replace the extra command. However, it might be tricky to identify what counts as a no-instance exception. A particular exception occurs today, but that exception may change.
We could just raise our own exception, e.g. `NoInstanceException`.
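A hypothetical sketch of that idea (the class and method names here are illustrative, not Accumulo API): translate whatever failure occurs while reading the instance id into one dedicated exception type, so callers don't have to guess which deep Hadoop exception means "no instance".

```java
import java.util.Optional;

// Hypothetical sketch: a dedicated exception type so callers can detect
// "no instance" without pattern-matching on deep Hadoop exceptions.
class InstanceCheck {

  static class NoInstanceException extends Exception {
    NoInstanceException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  // Stand-in for the real instance-id lookup; returns empty to simulate
  // an uninitialized/unreachable instance.
  static Optional<String> lookupInstanceId() {
    return Optional.empty();
  }

  // Any failure path surfaces as the single well-defined exception type.
  static String requireInstanceId() throws NoInstanceException {
    try {
      return lookupInstanceId()
          .orElseThrow(() -> new NoInstanceException("no instance id found", null));
    } catch (RuntimeException e) {
      throw new NoInstanceException("failed reading instance id", e);
    }
  }

  public static void main(String[] args) {
    try {
      requireInstanceId();
      System.out.println("instance found");
    } catch (NoInstanceException e) {
      // Here the command could fall back to the reduced, no-instance checks.
      System.out.println("no instance: running reduced checks");
    }
  }
}
```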
I removed the new check; `CheckServerConfig` now verifies that an instance exists before running the full set of checks. The instance is checked with `VolumeManager.getInstanceIDFromHdfs(instanceIdPath, hadoopConfig)`. Coincidentally, constructing the arguments needed for this call performs all* the checks that can be performed without a running instance: `SiteConfiguration.auto()`, `VolumeManagerImpl.get(siteConfig, hadoopConfig)`, and `new ServerDirs(siteConfig, hadoopConfig)`.
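The flow described above could be sketched roughly as follows. This is not the PR's actual code: the stub methods stand in for the real Accumulo/Hadoop calls (`SiteConfiguration.auto()`, `VolumeManagerImpl.get(...)`, `new ServerDirs(...)`, and `VolumeManager.getInstanceIDFromHdfs(...)`), so the sketch is self-contained.

```java
// Sketch of the single-command flow: run the checks that need no instance
// first, then proceed to the full checks only if an instance id can be read.
class CheckFlowSketch {

  // Stub for the no-instance portion of the checks (site config, volume
  // manager, server dirs construction).
  static void runNoInstanceChecks() {
    System.out.println("no-instance checks passed");
  }

  // Stub for VolumeManager.getInstanceIDFromHdfs(instanceIdPath, hadoopConfig).
  static String tryReadInstanceId() {
    return null; // simulate: no running/initialized instance
  }

  static void runFullChecks(String instanceId) {
    System.out.println("full checks passed for " + instanceId);
  }

  // Returns a non-zero code when only the reduced set ran, so scripted
  // callers can detect it.
  static int check() {
    runNoInstanceChecks();
    String id = tryReadInstanceId();
    if (id == null) {
      System.err.println("WARN: no instance found; only a subset of checks ran");
      return 1;
    }
    runFullChecks(id);
    return 0;
  }

  public static void main(String[] args) {
    System.out.println("exit code: " + check());
  }
}
```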
I wanted to avoid just surrounding the existing code:

```java
try (var context = new ServerContext(SiteConfiguration.auto())) {
  context.getConfiguration();
}
```

with a try-catch, since the exception occurs a few layers deep and may change in the future, and it would not be clear that changing those exceptions would affect this command. Also, with a try-catch we would have to distinguish a no-instance exception from many other potential exceptions. Calling the method directly avoids these issues.
*to the best of my knowledge
One issue with this current impl is that the stack trace will still show an exception (this time a `ConnectException`) even though the calling code handles all `Exception`s. My guess is that the exception is occurring in another thread. So, with no instance running, the output of `accumulo check-server-config` would look like this:
2024-11-19T12:16:25,916 [conf.SiteConfiguration] INFO : Found Accumulo configuration on classpath at /home/krathbun/Desktop/github/fluo-uno/install/accumulo-2.1.4-SNAPSHOT/conf/accumulo.properties
2024-11-19T12:16:26,300 [fs.VolumeManager] ERROR: Problem reading instance id out of hdfs at hdfs://localhost:8020/accumulo/instance_id
java.net.ConnectException: Call From groot/10.0.0.182 to localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:?]
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77) ~[?:?]
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:?]
at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:500) ~[?:?]
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:481) ~[?:?]
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:930) ~[hadoop-client-api-3.3.6.jar:?]
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:845) ~[hadoop-client-api-3.3.6.jar:?]
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1571) ~[hadoop-client-api-3.3.6.jar:?]
at org.apache.hadoop.ipc.Client.call(Client.java:1513) ~[hadoop-client-api-3.3.6.jar:?]
at org.apache.hadoop.ipc.Client.call(Client.java:1410) ~[hadoop-client-api-3.3.6.jar:?]
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258) ~[hadoop-client-api-3.3.6.jar:?]
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139) ~[hadoop-client-api-3.3.6.jar:?]
at jdk.proxy2/jdk.proxy2.$Proxy30.getListing(Unknown Source) ~[?:?]
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:689) ~[hadoop-client-api-3.3.6.jar:?]
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) ~[?:?]
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
at java.base/java.lang.reflect.Method.invoke(Method.java:569) ~[?:?]
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433) ~[hadoop-client-api-3.3.6.jar:?]
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166) ~[hadoop-client-api-3.3.6.jar:?]
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158) ~[hadoop-client-api-3.3.6.jar:?]
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96) ~[hadoop-client-api-3.3.6.jar:?]
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362) ~[hadoop-client-api-3.3.6.jar:?]
at jdk.proxy2/jdk.proxy2.$Proxy31.getListing(Unknown Source) ~[?:?]
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1702) ~[hadoop-client-api-3.3.6.jar:?]
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1686) ~[hadoop-client-api-3.3.6.jar:?]
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1113) ~[hadoop-client-api-3.3.6.jar:?]
at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:149) ~[hadoop-client-api-3.3.6.jar:?]
at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1188) ~[hadoop-client-api-3.3.6.jar:?]
at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1185) ~[hadoop-client-api-3.3.6.jar:?]
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) ~[hadoop-client-api-3.3.6.jar:?]
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1195) ~[hadoop-client-api-3.3.6.jar:?]
at org.apache.accumulo.server.fs.VolumeManager.getInstanceIDFromHdfs(VolumeManager.java:214) ~[accumulo-server-base-2.1.4-SNAPSHOT.jar:2.1.4-SNAPSHOT]
at org.apache.accumulo.server.conf.CheckServerConfig.main(CheckServerConfig.java:54) ~[accumulo-server-base-2.1.4-SNAPSHOT.jar:2.1.4-SNAPSHOT]
at org.apache.accumulo.server.conf.CheckServerConfig.execute(CheckServerConfig.java:78) ~[accumulo-server-base-2.1.4-SNAPSHOT.jar:2.1.4-SNAPSHOT]
at org.apache.accumulo.start.Main.lambda$execKeyword$0(Main.java:122) ~[accumulo-start-2.1.4-SNAPSHOT.jar:2.1.4-SNAPSHOT]
at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
Caused by: java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672) ~[?:?]
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946) ~[?:?]
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:205) ~[hadoop-client-api-3.3.6.jar:?]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:600) ~[hadoop-client-api-3.3.6.jar:?]
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:652) ~[hadoop-client-api-3.3.6.jar:?]
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:773) ~[hadoop-client-api-3.3.6.jar:?]
at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:347) ~[hadoop-client-api-3.3.6.jar:?]
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1632) ~[hadoop-client-api-3.3.6.jar:?]
at org.apache.hadoop.ipc.Client.call(Client.java:1457) ~[hadoop-client-api-3.3.6.jar:?]
... 28 more
2024-11-19T12:16:26,300 [conf.CheckServerConfig] WARN : Performed only a subset of checks (which passed). There is a problem relating to the instance:
Can't tell if Accumulo is initialized; can't read instance id at hdfs://localhost:8020/accumulo/instance_id
and no further checks could be done. If this is unexpected, make sure an instance is running and re-run the command.
The subset of checks still ran and passed (as noted in the output), but the exception unfortunately makes that information less visible; it might be better if the exception were not shown here.
Regarding the exception being printed, that might be due to the logging configuration for the client (log4j2.properties). If that were set to direct output to a file, then you likely would not see it. For that reason, any output we have from the command may want to use `System.out` or `System.err`.
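For example (a trivial sketch, with an illustrative message), writing user-facing output directly to the console streams keeps the warning visible regardless of how log4j2.properties routes logger output:

```java
// Sketch: user-facing command output written directly to the console streams,
// so it is visible even when the logging config sends log output to a file.
class WarnSketch {
  static String instanceWarning(String path) {
    return "WARN: can't read instance id at " + path
        + "; only a subset of checks were performed";
  }

  public static void main(String[] args) {
    // System.err rather than a logger: unaffected by log4j2.properties routing.
    System.err.println(instanceWarning("hdfs://localhost:8020/accumulo/instance_id"));
  }
}
```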
Gotcha. Changed to print the warning. That logged error from my run of it was my only concern with this, but if it's fine with you then I have no problems with it.
closes #5034
I did a deep dive into the code executed by `accumulo check-server-config` to figure out what is being checked, to see whether an instance can be avoided or what could be checked without one. Here is the root code for reference: https://github.com/apache/accumulo/blob/ed4161ea21dd9b7293be0ca72bc3651d65513440/server/base/src/main/java/org/apache/accumulo/server/conf/CheckServerConfig.java#L30-L34

Here is a summary of my findings:
- `SiteConfiguration.auto()` does validation: checks that the deprecated `accumulo-site.xml` file doesn't exist, loads properties from the config file (`accumulo.properties`), checks that a deprecated property and its replacement aren't both declared in the file, replaces deprecated properties with their non-deprecated versions, and returns a new `SiteConfiguration` object (the `SiteConfiguration` constructor calls `ConfigCheckUtil.validate(config)` to validate the config passed in).
- `context.getConfiguration()` does validation. However, as far as I can tell, the only validation done by this call requires an instance (the validation seems to stem from `ZooBasedConfiguration`/`PropSnapshot`/`PropStore`/`PropStoreKey`/etc.). It validates at least the `instance.secret` property (maybe more). Since this is related to an instance, it seems fine that it wouldn't be validated here anyway.
- `new ServerContext(SiteConfiguration.auto())` (= `new ServerContext(new ServerInfo(SiteConfiguration.auto()))`) does validation. This creates a `new ServerInfo(siteConfig)` and passes it to the `ServerContext` constructor. There is no validation done in the `ServerContext` constructor itself: most fields are memoized suppliers (the only one we access is `ServerConfigurationFactory serverConfFactory`, reached through `context.getConfiguration()`, covered above), and the fields that aren't memoized suppliers don't do any validation. But there is validation done when constructing `ServerInfo`. We will take what we can (it does some form of validation and is not reliant on an instance) and use that in our new check.

With this info, I wrote a new check, `CheckServerConfigNoInstance`, which, to the best of my knowledge, performs all the validation that the current `CheckServerConfig` can do without a running instance. So, this PR adds a new `accumulo check-server-config-no-inst` command (in addition to the existing `accumulo check-server-config`) which performs the checks that `accumulo check-server-config` does, minus the checks that require a running Accumulo instance. It is useful for (at least mostly) validating the `accumulo.properties` file without an instance.

I'm not sure which branch this should target. Targeting 2.1 for now, since the change could be made as early as 2.1 and the command may be desirable there, but I'm not sure if we would want to add a new command in the 2.1 line.
Any input/suggestions on this PR would be appreciated.