Open codings-dan opened 1 year ago
@elega @jenoudet Could you take a look at this issue? Thanks in advance!
I also encountered the same issue before, but it was because I lost the configuration when I reinstalled the master.
RAFT_JOURNAL_SERVICE
. The health check manages to connect to said gRPC server but cannot find the FILE_SYSTEM_MASTER_CLIENT_SERVICE
, which causes the above exception to be thrown.Thanks for bringing this bug to my attention. I'm working on a fix.
In the meantime, a quick fix would be to run bin/alluxio-start.sh -a master
. This launches the master in the background asynchronously and should not report any errors.
@codings-dan So I looked into this issue a bit. The monitor is a separate thread started by bin/alluxio-start.sh master. So the error you saw was from that thread instead of the alluxio master. This monitor only pings the master to check the master healthiness when the master is not in HA (https://github.com/Alluxio/alluxio/blob/main/dora/shell/src/main/java/alluxio/master/MasterHealthCheckClient.java#L272-L292). So you will only see this issue in the single master mode. Since after Arthur's change, the master starts with a partial grpc server in a STANDBY state and then converts to PRIMARY later, RpcPortHealthCheckClient might got the NOT_FOUND exception when the grpc server has not been started yet. I think Arthur's fix works and fixed this problem.
One way to solve this issue better is that if single master is used, we don't go through the master selection process and just start all grpc servers on the master as we now it must be a PRIMARY. However, i doubt if this is worth doing because in production environment, most users use the HA mode.
In addition, you might concern if such behavior change causes more issues in the code base. As long as the PollingMasterInquireClient is used, there is no issue as we've taken care of the NotFound case already (https://github.com/Alluxio/alluxio/blob/master-2.x/core/common/src/main/java/alluxio/master/PollingMasterInquireClient.java#L144-L147).
@elega @jenoudet Hi , JournalIntegrationTest#concurrentWriteAndRestart NPE but passed test.
The following is the error stack.
java.lang.NullPointerException
at alluxio.client.file.FileSystemContext.acquireClosableClientResource(FileSystemContext.java:556)
at alluxio.client.file.FileSystemContext.acquireMasterClientResource(FileSystemContext.java:511)
at alluxio.client.file.BaseFileSystem.rpc(BaseFileSystem.java:662)
at alluxio.client.file.BaseFileSystem.createFile(BaseFileSystem.java:177)
at alluxio.client.file.FileSystem.createFile(FileSystem.java:262)
at alluxio.server.ft.journal.JournalIntegrationTest$FileToucher.run(JournalIntegrationTest.java:128)
at java.lang.Thread.run(Thread.java:748)
Exception in thread "Thread-16" java.lang.NullPointerException
at alluxio.client.file.FileSystemContext.acquireClosableClientResource(FileSystemContext.java:556)
at alluxio.client.file.FileSystemContext.acquireMasterClientResource(FileSystemContext.java:511)
at alluxio.client.file.BaseFileSystem.rpc(BaseFileSystem.java:662)
at alluxio.client.file.BaseFileSystem.createFile(BaseFileSystem.java:177)
at alluxio.client.file.FileSystem.createFile(FileSystem.java:262)
at alluxio.server.ft.journal.JournalIntegrationTest$FileToucher.run(JournalIntegrationTest.java:128)
at java.lang.Thread.run(Thread.java:748)
The related code is
I'm taking a look at it.
The reproduce step is sth like following.
The following is the start stop flow about this tests.
As we add a stop 10s in a master component, so the standby state live time getting larger than before, then the change for rpc call achieve to standby mode master getting larger too.
The tests use SingleMasterInquireClient by default, it access to master whatever it is primary or standby.
1 Change the promote/demote flow to stopRpc->Promote/Demote MasterComponent->startRpc(newState) 2 StandbyException should be catch while a rpc call access to standby master accidentially. 3 We can change to use an individual snapshot rpc service to serve snapshot related operation between masters other than introduce standby grpc server. Introduced by https://github.com/Alluxio/alluxio/pull/16998 Fixed by https://github.com/Alluxio/alluxio/pull/17446
Would it be possible to create a new MasterInquireClient
instead? One that queries the master leadership state through an RPC, rather than relying on the presence/absence of a gRPC server at a specific address? This would also be helpful for the multi-master scenario.
I prefer this option because it creates a dedicated endpoint that can be used to determine the leader: looking to the future, I think it will be helpful to separate the concepts of running/stopped RPC server and primary/secondary master.
@jenoudet Yeah, this is great! This like the hdfs way, the rpc service running all the life time whatever it is Primary or standby, and it through the standby exception for the standby state namenode.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.
Alluxio Version: git commit: 764d3c8d0cd7d24db4b18f6a6569ba05202b9b6f
Describe the bug
Alluxio master throws exception log during startup as below:
To Reproduce complie alluxio and then execute
bin/alluxio-start.sh master
Expected behavior No exception throws
Urgency High
Additional context I suppose it may be related to https://github.com/Alluxio/alluxio/pull/16998