Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0
6.81k stars 2.93k forks source link

Alluxio master throws exception log during startup #17382

Open codings-dan opened 1 year ago

codings-dan commented 1 year ago

Alluxio Version: git commit: 764d3c8d0cd7d24db4b18f6a6569ba05202b9b6f

Describe the bug

Alluxio master throws exception log during startup as below:

Starting master @ DRAGONYLIU-MB2. Logging to /Users/yaolongliu/local/myproject/alluxio/logs
--- [ FAILED ] The master @ DRAGONYLIU-MB2 is not serving requests.
--- Printing the log tail for /Users/yaolongliu/local/myproject/alluxio/logs/master.log
>>> BEGIN
2023-05-05 21:01:25,177 INFO  [localhost_19200@group-ABB3109A44C1-SegmentedRaftLogWorker](SegmentedRaftLogWorker.java:613) - localhost_19200@group-ABB3109A44C1-SegmentedRaftLogWorker: Rolled log segment from /Users/yaolongliu/local/myproject/alluxio/journal/raft/02511d47-d67c-49a3-9011-abb3109a44c1/current/log_inprogress_11 to /Users/yaolongliu/local/myproject/alluxio/journal/raft/02511d47-d67c-49a3-9011-abb3109a44c1/current/log_11-14
2023-05-05 21:01:25,185 INFO  [localhost_19200@group-ABB3109A44C1-SegmentedRaftLogWorker](SegmentedRaftLogWorker.java:656) - localhost_19200@group-ABB3109A44C1-SegmentedRaftLogWorker: created new log segment /Users/yaolongliu/local/myproject/alluxio/journal/raft/02511d47-d67c-49a3-9011-abb3109a44c1/current/log_inprogress_15
2023-05-05 21:01:25,988 ERROR [main](MasterHealthCheckClient.java:294) - Exception thrown in master health check client
java.util.concurrent.ExecutionException: java.lang.RuntimeException: alluxio.exception.status.NotFoundException: Service FILE_SYSTEM_MASTER_CLIENT_SERVICE is not found.
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at alluxio.master.MasterHealthCheckClient.isServing(MasterHealthCheckClient.java:290)
    at alluxio.master.AlluxioMasterMonitor.main(AlluxioMasterMonitor.java:53)
Caused by: java.lang.RuntimeException: alluxio.exception.status.NotFoundException: Service FILE_SYSTEM_MASTER_CLIENT_SERVICE is not found.
    at alluxio.common.RpcPortHealthCheckClient.isServing(RpcPortHealthCheckClient.java:73)
    at alluxio.master.MasterHealthCheckClient$MasterServingHealthCheck.run(MasterHealthCheckClient.java:165)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: alluxio.exception.status.NotFoundException: Service FILE_SYSTEM_MASTER_CLIENT_SERVICE is not found.
    at alluxio.exception.status.AlluxioStatusException.from(AlluxioStatusException.java:141)
    at alluxio.exception.status.AlluxioStatusException.fromStatusRuntimeException(AlluxioStatusException.java:215)
    at alluxio.exception.status.AlluxioStatusException.fromThrowable(AlluxioStatusException.java:197)
    at alluxio.util.network.NetworkAddressUtils.pingService(NetworkAddressUtils.java:730)
    at alluxio.common.RpcPortHealthCheckClient.isServing(RpcPortHealthCheckClient.java:66)
    ... 6 more
Caused by: io.grpc.StatusRuntimeException: NOT_FOUND: Service FILE_SYSTEM_MASTER_CLIENT_SERVICE is not found.
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
    at alluxio.grpc.ServiceVersionClientServiceGrpc$ServiceVersionClientServiceBlockingStub.getServiceVersion(ServiceVersionClientServiceGrpc.java:171)
    at alluxio.util.network.NetworkAddressUtils.pingService(NetworkAddressUtils.java:727)
    ... 7 more
<<< EOF
--- Printing the log tail for /Users/yaolongliu/local/myproject/alluxio/logs/master.out

To Reproduce complie alluxio and then execute bin/alluxio-start.sh master

Expected behavior No exception throws

Urgency High

Additional context I suppose it may be related to https://github.com/Alluxio/alluxio/pull/16998

codings-dan commented 1 year ago

@elega @jenoudet Could you take a look at this issue? Thanks in advance!

humengyu2012 commented 1 year ago

I also encountered the same issue before, but it was because I lost the configuration when I reinstalled the master.

jenoudet commented 1 year ago

16998 made every master have an active gRPC server at all times, both as follower and leader. When the master first starts, it starts as follower and only registers the RAFT_JOURNAL_SERVICE. The health check manages to connect to said gRPC server but cannot find the FILE_SYSTEM_MASTER_CLIENT_SERVICE, which causes the above exception to be thrown.

Thanks for bringing this bug to my attention. I'm working on a fix. In the meantime, a quick fix would be to run bin/alluxio-start.sh -a master. This launches the master in the background asynchronously and should not report any errors.

elega commented 1 year ago

@codings-dan So I looked into this issue a bit. The monitor is a separate thread started by bin/alluxio-start.sh master. So the error you saw was from that thread instead of the alluxio master. This monitor only pings the master to check the master healthiness when the master is not in HA (https://github.com/Alluxio/alluxio/blob/main/dora/shell/src/main/java/alluxio/master/MasterHealthCheckClient.java#L272-L292). So you will only see this issue in the single master mode. Since after Arthur's change, the master starts with a partial grpc server in a STANDBY state and then converts to PRIMARY later, RpcPortHealthCheckClient might got the NOT_FOUND exception when the grpc server has not been started yet. I think Arthur's fix works and fixed this problem.

One way to solve this issue better is that if single master is used, we don't go through the master selection process and just start all grpc servers on the master as we now it must be a PRIMARY. However, i doubt if this is worth doing because in production environment, most users use the HA mode.

In addition, you might concern if such behavior change causes more issues in the code base. As long as the PollingMasterInquireClient is used, there is no issue as we've taken care of the NotFound case already (https://github.com/Alluxio/alluxio/blob/master-2.x/core/common/src/main/java/alluxio/master/PollingMasterInquireClient.java#L144-L147).

maobaolong commented 1 year ago

@elega @jenoudet Hi , JournalIntegrationTest#concurrentWriteAndRestart NPE but passed test.

The following is the error stack.

java.lang.NullPointerException
    at alluxio.client.file.FileSystemContext.acquireClosableClientResource(FileSystemContext.java:556)
    at alluxio.client.file.FileSystemContext.acquireMasterClientResource(FileSystemContext.java:511)
    at alluxio.client.file.BaseFileSystem.rpc(BaseFileSystem.java:662)
    at alluxio.client.file.BaseFileSystem.createFile(BaseFileSystem.java:177)
    at alluxio.client.file.FileSystem.createFile(FileSystem.java:262)
    at alluxio.server.ft.journal.JournalIntegrationTest$FileToucher.run(JournalIntegrationTest.java:128)
    at java.lang.Thread.run(Thread.java:748)
Exception in thread "Thread-16" java.lang.NullPointerException
    at alluxio.client.file.FileSystemContext.acquireClosableClientResource(FileSystemContext.java:556)
    at alluxio.client.file.FileSystemContext.acquireMasterClientResource(FileSystemContext.java:511)
    at alluxio.client.file.BaseFileSystem.rpc(BaseFileSystem.java:662)
    at alluxio.client.file.BaseFileSystem.createFile(BaseFileSystem.java:177)
    at alluxio.client.file.FileSystem.createFile(FileSystem.java:262)
    at alluxio.server.ft.journal.JournalIntegrationTest$FileToucher.run(JournalIntegrationTest.java:128)
    at java.lang.Thread.run(Thread.java:748)

The related code is

image

jenoudet commented 1 year ago

I'm taking a look at it.

maobaolong commented 1 year ago

The reproduce step is sth like following.

image

The following is the start stop flow about this tests.

image

As we add a stop 10s in a master component, so the standby state live time getting larger than before, then the change for rpc call achieve to standby mode master getting larger too.

The tests use SingleMasterInquireClient by default, it access to master whatever it is primary or standby.

image

The flowing to do suggestion

1 Change the promote/demote flow to stopRpc->Promote/Demote MasterComponent->startRpc(newState) 2 StandbyException should be catch while a rpc call access to standby master accidentially. 3 We can change to use an individual snapshot rpc service to serve snapshot related operation between masters other than introduce standby grpc server. Introduced by https://github.com/Alluxio/alluxio/pull/16998 Fixed by https://github.com/Alluxio/alluxio/pull/17446

jenoudet commented 1 year ago

Would it be possible to create a new MasterInquireClient instead? One that queries the master leadership state through an RPC, rather than relying on the presence/absence of a gRPC server at a specific address? This would also be helpful for the multi-master scenario. I prefer this option because it creates a dedicated endpoint that can be used to determine the leader: looking to the future, I think it will be helpful to separate the concepts of running/stopped RPC server and primary/secondary master.

maobaolong commented 1 year ago

@jenoudet Yeah, this is great! This like the hdfs way, the rpc service running all the life time whatever it is Primary or standby, and it through the standby exception for the standby state namenode.

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.