Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0
6.74k stars 2.92k forks source link

ERROR "Failed to get master address during initialization" #15580

Open mjzfrank opened 2 years ago

mjzfrank commented 2 years ago

Alluxio Version: 2.8.0

Describe the bug Master report this error :

2022-05-19 10:51:11,057 ERROR ClientMasterSync - Failed to get master address during initialization
alluxio.exception.status.UnavailableException: Failed to handshake with master hadoop001:19998 to load cluster default configuration
 values: UNAVAILABLE: io exception
        at alluxio.util.ConfigurationUtils.loadConfiguration(ConfigurationUtils.java:520)
        at alluxio.ClientContext.loadConf(ClientContext.java:134)
        at alluxio.client.metrics.ClientMasterSync.loadConf(ClientMasterSync.java:122)
        at alluxio.client.metrics.ClientMasterSync.heartbeat(ClientMasterSync.java:74)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
        at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
        at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
        at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
        at alluxio.grpc.MetaMasterConfigurationServiceGrpc$MetaMasterConfigurationServiceBlockingStub.getConfiguration(MetaMasterCon
figurationServiceGrpc.java:424)
        at alluxio.util.ConfigurationUtils.loadConfiguration(ConfigurationUtils.java:514)
        ... 10 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: hadoop001/${ip}:19998
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
        at io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)
        at io.netty.channel.unix.Socket.finishConnect(Socket.java:251)
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:672)
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:649)
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:529)
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at java.lang.Thread.run(Thread.java:748)
2022-05-19 10:51:23,993 INFO  ContextHandler - Started o.e.j.s.ServletContextHandler@710c2b53{/,file:///opt/alluxio-2.8.0/webui/mast
er/build/,AVAILABLE}

To Reproduce ./alluxio-start.sh all

Expected behavior how to fix this

Urgency

Are you planning to fix it No

Additional context here is my properties

alluxio.master.hostname=hadoop001
alluxio.master.mount.table.root.ufs=hdfs://mycluster/alluxio/data
alluxio.underfs.hdfs.configuration=${hadoop}/core-site.xml:${hadoop}/hdfs-site.xml
alluxio.debug=true

alluxio.security.authorization.permission.enabled=false
alluxio.security.authentication.type=NOSASL

alluxio.master.journal.type=UFS
alluxio.master.journal.folder=hdfs://mycluster/alluxio/journal
alluxio.master.journal.log.size.bytes.max=200MB

alluxio.worker.tieredstore.levels=2
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/opt/alluxiotieredstore/ramdisk
alluxio.worker.tieredstore.level0.dirs.quota=10GB
alluxio.worker.tieredstore.level0.reserved.ratio=0.2
alluxio.worker.tieredstore.level0.dirs.mediumtype=MEM

alluxio.worker.tieredstore.level1.alias=SSD
alluxio.worker.tieredstore.level1.dirs.path=/opt/alluxiotieredstore/alluxio-blockpool
alluxio.worker.tieredstore.level1.dirs.quota=100GB
alluxio.worker.tieredstore.level1.reserved.ratio=0.1
alluxio.worker.tieredstore.level1.dirs.mediumtype=SSD

alluxio.worker.web.port=7009
alluxio.master.web.port=7010
alluxio.user.metrics.collection.enabled=true
alluxio.user.file.readtype.default=CACHE

./alluxio fs works well and the port 19998 is on , why master.log report this error ?

YangchenYe323 commented 2 years ago

It doesn't seem to be a bug on Alluxio's side. Some possible fixes:

  1. hadoop001 is a custom hostname, make sure worker's machine recognizes this hostname(by configuring proper mappings in /etc/hosts on CentOS, for example), or try to configure Alluxio with explicit IP addresses.
  2. Make sure master's rpc port is open. Alluxio processes uses many different ports for different uses: https://docs.alluxio.io/os/user/stable/en/deploy/Requirements.html#:~:text=Inbound%20TCP%2029999%20%2D%20The%20Alluxio,job%20worker's%20default%20data%20port
LiuJiahao0001 commented 2 years ago

@YangchenYe323 @mjzfrank I had the same problem running alluxio cluster 2.8.0 (the most basic configuration). Excuse me, how should this be solved? I have changed the master hostname many times, but the problem still exists.

LiuJiahao0001 commented 2 years ago

Some ports are not open, which leads to my problem. After opening the required ports, the problem is resolved. thank you

np-ftrwei commented 1 year ago

Oh well - which ports did you open? @LiuJiahao0001 ? Also, ports on master / worker / pod / node / container ... Sharing some useful info will help others

kelvin17 commented 1 year ago

I have the same issue when i deploy on local mode. the master is OK,but the worker start faild. the error is the line 1 in the worker.log. And I can access the webui on port 30000(worker) and 19999(master) And I can use cmd like ‘./bin/alluxio fs ls /’ But I can not use cmd ‘./bin/alluxio fs cat /xxx.txt’

2022-10-20 10:58:58,395 WARN  RetryUtils - Failed to load cluster default configuration with master (attempt 8): alluxio.exception.status.UnavailableException: Failed to handshake with master localhost:19998 to load cluster default configuration values: UNAVAILABLE: io exception
2022-10-20 10:59:34,835 INFO  TieredIdentityFactory - Initialized tiered identity TieredIdentity(node=30.46.113.147, rack=null)
2022-10-20 10:59:34,846 INFO  BlockWorkerFactory - Creating DefaultBlockWorker
2022-10-20 10:59:34,870 INFO  DefaultStorageDir - Folder /Volumes/ramdisk/alluxioworker was created!
2022-10-20 10:59:34,870 INFO  DefaultStorageDir - StorageDir initialized: path=/Volumes/ramdisk/alluxioworker, tier=alluxio.worker.block.meta.DefaultStorageTier@5154011c, dirIndex=0, medium=MEM, capacityBytes=1073741824, reservedBytes=0, availableBytes=1073741824
2022-10-20 10:59:34,933 INFO  MetricsSystem - Starting sinks with config: {}.
2022-10-20 10:59:34,935 INFO  MetricsHeartbeatContext - Created metrics heartbeat with ID app-6951769813988764773. This ID will be used for identifying info from the client. It can be set manually through the alluxio.user.app.id property
2022-10-20 10:59:35,119 INFO  log - Logging initialized @55080ms to org.eclipse.jetty.util.log.Slf4jLog
2022-10-20 10:59:35,138 INFO  MetricsHeartbeatContext - Created metrics heartbeat with ID app-704029929771442854. This ID will be used for identifying info from the client. It can be set manually through the alluxio.user.app.id property
2022-10-20 10:59:43,811 INFO  NettyUtils - EPOLL is not available, will use NIO
2022-10-20 11:00:05,323 INFO  GrpcDataServer - Alluxio worker gRPC server started, listening on /0.0.0.0:29999
2022-10-20 11:00:05,325 INFO  ProcessUtils - Starting Alluxio worker @30.46.113.147:29999.
2022-10-20 11:00:05,325 INFO  ProcessUtils - Alluxio version: 2.8.1-d8b71d66f26086d8918b0ae40f0da458c8af8b4c
2022-10-20 11:00:05,325 INFO  ProcessUtils - Java version: 1.8.0_275
2022-10-20 11:00:05,382 INFO  BlockMapIterator - Worker register stream batchSize=1000000
2022-10-20 11:00:05,407 INFO  RegisterStreamer - Worker 5506142347260564401 - All requests have been sent. Completing the client side.
2022-10-20 11:00:05,425 INFO  RegisterStreamer - Worker 5506142347260564401 - Waiting on the master side to complete
2022-10-20 11:00:05,431 INFO  RegisterStreamer - 5506142347260564401 - Complete message received from the server. Closing stream
2022-10-20 11:00:05,432 INFO  RegisterStreamer - Worker 5506142347260564401 - Finished registration with a stream
2022-10-20 11:00:05,434 INFO  WebServer - Alluxio worker web service starting @ /0.0.0.0:30000
2022-10-20 11:00:05,435 INFO  Server - jetty-9.4.46.v20220331; built: 2022-03-31T16:38:08.030Z; git: bc17a0369a11ecf40bb92c839b9ef0a8ac50ea18; jvm 1.8.0_275-b01
2022-10-20 11:00:05,464 INFO  ContextHandler - Started o.e.j.s.ServletContextHandler@42f8285e{/metrics/json,null,AVAILABLE}
2022-10-20 11:00:05,464 INFO  ContextHandler - Started o.e.j.s.ServletContextHandler@26bab2f1{/metrics/prometheus,null,AVAILABLE}
2022-10-20 11:00:05,466 WARN  SecurityHandler - ServletContext@o.e.j.s.ServletContextHandler@7a560583{/,file:///Users/jingyuan/Documents/project/bkmpp-alluxio/webui/worker/build/,STARTING} has uncovered http methods for path: /
2022-10-20 11:00:12,994 INFO  ContextHandler - Started o.e.j.s.ServletContextHandler@7a560583{/,file:///Users/jingyuan/Documents/project/bkmpp-alluxio/webui/worker/build/,AVAILABLE}
2022-10-20 11:00:13,002 INFO  AbstractConnector - Started ServerConnector@3fce8fd9{HTTP/1.1, (http/1.1)}{0.0.0.0:30000}
2022-10-20 11:00:13,002 INFO  Server - Started @92964ms
2022-10-20 11:00:13,002 INFO  WebServer - Alluxio worker web service started @ /0.0.0.0:30000
2022-10-20 11:00:13,010 INFO  AlluxioWorkerProcess - Alluxio worker started. id=5506142347260564401, bindHost=0.0.0.0, connectHost=30.46.113.147, rpcPort=29999, webPort=30000
2022-10-20 11:13:50,378 INFO  GrpcDataServer - Shutting down Alluxio worker gRPC server at 0.0.0.0/0.0.0.0:29999.
2022-10-20 11:13:50,381 INFO  AlluxioWorkerProcess - Alluxio worker ended
2022-10-20 11:13:50,381 INFO  ProcessUtils - Stopping Alluxio worker @30.46.113.147:29999.
2022-10-20 11:13:54,474 INFO  AbstractConnector - Stopped ServerConnector@3fce8fd9{HTTP/1.1, (http/1.1)}{0.0.0.0:30000}
2022-10-20 11:13:54,484 INFO  ContextHandler - Stopped o.e.j.s.ServletContextHandler@7a560583{/,file:///Users/jingyuan/Documents/project/bkmpp-alluxio/webui/worker/build/,STOPPED}
2022-10-20 11:13:54,484 INFO  ContextHandler - Stopped o.e.j.s.ServletContextHandler@26bab2f1{/metrics/prometheus,null,STOPPED}
2022-10-20 11:13:54,484 INFO  ContextHandler - Stopped o.e.j.s.ServletContextHandler@42f8285e{/metrics/json,null,STOPPED}
2022-10-20 11:13:54,486 WARN  JvmPauseMonitor - JVM pause monitor interrupted during sleep.
ghost commented 1 year ago

@kelvin17 do you fixed the issue? i have the same issue 2.8

andyzheung commented 1 year ago

LiuJiahao0001

how to solve this problem?

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

lokeshkh92 commented 1 year ago

I am also facing the same issue with Alluxio version 2.9.0 Has someone got any solution for the same?

Please note that I have correct entries in "/etc/hosts" as well.

2023-02-11 09:33:56,187 ERROR ClientMasterSync - Failed to get master address during initialization alluxio.exception.status.UnavailableException: Failed to handshake with master zmlssrelbapp02:19998 to load cluster default configuration values: UNIMPLEMENTED: Method not found: alluxio.grpc.meta.MetaMasterConfigurationService/GetConfiguration at alluxio.conf.Configuration.loadConfiguration(Configuration.java:458) at alluxio.ClientContext.loadConf(ClientContext.java:120) at alluxio.client.metrics.ClientMasterSync.loadConf(ClientMasterSync.java:122) at alluxio.client.metrics.ClientMasterSync.heartbeat(ClientMasterSync.java:74) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: io.grpc.StatusRuntimeException: UNIMPLEMENTED: Method not found: alluxio.grpc.meta.MetaMasterConfigurationService/GetConfiguration at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262) at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243) at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156) at alluxio.grpc.MetaMasterConfigurationServiceGrpc$MetaMasterConfigurationServiceBlockingStub.getConfiguration(MetaMasterConfigurationServiceGrpc.java:424) at alluxio.conf.Configuration.loadConfiguration(Configuration.java:452) ... 10 more

@YangchenYe323 @mjzfrank : if possible, can you please guide?

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

eselyavka commented 1 year ago

I was having the same problem in my docker env My Env

Alluxio version: 2.9.3
Docker version 20.10.21, build baeda1f
Arch: Apple M1 Max
OS: Ventura 13.3.1

log error

2023-05-17 21:30:40,218 INFO  [metrics-master-heartbeat-0](NettyUtils.java:146) - EPOLL_MODE is available
2023-05-17 21:30:41,079 ERROR [metrics-master-heartbeat-0](SamplingLogger.java:360) - Failed to get master address during initialization
alluxio.exception.status.UnavailableException: Failed to handshake with master alluxio-master:19998 to load cluster default configuration values: UNAVAILABLE: io exception
    at alluxio.conf.Configuration.loadConfiguration(Configuration.java:458)
    at alluxio.ClientContext.loadConf(ClientContext.java:123)
    at alluxio.client.metrics.ClientMasterSync.loadConf(ClientMasterSync.java:122)
    at alluxio.client.metrics.ClientMasterSync.heartbeat(ClientMasterSync.java:74)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
    at alluxio.grpc.MetaMasterConfigurationServiceGrpc$MetaMasterConfigurationServiceBlockingStub.getConfiguration(MetaMasterConfigurationServiceGrpc.java:424)
    at alluxio.conf.Configuration.loadConfiguration(Configuration.java:452)
    ... 10 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection reset by peer: alluxio-master/172.21.0.2:19998
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection reset by peer
    at io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)
    at io.netty.channel.unix.Socket.finishConnect(Socket.java:251)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:672)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:649)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:529)
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at java.lang.Thread.run(Thread.java:750)

It looks like because JVM process is taking too long to come up in docker env, causing heartbeat service to fail because master reside in provisioning state and RPC port is closed. I double this property alluxio.user.metrics.heartbeat.interval.ms to 20sec and this change did help me.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.