apache / celeborn

Apache Celeborn is an elastic and high-performance service for shuffle and spilled data.
https://celeborn.apache.org/
Apache License 2.0

[QUESTION] Does the worker data directory have any requirements for the disk? #93

Closed fuhaiq closed 2 years ago

fuhaiq commented 2 years ago

Starting the worker on a virtual machine at my company fails with the following error:

[11:57:43.240] INFO  com.aliyun.emr.rss.common.internal.Logging 54 logInfo - Dispatcher numThreads: 64
[11:57:43.274] DEBUG io.netty.util.internal.logging.InternalLoggerFactory 63 useSlf4JLoggerFactory - Using SLF4J as the default logging framework
[11:57:43.279] DEBUG io.netty.util.internal.InternalThreadLocalMap 86 <clinit> - -Dio.netty.threadLocalMap.stringBuilder.initialSize: 1024
[11:57:43.279] DEBUG io.netty.util.internal.InternalThreadLocalMap 89 <clinit> - -Dio.netty.threadLocalMap.stringBuilder.maxSize: 4096
[11:57:43.286] INFO  com.aliyun.emr.rss.common.network.client.TransportClientFactory 98 <init> - mode NIO threads 64
[11:57:43.291] DEBUG io.netty.channel.MultithreadEventLoopGroup 44 <clinit> - -Dio.netty.eventLoopThreads: 24
[11:57:43.375] DEBUG io.netty.util.internal.PlatformDependent0 460 explicitNoUnsafeCause0 - -Dio.netty.noUnsafe: false
[11:57:43.376] DEBUG io.netty.util.internal.PlatformDependent0 954 javaVersion0 - Java version: 8
[11:57:43.377] DEBUG io.netty.util.internal.PlatformDependent0 135 <clinit> - sun.misc.Unsafe.theUnsafe: available
[11:57:43.377] DEBUG io.netty.util.internal.PlatformDependent0 159 <clinit> - sun.misc.Unsafe.copyMemory: available
[11:57:43.378] DEBUG io.netty.util.internal.PlatformDependent0 202 <clinit> - java.nio.Buffer.address: available
[11:57:43.379] DEBUG io.netty.util.internal.PlatformDependent0 272 <clinit> - direct buffer constructor: available
[11:57:43.379] DEBUG io.netty.util.internal.PlatformDependent0 350 <clinit> - java.nio.Bits.unaligned: available, true
[11:57:43.380] DEBUG io.netty.util.internal.PlatformDependent0 424 <clinit> - jdk.internal.misc.Unsafe.allocateUninitializedArray(int): unavailable prior to Java9
[11:57:43.380] DEBUG io.netty.util.internal.PlatformDependent0 446 <clinit> - java.nio.DirectByteBuffer.<init>(long, int): available
[11:57:43.380] DEBUG io.netty.util.internal.PlatformDependent 1116 unsafeUnavailabilityCause0 - sun.misc.Unsafe: available
[11:57:43.381] DEBUG io.netty.util.internal.PlatformDependent 1237 tmpdir0 - -Dio.netty.tmpdir: /tmp (java.io.tmpdir)
[11:57:43.381] DEBUG io.netty.util.internal.PlatformDependent 1316 bitMode0 - -Dio.netty.bitMode: 64 (sun.arch.data.model)
[11:57:43.382] DEBUG io.netty.util.internal.PlatformDependent 178 <clinit> - -Dio.netty.maxDirectMemory: 4294967296 bytes
[11:57:43.382] DEBUG io.netty.util.internal.PlatformDependent 185 <clinit> - -Dio.netty.uninitializedArrayAllocationThreshold: -1
[11:57:43.384] DEBUG io.netty.util.internal.CleanerJava6 92 <clinit> - java.nio.ByteBuffer.cleaner(): available
[11:57:43.384] DEBUG io.netty.util.internal.PlatformDependent 205 <clinit> - -Dio.netty.noPreferDirect: false
[11:57:43.385] DEBUG io.netty.channel.nio.NioEventLoop 109 <clinit> - -Dio.netty.noKeySetOptimization: false
[11:57:43.386] DEBUG io.netty.channel.nio.NioEventLoop 110 <clinit> - -Dio.netty.selectorAutoRebuildThreshold: 512
[11:57:43.393] DEBUG io.netty.util.internal.PlatformDependent$Mpsc 967 <clinit> - org.jctools-core.MpscChunkedArrayQueue: available
[11:57:43.418] DEBUG io.netty.util.ResourceLeakDetector 129 <clinit> - -Dio.netty.leakDetection.level: simple
[11:57:43.418] DEBUG io.netty.util.ResourceLeakDetector 130 <clinit> - -Dio.netty.leakDetection.targetRecords: 4
[11:57:43.422] DEBUG io.netty.buffer.PooledByteBufAllocator 155 <clinit> - -Dio.netty.allocator.numHeapArenas: 18
[11:57:43.423] DEBUG io.netty.buffer.PooledByteBufAllocator 156 <clinit> - -Dio.netty.allocator.numDirectArenas: 24
[11:57:43.423] DEBUG io.netty.buffer.PooledByteBufAllocator 158 <clinit> - -Dio.netty.allocator.pageSize: 8192
[11:57:43.423] DEBUG io.netty.buffer.PooledByteBufAllocator 163 <clinit> - -Dio.netty.allocator.maxOrder: 11
[11:57:43.423] DEBUG io.netty.buffer.PooledByteBufAllocator 167 <clinit> - -Dio.netty.allocator.chunkSize: 16777216
[11:57:43.423] DEBUG io.netty.buffer.PooledByteBufAllocator 168 <clinit> - -Dio.netty.allocator.smallCacheSize: 256
[11:57:43.423] DEBUG io.netty.buffer.PooledByteBufAllocator 169 <clinit> - -Dio.netty.allocator.normalCacheSize: 64
[11:57:43.424] DEBUG io.netty.buffer.PooledByteBufAllocator 170 <clinit> - -Dio.netty.allocator.maxCachedBufferCapacity: 32768
[11:57:43.424] DEBUG io.netty.buffer.PooledByteBufAllocator 171 <clinit> - -Dio.netty.allocator.cacheTrimInterval: 8192
[11:57:43.424] DEBUG io.netty.buffer.PooledByteBufAllocator 172 <clinit> - -Dio.netty.allocator.cacheTrimIntervalMillis: 0
[11:57:43.424] DEBUG io.netty.buffer.PooledByteBufAllocator 173 <clinit> - -Dio.netty.allocator.useCacheForAllThreads: true
[11:57:43.424] DEBUG io.netty.buffer.PooledByteBufAllocator 174 <clinit> - -Dio.netty.allocator.maxCachedByteBuffersPerChunk: 1023
[11:57:43.469] DEBUG io.netty.channel.DefaultChannelId 79 <clinit> - -Dio.netty.processId: 1071549 (auto-detected)
[11:57:43.471] DEBUG io.netty.util.NetUtil 135 <clinit> - -Djava.net.preferIPv4Stack: false
[11:57:43.471] DEBUG io.netty.util.NetUtil 136 <clinit> - -Djava.net.preferIPv6Addresses: false
[11:57:43.474] DEBUG io.netty.util.NetUtilInitializations 129 determineLoopback - Loopback interface: lo (lo, 0:0:0:0:0:0:0:1%lo)
[11:57:43.475] DEBUG io.netty.util.NetUtil$1 169 run - /proc/sys/net/core/somaxconn: 128
[11:57:43.476] DEBUG io.netty.channel.DefaultChannelId 101 <clinit> - -Dio.netty.machineId: 02:93:a0:ff:fe:cb:76:96 (auto-detected)
[11:57:43.500] DEBUG io.netty.buffer.ByteBufUtil 87 <clinit> - -Dio.netty.allocator.type: pooled
[11:57:43.500] DEBUG io.netty.buffer.ByteBufUtil 96 <clinit> - -Dio.netty.threadLocalDirectBufferSize: 0
[11:57:43.500] DEBUG io.netty.buffer.ByteBufUtil 99 <clinit> - -Dio.netty.maxThreadLocalCharBufferSize: 16384
[11:57:43.513] DEBUG com.aliyun.emr.rss.common.network.server.TransportServer 154 init - Shuffle server started on port: 46046
[11:57:43.518] INFO  com.aliyun.emr.rss.common.internal.Logging 54 logInfo - Successfully started service 'WorkerSys' on port 46046.
[11:57:43.535] INFO  com.aliyun.emr.rss.common.network.server.MemoryTracker 101 <init> - Memory tracker initialized with :  
 max direct memory : 4294967296 (4096.0 MB)
 direct memory critical : 3865470566 (3686.3999996185303 MB)
[11:57:43.603] ERROR com.aliyun.emr.rss.service.deploy.worker.DeviceMonitor$ 274 createDeviceMonitor - Device monitor init failed. java.lang.NullPointerException: null
        at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.$anonfun$getDeviceAndMountInfos$4(DeviceInfo.scala:123) ~[worker-1.0.0-shaded.jar:?]
        at scala.collection.Iterator.foreach(Iterator.scala:941) ~[worker-1.0.0-shaded.jar:?]
        at scala.collection.Iterator.foreach$(Iterator.scala:941) ~[worker-1.0.0-shaded.jar:?]
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) ~[worker-1.0.0-shaded.jar:?]
        at scala.collection.IterableLike.foreach(IterableLike.scala:74) ~[worker-1.0.0-shaded.jar:?]
        at scala.collection.IterableLike.foreach$(IterableLike.scala:73) ~[worker-1.0.0-shaded.jar:?]
        at scala.collection.AbstractIterable.foreach(Iterable.scala:56) ~[worker-1.0.0-shaded.jar:?]
        at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.getDeviceAndMountInfos(DeviceInfo.scala:120) ~[worker-1.0.0-shaded.jar:?]
        at com.aliyun.emr.rss.service.deploy.worker.LocalDeviceMonitor.init(DeviceMonitor.scala:174) ~[worker-1.0.0-shaded.jar:?]
        at com.aliyun.emr.rss.service.deploy.worker.DeviceMonitor$.createDeviceMonitor(DeviceMonitor.scala:266) [worker-1.0.0-shaded.jar:?]
        at com.aliyun.emr.rss.service.deploy.worker.LocalStorageManager.<init>(LocalStorageManager.scala:199) [worker-1.0.0-shaded.jar:?]
        at com.aliyun.emr.rss.service.deploy.worker.Worker.<init>(Worker.scala:74) [worker-1.0.0-shaded.jar:?]
        at com.aliyun.emr.rss.service.deploy.worker.Worker$.main(Worker.scala:960) [worker-1.0.0-shaded.jar:?]
        at com.aliyun.emr.rss.service.deploy.worker.Worker.main(Worker.scala) [worker-1.0.0-shaded.jar:?]

Exception in thread "main" java.lang.NullPointerException
        at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.$anonfun$getDeviceAndMountInfos$4(DeviceInfo.scala:123)
        at scala.collection.Iterator.foreach(Iterator.scala:941)
        at scala.collection.Iterator.foreach$(Iterator.scala:941)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
        at scala.collection.IterableLike.foreach(IterableLike.scala:74)
        at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
        at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.getDeviceAndMountInfos(DeviceInfo.scala:120)
        at com.aliyun.emr.rss.service.deploy.worker.LocalDeviceMonitor.init(DeviceMonitor.scala:174)
        at com.aliyun.emr.rss.service.deploy.worker.DeviceMonitor$.createDeviceMonitor(DeviceMonitor.scala:266)
        at com.aliyun.emr.rss.service.deploy.worker.LocalStorageManager.<init>(LocalStorageManager.scala:199)
        at com.aliyun.emr.rss.service.deploy.worker.Worker.<init>(Worker.scala:74)
        at com.aliyun.emr.rss.service.deploy.worker.Worker$.main(Worker.scala:960)
        at com.aliyun.emr.rss.service.deploy.worker.Worker.main(Worker.scala)

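I have tried many directories for rss.worker.base.dirs, including /dev/shm, and every one of them fails while the worker is fetching device information. The same setup works fine on Alibaba Cloud ECS, so I suspect the problem is with the disks of the company VM. Is there a simple way to check this?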

FMX commented 2 years ago

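@fuhaiq Hi, RSS uses the directories configured in rss.worker.base.dirs as base directories and, by default, adds 16 subdirectories under them. It also checks the disk status here, by looking for the device name of the same name under the /sys/block/ directory. For a non-standard environment, you can try setting rss.device.monitor.enabled to false first.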
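As a rough illustration of that workaround, the worker configuration might contain something like the fragment below. Only the two property names come from this thread. The file name (conf/rss-defaults.conf), the whitespace-separated key/value layout, the `#` comment syntax, and the directory value are assumptions, so check them against the configuration file shipped with your RSS version.

```properties
# Assumed location: conf/rss-defaults.conf (verify for your RSS version).
# Placeholder path; replace with your own data directories.
rss.worker.base.dirs /mnt/disk1
# Skip the /sys/block based device check on non-standard disks.
rss.device.monitor.enabled false
```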

See the code in the DeviceInfo class for details. It may be that this code does not support the disk configuration of your virtual machine.

It would also help if you could post the output of the df command.
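To answer the "simple way to check" question without reading the Scala source, the check described above can be approximated with a short script. This is a minimal sketch, not the logic in DeviceInfo: the function names are hypothetical, `df -P` is used only because its columns are stable, and the longest-prefix match against /sys/block entries is an assumption about how a "same-named device" might be found. If the script reports no matching block device for a directory, that directory is a plausible candidate for the failure in the log, although this thread does not confirm the root cause.

```python
import os
import subprocess
import sys


def parse_df(df_output):
    """Return (filesystem, mountpoint) pairs parsed from `df -P` output."""
    pairs = []
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        tokens = line.strip().split(None, 5)  # keep spaces inside the mountpoint
        if len(tokens) == 6:
            pairs.append((tokens[0], tokens[5]))
    return pairs


def find_mount(path, mounts):
    """Return the (filesystem, mountpoint) whose mountpoint is the longest prefix of path."""
    best = None
    for fs, mp in mounts:
        if path == mp or path.startswith(mp.rstrip("/") + "/"):
            if best is None or len(mp) > len(best[1]):
                best = (fs, mp)
    return best


def match_block_device(filesystem, blocks):
    """Assumed rule: the longest /sys/block entry that prefixes the device's base name."""
    name = filesystem.rsplit("/", 1)[-1]
    candidates = [b for b in blocks if name.startswith(b)]
    return max(candidates, key=len) if candidates else None


def check(path):
    """Describe which block device, if any, backs the given directory."""
    result = subprocess.run(["df", "-P"], capture_output=True, text=True, check=False)
    mounts = parse_df(result.stdout)
    blocks = os.listdir("/sys/block") if os.path.isdir("/sys/block") else []
    mount = find_mount(os.path.abspath(path), mounts)
    if mount is None:
        return f"{path}: no mountpoint found in df output"
    block = match_block_device(mount[0], blocks)
    if block is None:
        return f"{path}: filesystem {mount[0]} has no matching entry in /sys/block"
    return f"{path}: filesystem {mount[0]} maps to /sys/block/{block}"


if __name__ == "__main__":
    # Pass the directories from rss.worker.base.dirs as arguments.
    for directory in sys.argv[1:]:
        print(check(directory))
```

Saved under a name of your choice, the script takes each directory from rss.worker.base.dirs as an argument, for example /dev/shm. Filesystems such as tmpfs, or device-mapper names under /dev/mapper, have no same-named entry in /sys/block under this matching rule, which would make them show up as unmatched.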