apache / celeborn

Apache Celeborn is an elastic and high-performance service for shuffle and spilled data.
https://celeborn.apache.org/
Apache License 2.0

[BUG] NullPointerException for worker on LVM format #46

Closed kettlelinna closed 2 years ago

kettlelinna commented 2 years ago

What is the bug?

The worker throws a NullPointerException when I start it on a node that uses LVM. As far as I can tell, com.aliyun.emr.rss.service.deploy.worker.DeviceInfo#getDeviceAndMountInfos does not support the LVM format, so it cannot get the correct mount information.
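
A minimal, hypothetical Scala sketch of the kind of lookup that can fail this way (the names below are made up; this is not the actual DeviceInfo code): mount points from /proc/mounts are resolved to device paths and looked up in a map of physical block devices, and an LVM mount's device-mapper path has no entry there, so the lookup returns null and the later dereference throws the NullPointerException.

import scala.io.Source

// Hypothetical class; the real Celeborn types differ.
case class BlockDevice(name: String)

object LvmMountSketch {
  // Map each mount point to the device path it is mounted from, as read from /proc/mounts.
  def mountToDevice(): Map[String, String] =
    Source.fromFile("/proc/mounts").getLines()
      .map(_.split("\\s+"))
      .collect { case Array(dev, mountPoint, _*) if dev.startsWith("/dev/") => mountPoint -> dev }
      .toMap

  def main(args: Array[String]): Unit = {
    // Pretend this map was built from /sys/block and only knows plain physical devices.
    val physicalDevices = Map("sda" -> BlockDevice("sda"), "sdb" -> BlockDevice("sdb"))

    mountToDevice().foreach { case (mountPoint, dev) =>
      // For an LVM mount the device is a device-mapper path such as /dev/mapper/vg_data-lv_rss,
      // so it is absent from the physical-device map and the lookup yields null.
      val info = physicalDevices.getOrElse(dev.stripPrefix("/dev/"), null)
      println(s"$mountPoint -> ${info.name}") // NullPointerException on LVM-backed mounts
    }
  }
}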

How to reproduce the bug?

Start the worker with $RSS_HOME/sbin/start-worker.sh rss://node01:9097 on a node that uses LVM.

Could you share logs or screenshots?

(screenshot attached)


kettlelinna commented 2 years ago

Some information about my environment that may help: (screenshots attached)

kettlelinna commented 2 years ago

By the way, when I filter out the information below, the worker can start: (screenshot attached)

FMX commented 2 years ago

@kettlelinna
You can disable the disk monitor by setting rss.device.monitor.enabled=false. We'll fix this later.
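
The setting uses the space-separated key/value form of the rss configuration file; for example (the same key and comment appear in the Helm ConfigMap further down this thread):

# If your hosts have disk RAID or use LVM, disable the device monitor
rss.device.monitor.enabled false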

bigdata-spec commented 2 years ago

@FMX When I set rss.device.monitor.enabled=false, I still get:

22/11/01 10:17:00,788 ERROR [main] DeviceMonitor: Device monitor init failed.
java.lang.NullPointerException
at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.$anonfun$getDeviceAndMountInfos$4(DeviceInfo.scala:123)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.getDeviceAndMountInfos(DeviceInfo.scala:120)
at com.aliyun.emr.rss.service.deploy.worker.LocalDeviceMonitor.init(DeviceMonitor.scala:174)
at com.aliyun.emr.rss.service.deploy.worker.DeviceMonitor$.createDeviceMonitor(DeviceMonitor.scala:266)
at com.aliyun.emr.rss.service.deploy.worker.LocalStorageManager.<init>(LocalStorageManager.scala:221)
at com.aliyun.emr.rss.service.deploy.worker.Worker.<init>(Worker.scala:80)
at com.aliyun.emr.rss.service.deploy.worker.Worker$.main(Worker.scala:994)
at com.aliyun.emr.rss.service.deploy.worker.Worker.main(Worker.scala)
Exception in thread "main" java.lang.NullPointerException
at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.$anonfun$getDeviceAndMountInfos$4(DeviceInfo.scala:123)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.getDeviceAndMountInfos(DeviceInfo.scala:120)
at com.aliyun.emr.rss.service.deploy.worker.LocalDeviceMonitor.init(DeviceMonitor.scala:174)
at com.aliyun.emr.rss.service.deploy.worker.DeviceMonitor$.createDeviceMonitor(DeviceMonitor.scala:266)
at com.aliyun.emr.rss.service.deploy.worker.LocalStorageManager.<init>(LocalStorageManager.scala:221)
at com.aliyun.emr.rss.service.deploy.worker.Worker.<init>(Worker.scala:80)
at com.aliyun.emr.rss.service.deploy.worker.Worker$.main(Worker.scala:994)
at com.aliyun.emr.rss.service.deploy.worker.Worker.main(Worker.scala)

version 0.1.1

FMX commented 2 years ago

@jiangbiao910 Hi, I think you can try version v0.1.3. Version v0.1.1 was released back in July.

bigdata-spec commented 2 years ago

@jiangbiao910 Hi, I think you can try version v0.1.3. Version v0.1.1 was released back in July.

The worker runs properly on the virtual machine; only the k8s deployment hits this error. Thanks, I will try 0.1.3 immediately.

bigdata-spec commented 2 years ago

@FMX Hi, I built the code with mvn on Windows and ran it on CentOS:

/root/helm-celeborn/celeborn-main/docker/rss-0.1.3-bin-release/conf/rss-env.sh: line 2: $'\r': command not found
/root/helm-celeborn/celeborn-main/docker/rss-0.1.3-bin-release/conf/rss-env.sh: line 10: $'\r': command not found

Running dos2unix on the scripts may work (for example, the commands below). Is there any other way?
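
A minimal example of normalizing the line endings on the CentOS host (the script path is taken from the errors above):

# convert CRLF to LF with dos2unix
dos2unix /root/helm-celeborn/celeborn-main/docker/rss-0.1.3-bin-release/conf/rss-env.sh
# or, if dos2unix is unavailable, strip the trailing carriage returns with sed
sed -i 's/\r$//' /root/helm-celeborn/celeborn-main/docker/rss-0.1.3-bin-release/conf/rss-env.sh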

FMX commented 2 years ago

@jiangbiao910 I think you can use the pre-built release package or package it on Linux.

bigdata-spec commented 2 years ago

@jiangbiao910 I think you can use the pre-built release package or package it on Linux.

When I try 0.1.3 on k8s:

root@celeborn-worker-0:/opt/celeborn/logs# tail -100f rss--com.aliyun.emr.rss.service.deploy.worker.Worker-1-celeborn-worker-0.out
Using Spark's default log4j profile: log4j-defaults.properties
22/11/01 12:32:29,081 INFO [main] Dispatcher: Dispatcher numThreads: 64
22/11/01 12:32:29,095 DEBUG [main] InternalLoggerFactory: Using SLF4J as the default logging framework
22/11/01 12:32:29,096 DEBUG [main] InternalThreadLocalMap: -Dio.netty.threadLocalMap.stringBuilder.initialSize: 1024
22/11/01 12:32:29,096 DEBUG [main] InternalThreadLocalMap: -Dio.netty.threadLocalMap.stringBuilder.maxSize: 4096
22/11/01 12:32:29,103 INFO [main] TransportClientFactory: mode NIO threads 64
22/11/01 12:32:29,105 DEBUG [main] MultithreadEventLoopGroup: -Dio.netty.eventLoopThreads: 2
22/11/01 12:32:29,130 DEBUG [main] PlatformDependent0: -Dio.netty.noUnsafe: false
22/11/01 12:32:29,130 DEBUG [main] PlatformDependent0: Java version: 8
22/11/01 12:32:29,131 DEBUG [main] PlatformDependent0: sun.misc.Unsafe.theUnsafe: available
22/11/01 12:32:29,131 DEBUG [main] PlatformDependent0: sun.misc.Unsafe.copyMemory: available
22/11/01 12:32:29,131 DEBUG [main] PlatformDependent0: sun.misc.Unsafe.storeFence: available
22/11/01 12:32:29,132 DEBUG [main] PlatformDependent0: java.nio.Buffer.address: available
22/11/01 12:32:29,132 DEBUG [main] PlatformDependent0: direct buffer constructor: available
22/11/01 12:32:29,132 DEBUG [main] PlatformDependent0: java.nio.Bits.unaligned: available, true
22/11/01 12:32:29,132 DEBUG [main] PlatformDependent0: jdk.internal.misc.Unsafe.allocateUninitializedArray(int): unavailable prior to Java9
22/11/01 12:32:29,132 DEBUG [main] PlatformDependent0: java.nio.DirectByteBuffer.<init>(long, int): available
22/11/01 12:32:29,132 DEBUG [main] PlatformDependent: sun.misc.Unsafe: available
22/11/01 12:32:29,133 DEBUG [main] PlatformDependent: -Dio.netty.tmpdir: /tmp (java.io.tmpdir)
22/11/01 12:32:29,133 DEBUG [main] PlatformDependent: -Dio.netty.bitMode: 64 (sun.arch.data.model)
22/11/01 12:32:29,134 DEBUG [main] PlatformDependent: -Dio.netty.maxDirectMemory: 1073741824 bytes
22/11/01 12:32:29,134 DEBUG [main] PlatformDependent: -Dio.netty.uninitializedArrayAllocationThreshold: -1
22/11/01 12:32:29,134 DEBUG [main] CleanerJava6: java.nio.ByteBuffer.cleaner(): available
22/11/01 12:32:29,134 DEBUG [main] PlatformDependent: -Dio.netty.noPreferDirect: false
22/11/01 12:32:29,135 DEBUG [main] NioEventLoop: -Dio.netty.noKeySetOptimization: false
22/11/01 12:32:29,135 DEBUG [main] NioEventLoop: -Dio.netty.selectorAutoRebuildThreshold: 512
22/11/01 12:32:29,139 DEBUG [main] PlatformDependent: org.jctools-core.MpscChunkedArrayQueue: available
22/11/01 12:32:29,162 DEBUG [main] ResourceLeakDetector: -Dio.netty.leakDetection.level: simple
22/11/01 12:32:29,162 DEBUG [main] ResourceLeakDetector: -Dio.netty.leakDetection.targetRecords: 4
22/11/01 12:32:29,163 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.numHeapArenas: 2
22/11/01 12:32:29,163 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.numDirectArenas: 2
22/11/01 12:32:29,164 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.pageSize: 8192
22/11/01 12:32:29,164 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.maxOrder: 9
22/11/01 12:32:29,164 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.chunkSize: 4194304
22/11/01 12:32:29,164 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.smallCacheSize: 256
22/11/01 12:32:29,164 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.normalCacheSize: 64
22/11/01 12:32:29,164 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.maxCachedBufferCapacity: 32768
22/11/01 12:32:29,164 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.cacheTrimInterval: 8192
22/11/01 12:32:29,164 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.cacheTrimIntervalMillis: 0
22/11/01 12:32:29,164 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.useCacheForAllThreads: false
22/11/01 12:32:29,164 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.maxCachedByteBuffersPerChunk: 1023
22/11/01 12:32:29,193 DEBUG [main] DefaultChannelId: -Dio.netty.processId: 16 (auto-detected)
22/11/01 12:32:29,194 DEBUG [main] NetUtil: -Djava.net.preferIPv4Stack: false
22/11/01 12:32:29,195 DEBUG [main] NetUtil: -Djava.net.preferIPv6Addresses: false
22/11/01 12:32:29,196 DEBUG [main] NetUtilInitializations: Loopback interface: lo (lo, 127.0.0.1)
22/11/01 12:32:29,196 DEBUG [main] NetUtil: /proc/sys/net/core/somaxconn: 4096
22/11/01 12:32:29,197 DEBUG [main] DefaultChannelId: -Dio.netty.machineId: 4e:98:e4:ff:fe:6d:63:c9 (auto-detected)
22/11/01 12:32:29,206 DEBUG [main] ByteBufUtil: -Dio.netty.allocator.type: pooled
22/11/01 12:32:29,206 DEBUG [main] ByteBufUtil: -Dio.netty.threadLocalDirectBufferSize: 0
22/11/01 12:32:29,206 DEBUG [main] ByteBufUtil: -Dio.netty.maxThreadLocalCharBufferSize: 16384
22/11/01 12:32:29,221 DEBUG [main] TransportServer: Shuffle server started on port: 42007
22/11/01 12:32:29,223 INFO [main] Utils: Successfully started service 'WorkerSys' on port 42007.
22/11/01 12:32:29,332 ERROR [main] DeviceMonitor: Device monitor init failed.
java.lang.NullPointerException
        at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.$anonfun$getDeviceAndMountInfos$4(DeviceInfo.scala:123)
        at scala.collection.Iterator.foreach(Iterator.scala:941)
        at scala.collection.Iterator.foreach$(Iterator.scala:941)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
        at scala.collection.IterableLike.foreach(IterableLike.scala:74)
        at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
        at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.getDeviceAndMountInfos(DeviceInfo.scala:120)
        at com.aliyun.emr.rss.service.deploy.worker.LocalDeviceMonitor.init(DeviceMonitor.scala:174)
        at com.aliyun.emr.rss.service.deploy.worker.DeviceMonitor$.createDeviceMonitor(DeviceMonitor.scala:266)
        at com.aliyun.emr.rss.service.deploy.worker.LocalStorageManager.<init>(LocalStorageManager.scala:223)
        at com.aliyun.emr.rss.service.deploy.worker.Worker.<init>(Worker.scala:71)
        at com.aliyun.emr.rss.service.deploy.worker.Worker$.main(Worker.scala:380)
        at com.aliyun.emr.rss.service.deploy.worker.Worker.main(Worker.scala)
Exception in thread "main" java.lang.NullPointerException
        at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.$anonfun$getDeviceAndMountInfos$4(DeviceInfo.scala:123)
        at scala.collection.Iterator.foreach(Iterator.scala:941)
        at scala.collection.Iterator.foreach$(Iterator.scala:941)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
        at scala.collection.IterableLike.foreach(IterableLike.scala:74)
        at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
        at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.getDeviceAndMountInfos(DeviceInfo.scala:120)
        at com.aliyun.emr.rss.service.deploy.worker.LocalDeviceMonitor.init(DeviceMonitor.scala:174)
        at com.aliyun.emr.rss.service.deploy.worker.DeviceMonitor$.createDeviceMonitor(DeviceMonitor.scala:266)
        at com.aliyun.emr.rss.service.deploy.worker.LocalStorageManager.<init>(LocalStorageManager.scala:223)
        at com.aliyun.emr.rss.service.deploy.worker.Worker.<init>(Worker.scala:71)
        at com.aliyun.emr.rss.service.deploy.worker.Worker$.main(Worker.scala:380)
        at com.aliyun.emr.rss.service.deploy.worker.Worker.main(Worker.scala)

I set rss.device.monitor.enabled false; both settings still produce this error.

FMX commented 2 years ago

@jiangbiao910 I think you can check your rss configuration file first. If you set rss.device.monitor.enabled = false, the device monitor will not initialize.
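
For example, one quick way to confirm which configuration the worker pod actually mounts (the pod name, namespace, and conf path are taken from the logs and Helm manifest elsewhere in this thread; adjust them to your deployment):

kubectl exec -n rss celeborn-worker-0 -- grep -R "rss.device.monitor.enabled" /opt/rss-0.1.3-bin-release/conf/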

FMX commented 2 years ago

@jiangbiao910 If you have a DingTalk account, you can connect with us by joining the group below.

(DingTalk group QR code)

bigdata-spec commented 2 years ago

@jiangbiao910 I think you can check your rss configuration file first. If you set rss.device.monitor.enabled = false, the device monitor will not initialize.

(screenshot of my configuration attached)

But it still fails with the same error.

FMX commented 2 years ago

I think you might have misconfigured something. Can you share your complete worker log?

bigdata-spec commented 2 years ago

I think you might have misconfigured something. Can you share your complete worker log?

[root@celeborn-worker-0 logs]# tail -100f rss--com.aliyun.emr.rss.service.deploy.worker.Worker-1-celeborn-worker-0.out
Using Spark's default log4j profile: log4j-defaults.properties
22/11/01 13:51:37,075 INFO [main] Dispatcher: Dispatcher numThreads: 64
22/11/01 13:51:37,123 DEBUG [main] InternalLoggerFactory: Using SLF4J as the default logging framework
22/11/01 13:51:37,124 DEBUG [main] InternalThreadLocalMap: -Dio.netty.threadLocalMap.stringBuilder.initialSize: 1024
22/11/01 13:51:37,124 DEBUG [main] InternalThreadLocalMap: -Dio.netty.threadLocalMap.stringBuilder.maxSize: 4096
22/11/01 13:51:37,131 INFO [main] TransportClientFactory: mode NIO threads 64
22/11/01 13:51:37,140 DEBUG [main] MultithreadEventLoopGroup: -Dio.netty.eventLoopThreads: 2
22/11/01 13:51:37,183 DEBUG [main] PlatformDependent0: -Dio.netty.noUnsafe: false
22/11/01 13:51:37,183 DEBUG [main] PlatformDependent0: Java version: 8
22/11/01 13:51:37,183 DEBUG [main] PlatformDependent0: sun.misc.Unsafe.theUnsafe: available
22/11/01 13:51:37,183 DEBUG [main] PlatformDependent0: sun.misc.Unsafe.copyMemory: available
22/11/01 13:51:37,184 DEBUG [main] PlatformDependent0: sun.misc.Unsafe.storeFence: available
22/11/01 13:51:37,184 DEBUG [main] PlatformDependent0: java.nio.Buffer.address: available
22/11/01 13:51:37,184 DEBUG [main] PlatformDependent0: direct buffer constructor: available
22/11/01 13:51:37,185 DEBUG [main] PlatformDependent0: java.nio.Bits.unaligned: available, true
22/11/01 13:51:37,185 DEBUG [main] PlatformDependent0: jdk.internal.misc.Unsafe.allocateUninitializedArray(int): unavailable prior to Java9
22/11/01 13:51:37,185 DEBUG [main] PlatformDependent0: java.nio.DirectByteBuffer.<init>(long, int): available
22/11/01 13:51:37,185 DEBUG [main] PlatformDependent: sun.misc.Unsafe: available
22/11/01 13:51:37,185 DEBUG [main] PlatformDependent: -Dio.netty.tmpdir: /tmp (java.io.tmpdir)
22/11/01 13:51:37,185 DEBUG [main] PlatformDependent: -Dio.netty.bitMode: 64 (sun.arch.data.model)
22/11/01 13:51:37,186 DEBUG [main] PlatformDependent: -Dio.netty.maxDirectMemory: 1073741824 bytes
22/11/01 13:51:37,186 DEBUG [main] PlatformDependent: -Dio.netty.uninitializedArrayAllocationThreshold: -1
22/11/01 13:51:37,186 DEBUG [main] CleanerJava6: java.nio.ByteBuffer.cleaner(): available
22/11/01 13:51:37,187 DEBUG [main] PlatformDependent: -Dio.netty.noPreferDirect: false
22/11/01 13:51:37,188 DEBUG [main] NioEventLoop: -Dio.netty.noKeySetOptimization: false
22/11/01 13:51:37,188 DEBUG [main] NioEventLoop: -Dio.netty.selectorAutoRebuildThreshold: 512
22/11/01 13:51:37,191 DEBUG [main] PlatformDependent: org.jctools-core.MpscChunkedArrayQueue: available
22/11/01 13:51:37,204 DEBUG [main] ResourceLeakDetector: -Dio.netty.leakDetection.level: simple
22/11/01 13:51:37,204 DEBUG [main] ResourceLeakDetector: -Dio.netty.leakDetection.targetRecords: 4
22/11/01 13:51:37,206 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.numHeapArenas: 2
22/11/01 13:51:37,206 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.numDirectArenas: 2
22/11/01 13:51:37,206 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.pageSize: 8192
22/11/01 13:51:37,207 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.maxOrder: 9
22/11/01 13:51:37,208 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.chunkSize: 4194304
22/11/01 13:51:37,209 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.smallCacheSize: 256
22/11/01 13:51:37,209 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.normalCacheSize: 64
22/11/01 13:51:37,209 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.maxCachedBufferCapacity: 32768
22/11/01 13:51:37,209 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.cacheTrimInterval: 8192
22/11/01 13:51:37,212 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.cacheTrimIntervalMillis: 0
22/11/01 13:51:37,212 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.useCacheForAllThreads: false
22/11/01 13:51:37,212 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.maxCachedByteBuffersPerChunk: 1023
22/11/01 13:51:37,247 DEBUG [main] DefaultChannelId: -Dio.netty.processId: 17 (auto-detected)
22/11/01 13:51:37,248 DEBUG [main] NetUtil: -Djava.net.preferIPv4Stack: false
22/11/01 13:51:37,249 DEBUG [main] NetUtil: -Djava.net.preferIPv6Addresses: false
22/11/01 13:51:37,250 DEBUG [main] NetUtilInitializations: Loopback interface: lo (lo, 127.0.0.1)
22/11/01 13:51:37,250 DEBUG [main] NetUtil: /proc/sys/net/core/somaxconn: 4096
22/11/01 13:51:37,251 DEBUG [main] DefaultChannelId: -Dio.netty.machineId: 1e:df:df:ff:fe:aa:fd:e2 (auto-detected)
22/11/01 13:51:37,261 DEBUG [main] ByteBufUtil: -Dio.netty.allocator.type: pooled
22/11/01 13:51:37,261 DEBUG [main] ByteBufUtil: -Dio.netty.threadLocalDirectBufferSize: 0
22/11/01 13:51:37,261 DEBUG [main] ByteBufUtil: -Dio.netty.maxThreadLocalCharBufferSize: 16384
22/11/01 13:51:37,282 DEBUG [main] TransportServer: Shuffle server started on port: 41987
22/11/01 13:51:37,288 INFO [main] Utils: Successfully started service 'WorkerSys' on port 41987.
22/11/01 13:51:37,469 ERROR [main] DeviceMonitor: Device monitor init failed.
java.lang.NullPointerException
        at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.$anonfun$getDeviceAndMountInfos$4(DeviceInfo.scala:123)
        at scala.collection.Iterator.foreach(Iterator.scala:941)
        at scala.collection.Iterator.foreach$(Iterator.scala:941)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
        at scala.collection.IterableLike.foreach(IterableLike.scala:74)
        at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
        at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.getDeviceAndMountInfos(DeviceInfo.scala:120)
        at com.aliyun.emr.rss.service.deploy.worker.LocalDeviceMonitor.init(DeviceMonitor.scala:174)
        at com.aliyun.emr.rss.service.deploy.worker.DeviceMonitor$.createDeviceMonitor(DeviceMonitor.scala:266)
        at com.aliyun.emr.rss.service.deploy.worker.LocalStorageManager.<init>(LocalStorageManager.scala:223)
        at com.aliyun.emr.rss.service.deploy.worker.Worker.<init>(Worker.scala:71)
        at com.aliyun.emr.rss.service.deploy.worker.Worker$.main(Worker.scala:380)
        at com.aliyun.emr.rss.service.deploy.worker.Worker.main(Worker.scala)
Exception in thread "main" java.lang.NullPointerException
        at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.$anonfun$getDeviceAndMountInfos$4(DeviceInfo.scala:123)
        at scala.collection.Iterator.foreach(Iterator.scala:941)
        at scala.collection.Iterator.foreach$(Iterator.scala:941)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
        at scala.collection.IterableLike.foreach(IterableLike.scala:74)
        at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
        at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.getDeviceAndMountInfos(DeviceInfo.scala:120)
        at com.aliyun.emr.rss.service.deploy.worker.LocalDeviceMonitor.init(DeviceMonitor.scala:174)
        at com.aliyun.emr.rss.service.deploy.worker.DeviceMonitor$.createDeviceMonitor(DeviceMonitor.scala:266)
        at com.aliyun.emr.rss.service.deploy.worker.LocalStorageManager.<init>(LocalStorageManager.scala:223)
        at com.aliyun.emr.rss.service.deploy.worker.Worker.<init>(Worker.scala:71)
        at com.aliyun.emr.rss.service.deploy.worker.Worker$.main(Worker.scala:380)
        at com.aliyun.emr.rss.service.deploy.worker.Worker.main(Worker.scala)
command terminated with exit code 137

bigdata-spec commented 2 years ago

[root@k8s01 docker]# helm install celeborn-helm helm1 -n rss --dry-run
NAME: celeborn-helm
LAST DEPLOYED: Tue Nov  1 14:37:15 2022
NAMESPACE: rss
STATUS: pending-install
REVISION: 1
TEST SUITE: None
HOOKS:
MANIFEST:
---
# Source: celeborn/templates/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: celeborn-conf
  labels:
    helm.sh/chart: celeborn-0.1.1
    app.kubernetes.io/instance: celeborn-helm
    app.kubernetes.io/version: "1.16.0"
    app.kubernetes.io/managed-by: Helm
data:
  celeborn-defaults.conf: |-
    rss.master.address celeborn-master-0.celeborn-master-svc.rss.svc.cluster.local:9097
    rss.metrics.system.enabled true
    rss.worker.flush.buffer.size 256k
    rss.worker.flush.queue.capacity 4096
    rss.worker.base.dirs /data/datarss1,/data/datarss2
    # If your hosts have disk raid or use lvm, set rss.device.monitor.enabled to false
    rss.device.monitor.enabled false

  rss-env.sh: |
    CELEBORN_MASTER_JAVA_OPTS="-XX:-PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:gc-master.out -Dio.netty.leakDetectionLevel=advanced"
    CELEBORN_MASTER_MEMORY="2g"
    CELEBORN_NO_DAEMONIZE="yes"
    CELEBORN_WORKER_JAVA_OPTS="-XX:-PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:gc-worker.out -Dio.netty.leakDetectionLevel=advanced"
    CELEBORN_WORKER_MEMORY="2g"
    CELEBORN_WORKER_OFFHEAP_MEMORY="12g"
    TZ="Asia/Shanghai"

  log4j-defaults.properties: |-
    # Set everything to be logged to the console
    log4j.rootCategory=DEBUG, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{1}: %m%n

#  metrics.properties: >-
#    *.sink.prometheusServlet.class=org.apache.celeborn.common.metrics.sink.PrometheusServlet
---
# Source: celeborn/templates/master-service.yaml
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

apiVersion: v1
kind: Service
metadata:
  name: celeborn-master-svc
  labels:
    helm.sh/chart: celeborn-0.1.1
    app.kubernetes.io/instance: celeborn-helm
    app.kubernetes.io/version: "1.16.0"
    app.kubernetes.io/managed-by: Helm
spec:
  type: ClusterIP
  ports:
    - port: 9097
      targetPort: 9097
      protocol: TCP
      name: celeborn-master
  clusterIP: None
  selector:
    app.kubernetes.io/instance: celeborn-helm
    app.kubernetes.io/name: celeborn-master
    app.kubernetes.io/role: master
---
# Source: celeborn/templates/worker-service.yaml
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

apiVersion: v1
kind: Service
metadata:
  name: celeborn-worker-svc
  labels:
    helm.sh/chart: celeborn-0.1.1
    app.kubernetes.io/instance: celeborn-helm
    app.kubernetes.io/version: "1.16.0"
    app.kubernetes.io/managed-by: Helm
spec:
  type: ClusterIP
  clusterIP: None
  selector:
    app.kubernetes.io/instance: celeborn-helm
    app.kubernetes.io/name: celeborn-worker
    app.kubernetes.io/role: worker
---
# Source: celeborn/templates/master-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: celeborn-master
  labels:
    app.kubernetes.io/name: celeborn-master
    app.kubernetes.io/role: master
    helm.sh/chart: celeborn-0.1.1
    app.kubernetes.io/instance: celeborn-helm
    app.kubernetes.io/version: "1.16.0"
    app.kubernetes.io/managed-by: Helm
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: celeborn-master
      app.kubernetes.io/role: master
      app.kubernetes.io/instance: celeborn-helm
  serviceName: celeborn-master-svc
  replicas: 1
  template:
    metadata:
      labels:
        app.kubernetes.io/name: celeborn-master
        app.kubernetes.io/role: master
        app.kubernetes.io/instance: celeborn-helm
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                - celeborn-master
            topologyKey: kubernetes.io/hostname
      containers:
      - name: celeborn
        image: "remote-shuffle-service:0.1.1-6badd20"
        imagePullPolicy: IfNotPresent
        command:
          - "/usr/bin/tini"
          - "--"
          - "/bin/sh"
          - '-c'
          - "/opt/rss-0.1.3-bin-release/sbin/start-master.sh && sleep 200"
        resources:
            null
        ports:
          - containerPort: 9097
          - containerPort: 9098
            name: metrics
            protocol: TCP
        volumeMounts:
          - mountPath: /opt/rss-0.1.3-bin-release/conf
            name: celeborn-helm-volume
            readOnly: true
        env:
          - name: CELEBORN_MASTER_JAVA_OPTS
            value: "-XX:-PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:gc-master.out -Dio.netty.leakDetectionLevel=advanced"
          - name: CELEBORN_MASTER_MEMORY
            value: "2g"
          - name: CELEBORN_NO_DAEMONIZE
            value: "yes"
          - name: CELEBORN_WORKER_JAVA_OPTS
            value: "-XX:-PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:gc-worker.out -Dio.netty.leakDetectionLevel=advanced"
          - name: CELEBORN_WORKER_MEMORY
            value: "2g"
          - name: CELEBORN_WORKER_OFFHEAP_MEMORY
            value: "12g"
          - name: TZ
            value: "Asia/Shanghai"
      terminationGracePeriodSeconds: 30
      volumes:
        - configMap:
            name: celeborn-conf
          name: celeborn-helm-volume
---
# Source: celeborn/templates/worker-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: celeborn-worker
  labels:
    app.kubernetes.io/name: celeborn-worker
    app.kubernetes.io/role: worker
    helm.sh/chart: celeborn-0.1.1
    app.kubernetes.io/instance: celeborn-helm
    app.kubernetes.io/version: "1.16.0"
    app.kubernetes.io/managed-by: Helm
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: celeborn-worker
      app.kubernetes.io/role: worker
      app.kubernetes.io/instance: celeborn-helm
  serviceName: celeborn-worker
  replicas: 2
  template:
    metadata:
      labels:
        app.kubernetes.io/name: celeborn-worker
        app.kubernetes.io/role: worker
        app.kubernetes.io/instance: celeborn-helm
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                - celeborn-worker
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: celeborn
        image: "remote-shuffle-service:0.1.1-6badd20"
        imagePullPolicy: IfNotPresent
        command:
          - "/usr/bin/tini"
          - "--"
          - "/bin/sh"
          - '-c'
          - "/opt/rss-0.1.3-bin-release/sbin/start-worker.sh && sleep 200"
        resources:
            null
        ports:
          - containerPort: 9098
            name: metrics
            protocol: TCP
        volumeMounts:
          - mountPath: /opt/rss-0.1.3-bin-release/conf
            name: celeborn-helm-volume
            readOnly: true
          - mountPath: /data/datarss1
            name: vol-0
          - mountPath: /data/datarss2
            name: vol-1
        env:
          - name: CELEBORN_MASTER_JAVA_OPTS
            value: "-XX:-PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:gc-master.out -Dio.netty.leakDetectionLevel=advanced"
          - name: CELEBORN_MASTER_MEMORY
            value: "2g"
          - name: CELEBORN_NO_DAEMONIZE
            value: "yes"
          - name: CELEBORN_WORKER_JAVA_OPTS
            value: "-XX:-PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:gc-worker.out -Dio.netty.leakDetectionLevel=advanced"
          - name: CELEBORN_WORKER_MEMORY
            value: "2g"
          - name: CELEBORN_WORKER_OFFHEAP_MEMORY
            value: "12g"
          - name: TZ
            value: "Asia/Shanghai"
      terminationGracePeriodSeconds: 30
      volumes:
        - configMap:
            name: celeborn-conf
          name: celeborn-helm-volume
        - hostPath:
            path: /data/datarss1/worker
            type: DirectoryOrCreate
          name: vol-0
        - hostPath:
            path: /data/datarss2/worker
            type: DirectoryOrCreate
          name: vol-1

NOTES:
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

Celeborn
[root@k8s01 docker]#

FMX commented 2 years ago

@jiangbiao910 If you want to use the Helm chart, check out the main branch at commit hash 7a858bd7. The Helm chart is configured against branch 0.1.
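
A hedged sketch of that workflow, assuming a local git clone of the repository and reusing the release name, chart directory, and namespace from the helm install command earlier in this thread:

git checkout 7a858bd7                               # main-branch commit mentioned above
helm install celeborn-helm helm1 -n rss --dry-run   # render the manifests first as a sanity check
helm install celeborn-helm helm1 -n rss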