Closed: kettlelinna closed this issue 2 years ago
Some information about my environment that may help:
By the way, when I limit the information below, the worker works.
@kettlelinna
You can disable the disk monitor by setting rss.device.monitor.enabled=false.
We'll fix this later.
@FMX When I set rss.device.monitor.enabled=false, I still get this error:
22/11/01 10:17:00,788 ERROR [main] DeviceMonitor: Device monitor init failed.
java.lang.NullPointerException
at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.$anonfun$getDeviceAndMountInfos$4(DeviceInfo.scala:123)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.getDeviceAndMountInfos(DeviceInfo.scala:120)
at com.aliyun.emr.rss.service.deploy.worker.LocalDeviceMonitor.init(DeviceMonitor.scala:174)
at com.aliyun.emr.rss.service.deploy.worker.DeviceMonitor$.createDeviceMonitor(DeviceMonitor.scala:266)
at com.aliyun.emr.rss.service.deploy.worker.LocalStorageManager.<init>(LocalStorageManager.scala:221)
at com.aliyun.emr.rss.service.deploy.worker.Worker.<init>(Worker.scala:80)
at com.aliyun.emr.rss.service.deploy.worker.Worker$.main(Worker.scala:994)
at com.aliyun.emr.rss.service.deploy.worker.Worker.main(Worker.scala)
Exception in thread "main" java.lang.NullPointerException
at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.$anonfun$getDeviceAndMountInfos$4(DeviceInfo.scala:123)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.getDeviceAndMountInfos(DeviceInfo.scala:120)
at com.aliyun.emr.rss.service.deploy.worker.LocalDeviceMonitor.init(DeviceMonitor.scala:174)
at com.aliyun.emr.rss.service.deploy.worker.DeviceMonitor$.createDeviceMonitor(DeviceMonitor.scala:266)
at com.aliyun.emr.rss.service.deploy.worker.LocalStorageManager.<init>(LocalStorageManager.scala:221)
at com.aliyun.emr.rss.service.deploy.worker.Worker.<init>(Worker.scala:80)
at com.aliyun.emr.rss.service.deploy.worker.Worker$.main(Worker.scala:994)
at com.aliyun.emr.rss.service.deploy.worker.Worker.main(Worker.scala)
version 0.1.1
@jiangbiao910 Hi, I think you can try version v0.1.3. Version v0.1.1 was released on July 22.
It runs properly on a virtual machine; this error only occurs in the k8s deployment. Thanks, I will try 0.1.3 immediately.
@FMX Hi, I built the code with mvn on Windows and ran it on CentOS:
/root/helm-celeborn/celeborn-main/docker/rss-0.1.3-bin-release/conf/rss-env.sh: line 2: $'\r': command not found
/root/helm-celeborn/celeborn-main/docker/rss-0.1.3-bin-release/conf/rss-env.sh: line 10: $'\r': command not found
Running dos2unix on the affected files may work. Is there any other way?
@jiangbiao910 I think you can use the pre-built release package or package it on Linux.
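If rebuilding on Linux is not convenient, the carriage returns can also be stripped in place without dos2unix. This is only a minimal sketch, assuming GNU sed is available on the CentOS host; strip_cr and has_cr are hypothetical helper names, not part of RSS.

```shell
# strip_cr FILE...: remove a trailing carriage return from every line, in place.
# Assumes GNU sed (-i without a suffix argument); strip_cr is a hypothetical helper.
strip_cr() {
  sed -i 's/\r$//' "$@"
}

# has_cr FILE: succeed if FILE still contains at least one carriage return.
has_cr() {
  grep -q "$(printf '\r')" "$1"
}
```

For example, strip_cr could be run on conf/rss-env.sh and on the scripts under sbin of the release directory, followed by has_cr on each file to confirm that nothing is left.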
When I try 0.1.3 on k8s:
root@celeborn-worker-0:/opt/celeborn/logs# tail -100f rss--com.aliyun.emr.rss.service.deploy.worker.Worker-1-celeborn-worker-0.out
Using Spark's default log4j profile: log4j-defaults.properties
22/11/01 12:32:29,081 INFO [main] Dispatcher: Dispatcher numThreads: 64
22/11/01 12:32:29,095 DEBUG [main] InternalLoggerFactory: Using SLF4J as the default logging framework
22/11/01 12:32:29,096 DEBUG [main] InternalThreadLocalMap: -Dio.netty.threadLocalMap.stringBuilder.initialSize: 1024
22/11/01 12:32:29,096 DEBUG [main] InternalThreadLocalMap: -Dio.netty.threadLocalMap.stringBuilder.maxSize: 4096
22/11/01 12:32:29,103 INFO [main] TransportClientFactory: mode NIO threads 64
22/11/01 12:32:29,105 DEBUG [main] MultithreadEventLoopGroup: -Dio.netty.eventLoopThreads: 2
22/11/01 12:32:29,130 DEBUG [main] PlatformDependent0: -Dio.netty.noUnsafe: false
22/11/01 12:32:29,130 DEBUG [main] PlatformDependent0: Java version: 8
22/11/01 12:32:29,131 DEBUG [main] PlatformDependent0: sun.misc.Unsafe.theUnsafe: available
22/11/01 12:32:29,131 DEBUG [main] PlatformDependent0: sun.misc.Unsafe.copyMemory: available
22/11/01 12:32:29,131 DEBUG [main] PlatformDependent0: sun.misc.Unsafe.storeFence: available
22/11/01 12:32:29,132 DEBUG [main] PlatformDependent0: java.nio.Buffer.address: available
22/11/01 12:32:29,132 DEBUG [main] PlatformDependent0: direct buffer constructor: available
22/11/01 12:32:29,132 DEBUG [main] PlatformDependent0: java.nio.Bits.unaligned: available, true
22/11/01 12:32:29,132 DEBUG [main] PlatformDependent0: jdk.internal.misc.Unsafe.allocateUninitializedArray(int): unavailable prior to Java9
22/11/01 12:32:29,132 DEBUG [main] PlatformDependent0: java.nio.DirectByteBuffer.<init>(long, int): available
22/11/01 12:32:29,132 DEBUG [main] PlatformDependent: sun.misc.Unsafe: available
22/11/01 12:32:29,133 DEBUG [main] PlatformDependent: -Dio.netty.tmpdir: /tmp (java.io.tmpdir)
22/11/01 12:32:29,133 DEBUG [main] PlatformDependent: -Dio.netty.bitMode: 64 (sun.arch.data.model)
22/11/01 12:32:29,134 DEBUG [main] PlatformDependent: -Dio.netty.maxDirectMemory: 1073741824 bytes
22/11/01 12:32:29,134 DEBUG [main] PlatformDependent: -Dio.netty.uninitializedArrayAllocationThreshold: -1
22/11/01 12:32:29,134 DEBUG [main] CleanerJava6: java.nio.ByteBuffer.cleaner(): available
22/11/01 12:32:29,134 DEBUG [main] PlatformDependent: -Dio.netty.noPreferDirect: false
22/11/01 12:32:29,135 DEBUG [main] NioEventLoop: -Dio.netty.noKeySetOptimization: false
22/11/01 12:32:29,135 DEBUG [main] NioEventLoop: -Dio.netty.selectorAutoRebuildThreshold: 512
22/11/01 12:32:29,139 DEBUG [main] PlatformDependent: org.jctools-core.MpscChunkedArrayQueue: available
22/11/01 12:32:29,162 DEBUG [main] ResourceLeakDetector: -Dio.netty.leakDetection.level: simple
22/11/01 12:32:29,162 DEBUG [main] ResourceLeakDetector: -Dio.netty.leakDetection.targetRecords: 4
22/11/01 12:32:29,163 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.numHeapArenas: 2
22/11/01 12:32:29,163 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.numDirectArenas: 2
22/11/01 12:32:29,164 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.pageSize: 8192
22/11/01 12:32:29,164 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.maxOrder: 9
22/11/01 12:32:29,164 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.chunkSize: 4194304
22/11/01 12:32:29,164 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.smallCacheSize: 256
22/11/01 12:32:29,164 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.normalCacheSize: 64
22/11/01 12:32:29,164 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.maxCachedBufferCapacity: 32768
22/11/01 12:32:29,164 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.cacheTrimInterval: 8192
22/11/01 12:32:29,164 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.cacheTrimIntervalMillis: 0
22/11/01 12:32:29,164 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.useCacheForAllThreads: false
22/11/01 12:32:29,164 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.maxCachedByteBuffersPerChunk: 1023
22/11/01 12:32:29,193 DEBUG [main] DefaultChannelId: -Dio.netty.processId: 16 (auto-detected)
22/11/01 12:32:29,194 DEBUG [main] NetUtil: -Djava.net.preferIPv4Stack: false
22/11/01 12:32:29,195 DEBUG [main] NetUtil: -Djava.net.preferIPv6Addresses: false
22/11/01 12:32:29,196 DEBUG [main] NetUtilInitializations: Loopback interface: lo (lo, 127.0.0.1)
22/11/01 12:32:29,196 DEBUG [main] NetUtil: /proc/sys/net/core/somaxconn: 4096
22/11/01 12:32:29,197 DEBUG [main] DefaultChannelId: -Dio.netty.machineId: 4e:98:e4:ff:fe:6d:63:c9 (auto-detected)
22/11/01 12:32:29,206 DEBUG [main] ByteBufUtil: -Dio.netty.allocator.type: pooled
22/11/01 12:32:29,206 DEBUG [main] ByteBufUtil: -Dio.netty.threadLocalDirectBufferSize: 0
22/11/01 12:32:29,206 DEBUG [main] ByteBufUtil: -Dio.netty.maxThreadLocalCharBufferSize: 16384
22/11/01 12:32:29,221 DEBUG [main] TransportServer: Shuffle server started on port: 42007
22/11/01 12:32:29,223 INFO [main] Utils: Successfully started service 'WorkerSys' on port 42007.
22/11/01 12:32:29,332 ERROR [main] DeviceMonitor: Device monitor init failed.
java.lang.NullPointerException
at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.$anonfun$getDeviceAndMountInfos$4(DeviceInfo.scala:123)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.getDeviceAndMountInfos(DeviceInfo.scala:120)
at com.aliyun.emr.rss.service.deploy.worker.LocalDeviceMonitor.init(DeviceMonitor.scala:174)
at com.aliyun.emr.rss.service.deploy.worker.DeviceMonitor$.createDeviceMonitor(DeviceMonitor.scala:266)
at com.aliyun.emr.rss.service.deploy.worker.LocalStorageManager.<init>(LocalStorageManager.scala:223)
at com.aliyun.emr.rss.service.deploy.worker.Worker.<init>(Worker.scala:71)
at com.aliyun.emr.rss.service.deploy.worker.Worker$.main(Worker.scala:380)
at com.aliyun.emr.rss.service.deploy.worker.Worker.main(Worker.scala)
Exception in thread "main" java.lang.NullPointerException
at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.$anonfun$getDeviceAndMountInfos$4(DeviceInfo.scala:123)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.getDeviceAndMountInfos(DeviceInfo.scala:120)
at com.aliyun.emr.rss.service.deploy.worker.LocalDeviceMonitor.init(DeviceMonitor.scala:174)
at com.aliyun.emr.rss.service.deploy.worker.DeviceMonitor$.createDeviceMonitor(DeviceMonitor.scala:266)
at com.aliyun.emr.rss.service.deploy.worker.LocalStorageManager.<init>(LocalStorageManager.scala:223)
at com.aliyun.emr.rss.service.deploy.worker.Worker.<init>(Worker.scala:71)
at com.aliyun.emr.rss.service.deploy.worker.Worker$.main(Worker.scala:380)
at com.aliyun.emr.rss.service.deploy.worker.Worker.main(Worker.scala)
I tried both rss.device.monitor.enabled=false and rss.device.monitor.enabled false; both settings produce this error.
@jiangbiao910 I think you can check your rss configuration file first. If you set rss.device.monitor.enabled = false, the device monitor will not initialize.
@jiangbiao910 If you have a DingTalk account, you can connect with us by joining the group below.
But it still reports the error.
I think you might have misconfigured something. Can you share your complete worker log?
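One way to check for a misconfiguration is to read the value straight from the file inside the worker pod. The following is a rough sketch only; conf_value is a hypothetical helper, and it assumes the configuration file holds one key per line in "key value", "key=value" or "key = value" form.

```shell
# conf_value FILE KEY: print the last value assigned to KEY in FILE.
# Hypothetical helper; prints only the first token of the value.
conf_value() {
  awk -v key="$2" '
    { sub(/\r$/, ""); sub(/^[ \t]+/, "") }   # drop a trailing CR and leading indentation
    {
      n = split($0, f, /[ \t=]+/)            # separators: spaces, tabs, equals signs
      if (n >= 2 && f[1] == key) val = f[2]
    }
    END { if (val != "") print val }
  ' "$1"
}
```

Running conf_value against each file in the conf directory of the release with rss.device.monitor.enabled as the key shows which files define it and with what value. Which file name the worker actually loads is not confirmed in this thread, so an empty result for the expected file would be worth reporting along with the log.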
[root@celeborn-worker-0 logs]# tail -100f rss--com.aliyun.emr.rss.service.deploy.worker.Worker-1-celeborn-worker-0.out
Using Spark's default log4j profile: log4j-defaults.properties
22/11/01 13:51:37,075 INFO [main] Dispatcher: Dispatcher numThreads: 64
22/11/01 13:51:37,123 DEBUG [main] InternalLoggerFactory: Using SLF4J as the default logging framework
22/11/01 13:51:37,124 DEBUG [main] InternalThreadLocalMap: -Dio.netty.threadLocalMap.stringBuilder.initialSize: 1024
22/11/01 13:51:37,124 DEBUG [main] InternalThreadLocalMap: -Dio.netty.threadLocalMap.stringBuilder.maxSize: 4096
22/11/01 13:51:37,131 INFO [main] TransportClientFactory: mode NIO threads 64
22/11/01 13:51:37,140 DEBUG [main] MultithreadEventLoopGroup: -Dio.netty.eventLoopThreads: 2
22/11/01 13:51:37,183 DEBUG [main] PlatformDependent0: -Dio.netty.noUnsafe: false
22/11/01 13:51:37,183 DEBUG [main] PlatformDependent0: Java version: 8
22/11/01 13:51:37,183 DEBUG [main] PlatformDependent0: sun.misc.Unsafe.theUnsafe: available
22/11/01 13:51:37,183 DEBUG [main] PlatformDependent0: sun.misc.Unsafe.copyMemory: available
22/11/01 13:51:37,184 DEBUG [main] PlatformDependent0: sun.misc.Unsafe.storeFence: available
22/11/01 13:51:37,184 DEBUG [main] PlatformDependent0: java.nio.Buffer.address: available
22/11/01 13:51:37,184 DEBUG [main] PlatformDependent0: direct buffer constructor: available
22/11/01 13:51:37,185 DEBUG [main] PlatformDependent0: java.nio.Bits.unaligned: available, true
22/11/01 13:51:37,185 DEBUG [main] PlatformDependent0: jdk.internal.misc.Unsafe.allocateUninitializedArray(int): unavailable prior to Java9
22/11/01 13:51:37,185 DEBUG [main] PlatformDependent0: java.nio.DirectByteBuffer.<init>(long, int): available
22/11/01 13:51:37,185 DEBUG [main] PlatformDependent: sun.misc.Unsafe: available
22/11/01 13:51:37,185 DEBUG [main] PlatformDependent: -Dio.netty.tmpdir: /tmp (java.io.tmpdir)
22/11/01 13:51:37,185 DEBUG [main] PlatformDependent: -Dio.netty.bitMode: 64 (sun.arch.data.model)
22/11/01 13:51:37,186 DEBUG [main] PlatformDependent: -Dio.netty.maxDirectMemory: 1073741824 bytes
22/11/01 13:51:37,186 DEBUG [main] PlatformDependent: -Dio.netty.uninitializedArrayAllocationThreshold: -1
22/11/01 13:51:37,186 DEBUG [main] CleanerJava6: java.nio.ByteBuffer.cleaner(): available
22/11/01 13:51:37,187 DEBUG [main] PlatformDependent: -Dio.netty.noPreferDirect: false
22/11/01 13:51:37,188 DEBUG [main] NioEventLoop: -Dio.netty.noKeySetOptimization: false
22/11/01 13:51:37,188 DEBUG [main] NioEventLoop: -Dio.netty.selectorAutoRebuildThreshold: 512
22/11/01 13:51:37,191 DEBUG [main] PlatformDependent: org.jctools-core.MpscChunkedArrayQueue: available
22/11/01 13:51:37,204 DEBUG [main] ResourceLeakDetector: -Dio.netty.leakDetection.level: simple
22/11/01 13:51:37,204 DEBUG [main] ResourceLeakDetector: -Dio.netty.leakDetection.targetRecords: 4
22/11/01 13:51:37,206 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.numHeapArenas: 2
22/11/01 13:51:37,206 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.numDirectArenas: 2
22/11/01 13:51:37,206 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.pageSize: 8192
22/11/01 13:51:37,207 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.maxOrder: 9
22/11/01 13:51:37,208 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.chunkSize: 4194304
22/11/01 13:51:37,209 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.smallCacheSize: 256
22/11/01 13:51:37,209 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.normalCacheSize: 64
22/11/01 13:51:37,209 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.maxCachedBufferCapacity: 32768
22/11/01 13:51:37,209 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.cacheTrimInterval: 8192
22/11/01 13:51:37,212 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.cacheTrimIntervalMillis: 0
22/11/01 13:51:37,212 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.useCacheForAllThreads: false
22/11/01 13:51:37,212 DEBUG [main] PooledByteBufAllocator: -Dio.netty.allocator.maxCachedByteBuffersPerChunk: 1023
22/11/01 13:51:37,247 DEBUG [main] DefaultChannelId: -Dio.netty.processId: 17 (auto-detected)
22/11/01 13:51:37,248 DEBUG [main] NetUtil: -Djava.net.preferIPv4Stack: false
22/11/01 13:51:37,249 DEBUG [main] NetUtil: -Djava.net.preferIPv6Addresses: false
22/11/01 13:51:37,250 DEBUG [main] NetUtilInitializations: Loopback interface: lo (lo, 127.0.0.1)
22/11/01 13:51:37,250 DEBUG [main] NetUtil: /proc/sys/net/core/somaxconn: 4096
22/11/01 13:51:37,251 DEBUG [main] DefaultChannelId: -Dio.netty.machineId: 1e:df:df:ff:fe:aa:fd:e2 (auto-detected)
22/11/01 13:51:37,261 DEBUG [main] ByteBufUtil: -Dio.netty.allocator.type: pooled
22/11/01 13:51:37,261 DEBUG [main] ByteBufUtil: -Dio.netty.threadLocalDirectBufferSize: 0
22/11/01 13:51:37,261 DEBUG [main] ByteBufUtil: -Dio.netty.maxThreadLocalCharBufferSize: 16384
22/11/01 13:51:37,282 DEBUG [main] TransportServer: Shuffle server started on port: 41987
22/11/01 13:51:37,288 INFO [main] Utils: Successfully started service 'WorkerSys' on port 41987.
22/11/01 13:51:37,469 ERROR [main] DeviceMonitor: Device monitor init failed.
java.lang.NullPointerException
at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.$anonfun$getDeviceAndMountInfos$4(DeviceInfo.scala:123)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.getDeviceAndMountInfos(DeviceInfo.scala:120)
at com.aliyun.emr.rss.service.deploy.worker.LocalDeviceMonitor.init(DeviceMonitor.scala:174)
at com.aliyun.emr.rss.service.deploy.worker.DeviceMonitor$.createDeviceMonitor(DeviceMonitor.scala:266)
at com.aliyun.emr.rss.service.deploy.worker.LocalStorageManager.<init>(LocalStorageManager.scala:223)
at com.aliyun.emr.rss.service.deploy.worker.Worker.<init>(Worker.scala:71)
at com.aliyun.emr.rss.service.deploy.worker.Worker$.main(Worker.scala:380)
at com.aliyun.emr.rss.service.deploy.worker.Worker.main(Worker.scala)
Exception in thread "main" java.lang.NullPointerException
at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.$anonfun$getDeviceAndMountInfos$4(DeviceInfo.scala:123)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at com.aliyun.emr.rss.service.deploy.worker.DeviceInfo$.getDeviceAndMountInfos(DeviceInfo.scala:120)
at com.aliyun.emr.rss.service.deploy.worker.LocalDeviceMonitor.init(DeviceMonitor.scala:174)
at com.aliyun.emr.rss.service.deploy.worker.DeviceMonitor$.createDeviceMonitor(DeviceMonitor.scala:266)
at com.aliyun.emr.rss.service.deploy.worker.LocalStorageManager.<init>(LocalStorageManager.scala:223)
at com.aliyun.emr.rss.service.deploy.worker.Worker.<init>(Worker.scala:71)
at com.aliyun.emr.rss.service.deploy.worker.Worker$.main(Worker.scala:380)
at com.aliyun.emr.rss.service.deploy.worker.Worker.main(Worker.scala)
command terminated with exit code 137
[root@k8s01 docker]# helm install celeborn-helm helm1 -n rss --dry-run
NAME: celeborn-helm
LAST DEPLOYED: Tue Nov 1 14:37:15 2022
NAMESPACE: rss
STATUS: pending-install
REVISION: 1
TEST SUITE: None
HOOKS:
MANIFEST:
---
# Source: celeborn/templates/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: celeborn-conf
  labels:
    helm.sh/chart: celeborn-0.1.1
    app.kubernetes.io/instance: celeborn-helm
    app.kubernetes.io/version: "1.16.0"
    app.kubernetes.io/managed-by: Helm
data:
  celeborn-defaults.conf: |-
    rss.master.address celeborn-master-0.celeborn-master-svc.rss.svc.cluster.local:9097
    rss.metrics.system.enabled true
    rss.worker.flush.buffer.size 256k
    rss.worker.flush.queue.capacity 4096
    rss.worker.base.dirs /data/datarss1,/data/datarss2
    # If your hosts have disk raid or use lvm, set rss.device.monitor.enabled to false
    rss.device.monitor.enabled false
  rss-env.sh: |
    CELEBORN_MASTER_JAVA_OPTS="-XX:-PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:gc-master.out -Dio.netty.leakDetectionLevel=advanced"
    CELEBORN_MASTER_MEMORY="2g"
    CELEBORN_NO_DAEMONIZE="yes"
    CELEBORN_WORKER_JAVA_OPTS="-XX:-PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:gc-worker.out -Dio.netty.leakDetectionLevel=advanced"
    CELEBORN_WORKER_MEMORY="2g"
    CELEBORN_WORKER_OFFHEAP_MEMORY="12g"
    TZ="Asia/Shanghai"
  log4j-defaults.properties: |-
    # Set everything to be logged to the console
    log4j.rootCategory=DEBUG, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss,SSS} %p [%t] %c{1}: %m%n
# metrics.properties: >-
# *.sink.prometheusServlet.class=org.apache.celeborn.common.metrics.sink.PrometheusServlet
---
# Source: celeborn/templates/master-service.yaml
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
apiVersion: v1
kind: Service
metadata:
  name: celeborn-master-svc
  labels:
    helm.sh/chart: celeborn-0.1.1
    app.kubernetes.io/instance: celeborn-helm
    app.kubernetes.io/version: "1.16.0"
    app.kubernetes.io/managed-by: Helm
spec:
  type: ClusterIP
  ports:
    - port: 9097
      targetPort: 9097
      protocol: TCP
      name: celeborn-master
  clusterIP: None
  selector:
    app.kubernetes.io/instance: celeborn-helm
    app.kubernetes.io/name: celeborn-master
    app.kubernetes.io/role: master
---
# Source: celeborn/templates/worker-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: celeborn-worker-svc
  labels:
    helm.sh/chart: celeborn-0.1.1
    app.kubernetes.io/instance: celeborn-helm
    app.kubernetes.io/version: "1.16.0"
    app.kubernetes.io/managed-by: Helm
spec:
  type: ClusterIP
  clusterIP: None
  selector:
    app.kubernetes.io/instance: celeborn-helm
    app.kubernetes.io/name: celeborn-worker
    app.kubernetes.io/role: worker
---
# Source: celeborn/templates/master-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: celeborn-master
  labels:
    app.kubernetes.io/name: celeborn-master
    app.kubernetes.io/role: master
    helm.sh/chart: celeborn-0.1.1
    app.kubernetes.io/instance: celeborn-helm
    app.kubernetes.io/version: "1.16.0"
    app.kubernetes.io/managed-by: Helm
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: celeborn-master
      app.kubernetes.io/role: master
      app.kubernetes.io/instance: celeborn-helm
  serviceName: celeborn-master-svc
  replicas: 1
  template:
    metadata:
      labels:
        app.kubernetes.io/name: celeborn-master
        app.kubernetes.io/role: master
        app.kubernetes.io/instance: celeborn-helm
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app.kubernetes.io/name
                    operator: In
                    values:
                      - celeborn-master
              topologyKey: kubernetes.io/hostname
      containers:
        - name: celeborn
          image: "remote-shuffle-service:0.1.1-6badd20"
          imagePullPolicy: IfNotPresent
          command:
            - "/usr/bin/tini"
            - "--"
            - "/bin/sh"
            - '-c'
            - "/opt/rss-0.1.3-bin-release/sbin/start-master.sh && sleep 200"
          resources:
            null
          ports:
            - containerPort: 9097
            - containerPort: 9098
              name: metrics
              protocol: TCP
          volumeMounts:
            - mountPath: /opt/rss-0.1.3-bin-release/conf
              name: celeborn-helm-volume
              readOnly: true
          env:
            - name: CELEBORN_MASTER_JAVA_OPTS
              value: "-XX:-PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:gc-master.out -Dio.netty.leakDetectionLevel=advanced"
            - name: CELEBORN_MASTER_MEMORY
              value: "2g"
            - name: CELEBORN_NO_DAEMONIZE
              value: "yes"
            - name: CELEBORN_WORKER_JAVA_OPTS
              value: "-XX:-PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:gc-worker.out -Dio.netty.leakDetectionLevel=advanced"
            - name: CELEBORN_WORKER_MEMORY
              value: "2g"
            - name: CELEBORN_WORKER_OFFHEAP_MEMORY
              value: "12g"
            - name: TZ
              value: "Asia/Shanghai"
      terminationGracePeriodSeconds: 30
      volumes:
        - configMap:
            name: celeborn-conf
          name: celeborn-helm-volume
---
# Source: celeborn/templates/worker-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: celeborn-worker
  labels:
    app.kubernetes.io/name: celeborn-worker
    app.kubernetes.io/role: worker
    helm.sh/chart: celeborn-0.1.1
    app.kubernetes.io/instance: celeborn-helm
    app.kubernetes.io/version: "1.16.0"
    app.kubernetes.io/managed-by: Helm
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: celeborn-worker
      app.kubernetes.io/role: worker
      app.kubernetes.io/instance: celeborn-helm
  serviceName: celeborn-worker
  replicas: 2
  template:
    metadata:
      labels:
        app.kubernetes.io/name: celeborn-worker
        app.kubernetes.io/role: worker
        app.kubernetes.io/instance: celeborn-helm
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app.kubernetes.io/name
                    operator: In
                    values:
                      - celeborn-worker
              topologyKey: "kubernetes.io/hostname"
      containers:
        - name: celeborn
          image: "remote-shuffle-service:0.1.1-6badd20"
          imagePullPolicy: IfNotPresent
          command:
            - "/usr/bin/tini"
            - "--"
            - "/bin/sh"
            - '-c'
            - "/opt/rss-0.1.3-bin-release/sbin/start-worker.sh && sleep 200"
          resources:
            null
          ports:
            - containerPort: 9098
              name: metrics
              protocol: TCP
          volumeMounts:
            - mountPath: /opt/rss-0.1.3-bin-release/conf
              name: celeborn-helm-volume
              readOnly: true
            - mountPath: /data/datarss1
              name: vol-0
            - mountPath: /data/datarss2
              name: vol-1
          env:
            - name: CELEBORN_MASTER_JAVA_OPTS
              value: "-XX:-PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:gc-master.out -Dio.netty.leakDetectionLevel=advanced"
            - name: CELEBORN_MASTER_MEMORY
              value: "2g"
            - name: CELEBORN_NO_DAEMONIZE
              value: "yes"
            - name: CELEBORN_WORKER_JAVA_OPTS
              value: "-XX:-PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:gc-worker.out -Dio.netty.leakDetectionLevel=advanced"
            - name: CELEBORN_WORKER_MEMORY
              value: "2g"
            - name: CELEBORN_WORKER_OFFHEAP_MEMORY
              value: "12g"
            - name: TZ
              value: "Asia/Shanghai"
      terminationGracePeriodSeconds: 30
      volumes:
        - configMap:
            name: celeborn-conf
          name: celeborn-helm-volume
        - hostPath:
            path: /data/datarss1/worker
            type: DirectoryOrCreate
          name: vol-0
        - hostPath:
            path: /data/datarss2/worker
            type: DirectoryOrCreate
          name: vol-1
NOTES:
Celeborn
[root@k8s01 docker]#
@jiangbiao910 If you want to use the Helm chart, check out commit 7a858bd7 on the main branch. The Helm chart is configured with branch 0.1.
What is the bug?
A NullPointerException is thrown when I start a worker on a node that uses LVM. As I see it, com.aliyun.emr.rss.service.deploy.worker.DeviceInfo#getDeviceAndMountInfos does not support LVM, so it cannot get the correct mount information.
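The hypothesis can be illustrated with a small sketch. It assumes, without having confirmed it against DeviceInfo.scala, that mount sources are matched to block devices by name prefix; has_block_device is a hypothetical helper and the device names used with it are made-up examples.

```shell
# has_block_device DEVICE BLOCK...: succeed if DEVICE (for example /dev/sda1) is
# one of the given block devices or a partition of one, judged by name prefix only.
# Hypothetical helper that mimics the assumed lookup; it is not RSS code.
has_block_device() {
  dev=${1#/dev/}          # /dev/sda1 -> sda1, /dev/mapper/vg0-data -> mapper/vg0-data
  shift
  for block in "$@"; do
    case "$dev" in
      "$block"*) return 0 ;;
    esac
  done
  return 1
}
```

With block names such as sda and dm-0, a source like /dev/sda1 matches, while an LVM source like /dev/mapper/vg0-data matches nothing. If the worker looks up device information by such a match, the lookup would come back empty for LVM mounts, which would be consistent with the NullPointerException.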
How to reproduce the bug?
Start a worker with
$RSS_HOME/sbin/start-worker.sh rss://node01:9097
on a node which uses LVM.
Could you share logs or screenshots?
/cc @who-need-to-know
/assign @who-can-solve-this-bug