Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0

alluxio worker OOMKilled #18681

Open XiXiTan opened 2 months ago

XiXiTan commented 2 months ago

Alluxio Version: What version of Alluxio are you using? 2.9.0.1

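Describe the bug A clear and concise description of what the bug is. The memory settings leave headroom, yet the worker pod still gets OOMKilled. Which part of the memory usage might be exceeding expectations? And why does the cache use more than the configured value?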

Pod resource request: cpu: 4, memory: 16G. Planned memory usage: xmx=4g, MaxDirectMemorySize=4g, alluxio.worker.ramdisk.size=6g, reserved memory=2g

Detailed memory settings: /usr/lib/jvm/java-1.8.0-openjdk/bin/java -cp /opt/alluxio-2.9.0.1-noHelm/conf/::/opt/alluxio/ranger-lib/*:/opt/alluxio-2.9.0.1-noHelm/assembly/alluxio-server-2.9.0.1.jar -Dalluxio.logger.type=Console,WORKER_LOGGER -Dsun.security.krb5.disableReferrals=true -Dalluxio.home=/opt/alluxio-2.9.0.1-noHelm -Dalluxio.conf.dir=/opt/alluxio-2.9.0.1-noHelm/conf -Dalluxio.logs.dir=/opt/alluxio-2.9.0.1-noHelm/logs -Dalluxio.user.logs.dir=/opt/alluxio-2.9.0.1-noHelm/logs/user -Dlog4j.configuration=file:/opt/alluxio-2.9.0.1-noHelm/conf/log4j.properties -Dorg.apache.jasper.compiler.disablejsr199=true -Djava.net.preferIPv4Stack=true -Dorg.apache.ratis.thirdparty.io.netty.allocator.useCacheForAllThreads=false -Dalluxio.worker.hostname=ip -Xmx4096M -XX:MaxDirectMemorySize=4096M alluxio.worker.AlluxioWorker

conf/alluxio-site.properties alluxio.worker.ramdisk.size=6144M
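The budget described above can be checked with simple arithmetic. The sketch below is a hypothetical helper (`WorkerMemoryBudget` is not part of Alluxio) that sums the three configured consumers and compares them with the pod limit. It assumes the ramdisk is a tmpfs whose pages are charged to the container's memory limit, and it does not account for JVM memory outside the heap and direct buffers (metaspace, thread stacks, code cache, GC structures). It also evaluates both readings of the limit, because Kubernetes treats `16G` as 16 × 10^9 bytes and `16Gi` as 16 × 2^30 bytes; which one the pod spec actually uses is not shown in this report.

```java
// Hypothetical helper, not part of Alluxio: sums the configured memory
// consumers from this report and compares them with the pod memory limit.
class WorkerMemoryBudget {
  static final long MIB = 1024L * 1024L;

  /** Sum of the explicitly configured consumers, in bytes. */
  static long configuredBytes(long heapMiB, long directMiB, long ramdiskMiB) {
    return (heapMiB + directMiB + ramdiskMiB) * MIB;
  }

  /** Headroom in whole MiB under the given limit (negative if exceeded). */
  static long headroomMiB(long limitBytes, long configuredBytes) {
    return (limitBytes - configuredBytes) / MIB;
  }

  public static void main(String[] args) {
    // -Xmx4096M, -XX:MaxDirectMemorySize=4096M, alluxio.worker.ramdisk.size=6144M
    long configured = configuredBytes(4096, 4096, 6144);
    long limitGi = 16L * 1024 * MIB;   // limit if the pod spec says "16Gi"
    long limitG = 16_000_000_000L;     // limit if the pod spec says "16G"

    System.out.println("configured: " + configured / MIB + " MiB");
    System.out.println("headroom with 16Gi: " + headroomMiB(limitGi, configured) + " MiB");
    System.out.println("headroom with 16G:  " + headroomMiB(limitG, configured) + " MiB");
  }
}
```

If the limit is the decimal `16G`, the room left for everything not listed in the budget is well under the intended 2g.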

Cache usage:

(screenshot: cache usage)

CPU and memory of the affected pod:

(screenshot: worker cpu)

(screenshot: worker mem)

To Reproduce Steps to reproduce the behavior (as minimally and precisely as possible)

Expected behavior A clear and concise description of what you expected to happen. The worker pod should not be OOMKilled.

Urgency Describe the impact and urgency of the bug.

Are you planning to fix it Please indicate if you are already working on a PR.

Additional context Add any other context about the problem here.

XiXiTan commented 2 months ago

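Another small question: when the worker cache is set to 512M, it actually uses 1024M. That exceeds the configured 512M and does not match the expected worker cache usage.

| | Total | MEM | HDD |
|---|---|---|---|
| capacity | 30.50GB | 512.00MB | 30.00GB |
| used | 4083.94MB (13%) | 1024.00MB | 3059.94MB |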
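The tier figures quoted above can be cross-checked numerically. This is a minimal sketch with a hypothetical class name (`TierUsageCheck`); it only reuses the numbers from this comment and assumes the first capacity and used values are totals across the MEM and HDD tiers.

```java
// Hypothetical cross-check of the capacity/used figures quoted in this comment.
class TierUsageCheck {
  /** True if a tier reports more used space than its configured capacity. */
  static boolean overCapacity(double usedMb, double capacityMb) {
    return usedMb > capacityMb;
  }

  /** Used space as a rounded percentage of capacity. */
  static long usedPercent(double usedMb, double capacityMb) {
    return Math.round(100.0 * usedMb / capacityMb);
  }

  public static void main(String[] args) {
    double memCapacityMb = 512.00, memUsedMb = 1024.00;
    double hddCapacityMb = 30.00 * 1024, hddUsedMb = 3059.94;

    System.out.println("MEM over capacity: " + overCapacity(memUsedMb, memCapacityMb));
    System.out.println("HDD over capacity: " + overCapacity(hddUsedMb, hddCapacityMb));
    System.out.println("total used %: "
        + usedPercent(memUsedMb + hddUsedMb, memCapacityMb + hddCapacityMb));
  }
}
```

Only the MEM tier reports more used space than its capacity; the totals are consistent with the sum of the two tiers.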

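In the source code I only see the default for the case where the cache size is not set: it takes 2/3 of the system memory, or falls back to 1GB. I do not see any logic that would use a different value when the cache size is specified.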

```java
public static final PropertyKey WORKER_RAMDISK_SIZE =
    dataSizeBuilder(Name.WORKER_RAMDISK_SIZE)
        .setAlias(Name.WORKER_MEMORY_SIZE)
        .setDefaultSupplier(() -> {
          try {
            OperatingSystemMXBean operatingSystemMXBean =
                (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
            return operatingSystemMXBean.getTotalPhysicalMemorySize() * 2 / 3;
          } catch (Throwable e) {
            // The package com.sun.management may not be available on every platform.
            // fallback to a reasonable size.
            return "1GB";
          }
        }, "2/3 of total system memory, or 1GB if system memory size cannot be determined")
        .setDescription("The allocated memory for each worker node's ramdisk(s). "
```
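To try the default logic outside Alluxio, the snippet can be reduced to a standalone sketch. Everything here is hypothetical scaffolding (`RamdiskDefaultSketch`, `twoThirds`, `effectiveRamdiskSize` are not Alluxio names), and the rule that an explicitly configured value takes precedence over the default supplier is an assumption that matches my reading of the source, not something confirmed in this report.

```java
import java.lang.management.ManagementFactory;

// Hypothetical standalone sketch; class and method names are not from Alluxio.
class RamdiskDefaultSketch {
  /** The default used when the property is unset: 2/3 of physical memory. */
  static long twoThirds(long totalPhysicalBytes) {
    return totalPhysicalBytes * 2 / 3;
  }

  /** Mirrors the supplier quoted above: a byte count, or "1GB" as a fallback. */
  static Object defaultRamdiskSize() {
    try {
      com.sun.management.OperatingSystemMXBean os =
          (com.sun.management.OperatingSystemMXBean)
              ManagementFactory.getOperatingSystemMXBean();
      return twoThirds(os.getTotalPhysicalMemorySize());
    } catch (Throwable e) {
      // com.sun.management may not be available on every platform.
      return "1GB";
    }
  }

  /** Assumption: an explicit setting wins and the default is never consulted. */
  static Object effectiveRamdiskSize(String configured) {
    return configured != null ? configured : defaultRamdiskSize();
  }

  public static void main(String[] args) {
    System.out.println("unset      -> " + effectiveRamdiskSize(null));
    System.out.println("configured -> " + effectiveRamdiskSize("512MB"));
  }
}
```

Under that assumption a configured 512MB would never be replaced by the default, so the 1024MB of reported usage would have to come from somewhere other than this property's default.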