zuston opened 1 year ago
It's a good issue. In fact, in some cases the host resources are strictly allocated. It would be better if we could test the performance of Uniffle under different memory sizes; this can provide a reference for resource allocation for the Uniffle server.
But in my case, an error was reported when running the 1 TB TPC-DS q4.sql test with memory set to 4g for the Uniffle server and 2g for the coordinator. We only limit Uniffle's resources because, compared to Spark's sort shuffle, Uniffle's resources are additional. The configs are as follows:
Test setup: TPC-DS data generated at 1 TB, query q4.sql, stored in Hive 2.1.1; Spark 2.4.4, Uniffle 0.7.1, Hadoop 3.2.1.
spark-defaults.conf:
spark.driver.memory 32g
spark.driver.memoryOverhead 8g
spark.executor.memory 64g
spark.executor.memoryOverhead 16g
spark.executor.instances 6
spark.driver.cores 16
spark.executor.cores 32
spark.sql.autoBroadcastJoinThreshold 2
spark.sql.broadcastTimeout 300
spark.executor.heartbeatInterval 60
spark.network.timeout 600
spark.rss.client.send.check.timeout.ms 1800000
coordinator.conf (we modified the JVM_ARGS parameter in the start-coordinator.sh file to limit the coordinator resource to 2g):
rss.coordinator.server.heartbeat.timeout 30000
rss.coordinator.server.periodic.output.interval.times 30
rss.coordinator.assignment.strategy PARTITION_BALANCE
rss.coordinator.app.expired 60000
rss.coordinator.shuffle.nodes.max 3
rss.coordinator.exclude.nodes.file.path
rss.coordinator.exclude.nodes.check.interval.ms 60000
rss.coordinator.access.checkers
rss.coordinator.access.loadChecker.memory.percentage 5.0
rss.coordinator.dynamicClientConf.enabled true
rss.coordinator.dynamicClientConf.path /etc/confpath/dynamic_client.conf
rss.coordinator.dynamicClientConf.updateIntervalSec 120
rss.coordinator.remote.storage.cluster.conf
rss.rpc.server.port 19995
rss.jetty.http.port 19994
rss.coordinator.remote.storage.select.strategy APP_BALANCE
rss.coordinator.remote.storage.io.sample.schedule.time 60000
rss.coordinator.remote.storage.io.sample.file.size 204800000
rss.coordinator.remote.storage.io.sample.access.times 3
rss.coordinator.startup-silent-period.enabled false
rss.coordinator.startup-silent-period.duration 20000
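For reference, limiting the coordinator heap as mentioned above means editing the JVM options in start-coordinator.sh. A minimal sketch of such an edit, assuming the script exposes the JVM_ARGS variable mentioned above (the exact contents depend on the Uniffle version, so adjust to your script):
# start-coordinator.sh (hypothetical excerpt): cap the coordinator heap at 2g
JVM_ARGS="-server -Xmx2g -Xms2g"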
dynamic_client.conf:
rss.storage.type MEMORY_LOCALFILE_HDFS
rss.coordinator.remote.storage.path hdfs://ip:8020/tmp
rss.writer.require.memory.retryMax 1200
rss.client.retry.max 50
rss.writer.send.check.timeout 600000
rss.client.read.buffer.size 14m
rss.client.send.check.timeout.ms 1800000
server.conf:
rss.rpc.server.port 19992
rss.jetty.http.port 19991
rss.rpc.executor.size 1000
rss.storage.basePath /mnt/ssd1/data/rssdata,/mnt/ssd2/data/rssdata,/mnt/ssd3/data/rssdata,/mnt/ssd4/data/rssdata,/mnt/ssd5/data/rssdata,/mnt/disk1/data/rssdata,/mnt/disk2/data/rssdata,/mnt/disk3/data/rssdata,/mnt/disk4/data/rssdata,/mnt/disk5/data/rssdata
rss.server.flush.thread.alive 120
rss.server.heartbeat.timeout 60000
rss.rpc.message.max.size 1073741824
rss.server.preAllocation.expired 20000
rss.server.app.expired.withoutHeartbeat 60000
rss.server.buffer.capacity 3g
rss.server.memory.shuffle.highWaterMark.percentage 10.0
rss.server.memory.shuffle.lowWaterMark.percentage 5.0
rss.server.read.buffer.capacity 512m
rss.server.heartbeat.interval 10000
rss.server.flush.threadPool.size 10
rss.server.commit.timeout 600000
rss.storage.type MEMORY_LOCALFILE_HDFS
rss.server.flush.cold.storage.threshold.size 32m
rss.server.tags
rss.server.single.buffer.flush.enabled true
rss.server.single.buffer.flush.threshold 32m
rss.server.disk.capacity -1
rss.server.localstorage.initialize.max.fail.number 1
rss.coordinator.access.loadChecker.serverNum.threshold 1
rss.coordinator.access.candidates.updateIntervalSec 120
rss.coordinator.access.candidates.path
rss-env.sh:
RSS_HOME=/usr/lib/uniffle
RSS_CONF_DIR=/etc/uniffle
RSS_PID_DIR=/var/uniffle
RSS_LOG_DIR=/mnt/disk1/log/uniffle
JAVA_HOME=/usr/lib/jdk1.8
HADOOP_HOME=/usr/lib/hadoop
XMX_SIZE=4g
RSS_IP=0.0.0.0
The Uniffle server was normal, but Spark threw an exception and the task ended:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 25 in stage 1.0 failed 4 times, most recent failure: Lost task 25.3 in stage 1.0 (TID 4604, xshbp225.bigdata.hikvision.com, executor 5): org.apache.uniffle.common.exception.RssException: Send failed: Task[4604_3] failed because 200 blocks can't be sent to shuffle server.
at org.apache.spark.shuffle.writer.RssShuffleWriter.checkBlockSendResult(RssShuffleWriter.java:299)
at org.apache.spark.shuffle.writer.RssShuffleWriter.writeImpl(RssShuffleWriter.java:194)
at org.apache.spark.shuffle.writer.RssShuffleWriter.write(RssShuffleWriter.java:167)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Some tuning attempts (none of them helped):
rss.server.buffer.capacity 3g or 2g
rss.server.memory.shuffle.highWaterMark.percentage 10.0 or 20.0, 30.0
rss.server.memory.shuffle.lowWaterMark.percentage 5.0 or 10.0
rss.server.single.buffer.flush.threshold 32m or 64m, 16m, 4m
rss.server.read.buffer.capacity 512m or 256m, 1g
rss.server.single.buffer.flush.enabled true or false
Does anyone have other parameter-tuning suggestions for Uniffle under limited-resource conditions? Thanks.
Thanks for sharing. I will test it tomorrow.
Something was wrong! Pay close attention to the setting RSS_IP=0.0.0.0. Because of a multi-network-interface-card environment, I set RSS_IP=0.0.0.0 to adapt (any suggestions for multi-NIC environments?), but that looks wrong.
I removed it and set rss.server.high.watermark.write 20, rss.server.low.watermark.write 5, and XMX_SIZE=4g. Now the task runs normally, but some executors died and the job is slow. The result will be reported soon! One thing still confuses me.
I set XMX_SIZE=4g, but top -p process_id shows RES 14.6g, which is bigger than 4g. Can someone explain this? Thanks. The top -p process_id output is as follows:
top - 09:27:59 up 8 days, 29 min, 4 users, load average: 31.17, 32.44, 32.10
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
%Cpu(s): 36.7 us, 4.9 sy, 0.0 ni, 53.5 id, 4.6 wa, 0.0 hi, 0.3 si, 0.0 st
KiB Mem : 52776729+total, 11288404+free, 16546635+used, 24941689+buff/cache
KiB Swap: 0 total, 0 free, 0 used. 36083699+avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
297718 root 20 0 38.9g 14.6g 14.9m S 35.7 2.9 17:37.99 java
https://stackoverflow.com/questions/48798024/java-consumes-memory-more-than-xmx-argument
Sometimes native and off-heap allocations occupy memory beyond the heap; this is general JVM behavior, not a Uniffle problem.
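To see where the extra resident memory beyond -Xmx goes, one option is the JVM's Native Memory Tracking. A minimal sketch, assuming you can restart the shuffle server JVM with the extra flag (for example by adding it to the server start script's JVM options):
# 1. Start the server JVM with Native Memory Tracking enabled (adds a small overhead):
#      -XX:NativeMemoryTracking=summary
# 2. Ask the running JVM for a breakdown of heap, thread stacks, metaspace, GC, code cache, and other native allocations:
jcmd <pid> VM.native_memory summary
# 3. Off-heap direct buffers used by the network stack can additionally be capped, e.g.:
#      -XX:MaxDirectMemorySize=1g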
I have the same proposal for the low-memory shuffle server:
If the client write speed is kept under control, the Spark client could support features like back pressure or sleeping for some seconds after writing a certain number of records. This would prevent the server memory from growing too quickly.
I use these client-side configurations to control the writing speed (a spark-submit sketch follows the list):
spark.rss.client.send.threadPool.size 1
spark.rss.client.data.transfer.pool.size 1
spark.rss.client.retry.max 100
spark.rss.client.retry.interval.max 60000
spark.rss.client.send.check.timeout.ms 2100000
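For example, they can be passed per job instead of as cluster-wide defaults; a sketch assuming spark-submit, with the same values as above:
spark-submit \
  --conf spark.rss.client.send.threadPool.size=1 \
  --conf spark.rss.client.data.transfer.pool.size=1 \
  --conf spark.rss.client.retry.max=100 \
  --conf spark.rss.client.retry.interval.max=60000 \
  --conf spark.rss.client.send.check.timeout.ms=2100000 \
  ... # remaining application arguments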
Conclusion: different memory sizes have little impact on the performance of uniffle. For reference only!
We used 1 TB TPC-DS data on 3 nodes, with the configuration shown above. The test results are as follows, covering a total of 10 SQL statements: q4, q11, q17, q50, q64, q67, q14, q78, q93, q94.
The custom configuration is as follows:
spark.driver.memory 32g
spark.driver.memoryOverhead 8g
spark.executor.memory 64g
spark.executor.memoryOverhead 16g
spark.executor.instances 6
spark.driver.cores 16
spark.executor.cores 32
-Xmx4g
-XX:MaxDirectMemorySize=1g
XMX_SIZE=2g
rss.server.buffer.capacity=1400m
rss.server.read.buffer.capacity=700m
rss.server.memory.shuffle.highWaterMark.percentage="30.0"
rss.server.memory.shuffle.lowWaterMark.percentage="5.0"
rss.server.single.buffer.flush.threshold="32m"
rss.server.flush.cold.storage.threshold.size="32m"
XMX_SIZE=4g
rss.server.buffer.capacity=2800m
rss.server.read.buffer.capacity=1400m
rss.server.memory.shuffle.highWaterMark.percentage="40.0"
rss.server.memory.shuffle.lowWaterMark.percentage="5.0"
rss.server.single.buffer.flush.threshold="32m"
rss.server.flush.cold.storage.threshold.size="32m"
XMX_SIZE=8g
rss.server.buffer.capacity=5600m
rss.server.read.buffer.capacity=2800m
rss.server.memory.shuffle.highWaterMark.percentage="50.0"
rss.server.memory.shuffle.lowWaterMark.percentage="10.0"
rss.server.single.buffer.flush.threshold="32m"
rss.server.flush.cold.storage.threshold.size="32m"
XMX_SIZE=16g
rss.server.buffer.capacity=11200m
rss.server.read.buffer.capacity=5600m
rss.server.memory.shuffle.highWaterMark.percentage="70.0"
rss.server.memory.shuffle.lowWaterMark.percentage="20.0"
rss.server.single.buffer.flush.threshold="64m"
rss.server.flush.cold.storage.threshold.size="64m"
XMX_SIZE=24g
rss.server.buffer.capacity=16800m
rss.server.read.buffer.capacity=8400m
rss.server.memory.shuffle.highWaterMark.percentage="75.0"
rss.server.memory.shuffle.lowWaterMark.percentage="25.0"
rss.server.single.buffer.flush.threshold="64m"
rss.server.flush.cold.storage.threshold.size="64m"
XMX_SIZE=32g
rss.server.buffer.capacity=22400m
rss.server.read.buffer.capacity=11200m
rss.server.memory.shuffle.highWaterMark.percentage="75.0"
rss.server.memory.shuffle.lowWaterMark.percentage="25.0"
rss.server.single.buffer.flush.threshold="64m"
rss.server.flush.cold.storage.threshold.size="64m"
XMX_SIZE=64g
rss.server.buffer.capacity=44800m
rss.server.read.buffer.capacity=22400m
rss.server.memory.shuffle.highWaterMark.percentage="75.0"
rss.server.memory.shuffle.lowWaterMark.percentage="25.0"
rss.server.single.buffer.flush.threshold="64m"
rss.server.flush.cold.storage.threshold.size="64m"
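Across these tiers, rss.server.buffer.capacity is consistently about 70% of XMX_SIZE and rss.server.read.buffer.capacity about 35%, so other heap sizes can be derived the same way. A small sketch of that rule of thumb (the percentages are inferred from the values above, not an official recommendation; 1g is treated as 1000m to match the numbers above):
# Derive the two buffer capacities from the server heap size
XMX_MB=4000                                                        # XMX_SIZE=4g, with 1g = 1000m as above
echo "rss.server.buffer.capacity=$(( XMX_MB * 70 / 100 ))m"        # -> 2800m
echo "rss.server.read.buffer.capacity=$(( XMX_MB * 35 / 100 ))m"   # -> 1400m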
I think we need more documentation sections on how to configure the client/server for a low-memory setup.
From my point of view, once you configure less memory, you have to accept the slowdown caused by back pressure under heavier load. Uniffle uses memory as a high-speed buffer, but once the buffer is full, the rate of receiving shuffle data drops to the maximum write rate of the disks, so the writers are back-pressured.
In this case, we have two ways to avoid task failure (see the placement sketch after the list):
1. On the server side, tune the disk write watermarks:
rss.server.high.watermark.write 20
rss.server.low.watermark.write 10
2. On the client side, enlarge the send check timeout:
rss.client.send.check.timeout.ms 1200000
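A sketch of where these settings go, assuming the watermark settings belong in the shuffle server's server.conf and the client timeout is passed through the Spark job configuration with the spark. prefix (as in the spark-defaults.conf above):
# server.conf (shuffle server side)
rss.server.high.watermark.write 20
rss.server.low.watermark.write 10
# Spark job side, e.g. spark-defaults.conf or --conf on spark-submit
spark.rss.client.send.check.timeout.ms 1200000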