apache / incubator-uniffle

Uniffle is a high performance, general purpose Remote Shuffle Service.
https://uniffle.apache.org/
Apache License 2.0
384 stars 149 forks source link

[DOCS] shuffle-server with low-memory best practice #1169

Open zuston opened 1 year ago

zuston commented 1 year ago

I think we need to describe more sections to how to configure the client/server config on a low memory config.

From my point of view, once you configure lower memory, you have to accept the slowdown caused by back pressure when running more. As we all know, Uniffle uses memory as a high-speed buffer, but once the buffer is full, the rate of receiving shuffle data will be reduced to the highest write rate of the disk, so it will be back pressured.

In this case, we have two ways to avoid task failure

  1. make the lower watermark for memory store. such as rss.server.high.watermark.write 20 , rss.server.low.watermark.write 10
  2. increase the client timeout. rss.client.send.check.timeout.ms 1200000
kunzeng commented 1 year ago

It's a good issue. In fact, in some cases, the resources of the hosts are strictly allocated. A better thing is that we can test the performance of uniffle under different memory sizes. This can provides a reference for resource allocation for unifle server.

But, in my case, an error was reported when running the 1tb tpc-ds q4.sql test with the memory set to 4g for uniffle server, 2g for coordinator. Just limit the resource of uniffle, because compared to the sortshuffle(spark) , the resources of uniffle are additional. the confs are as follows:

test data: tpc-ds, generated by 1tb, query q4.sql, stored in hive2.1.1, spark2.4.4, uniffle0.7.1, hadoop3.2.1

spark-defaults.conf: spark.driver.memory 32g spark.driver.memoryOverhead 8g spark.executor.memory 64g spark.executor.memoryOverhead 16g spark.executor.instances 6 spark.driver.cores 16 spark.executor.cores 32 spark.sql.autoBroadcastJoinThreshold 2 spark.sql.broadcastTimeout 300 spark.executor.heartbeatInterval 60 spark.network.timeout 600 spark.rss.client.send.check.timeout.ms 1800000

coordinator.conf(We have modified the JVM_ARGS parameter in the start-coordinator. sh file to limit the coordinator resource to 2g, ): rss.coordinator.server.heartbeat.timeout 30000 rss.coordinator.server.periodic.output.interval.times 30 rss.coordinator.assignment.strategy PARTITION_BALANCE rss.coordinator.app.expired 60000 rss.coordinator.shuffle.nodes.max 3 rss.coordinator.exclude.nodes.file.path rss.coordinator.exclude.nodes.check.interval.ms 60000 rss.coordinator.access.checkers rss.coordinator.access.loadChecker.memory.percentage 5.0 rss.coordinator.dynamicClientConf.enabled true rss.coordinator.dynamicClientConf.path /etc/confpath/dynamic_client.conf rss.coordinator.dynamicClientConf.updateIntervalSec 120 rss.coordinator.remote.storage.cluster.conf rss.rpc.server.port 19995 rss.jetty.http.port 19994 rss.coordinator.remote.storage.select.strategy APP_BALANCE rss.coordinator.remote.storage.io.sample.schedule.time 60000 rss.coordinator.remote.storage.io.sample.file.size 204800000 rss.coordinator.remote.storage.io.sample.access.times 3 rss.coordinator.startup-silent-period.enabled false rss.coordinator.startup-silent-period.duration 20000

dynamic_client.conf: rss.storage.type MEMORY_LOCALFILE_HDFS rss.coordinator.remote.storage.path hdfs://ip:8020/tmp rss.writer.require.memory.retryMax 1200 rss.client.retry.max 50 rss.writer.send.check.timeout 600000 rss.client.read.buffer.size 14m rss.client.send.check.timeout.ms 1800000

server.conf: rss.rpc.server.port 19992 rss.jetty.http.port 19991 rss.rpc.executor.size 1000 rss.storage.basePath /mnt/ssd1/data/rssdata,/mnt/ssd2/data/rssdata,/mnt/ssd3/data/rssdata,/mnt/ssd4/data/rssdata,/mnt/ssd5/data/rssdata,/mnt/disk1/data/rssdata,/mnt/disk2/data/rssdata,/mnt/disk3/data/rssdata,/mnt/disk4/data/rssdata,/mnt/disk5/data/rssdata rss.server.flush.thread.alive 120 rss.server.heartbeat.timeout 60000 rss.rpc.message.max.size 1073741824 rss.server.preAllocation.expired 20000 rss.server.app.expired.withoutHeartbeat 60000 rss.server.buffer.capacity 3g rss.server.memory.shuffle.highWaterMark.percentage 10.0 rss.server.memory.shuffle.lowWaterMark.percentage 5.0 rss.server.read.buffer.capacity 512m rss.server.heartbeat.interval 10000 rss.server.flush.threadPool.size 10 rss.server.commit.timeout 600000 rss.storage.type MEMORY_LOCALFILE_HDFS rss.server.flush.cold.storage.threshold.size 32m rss.server.tags rss.server.single.buffer.flush.enabled true rss.server.single.buffer.flush.threshold 32m rss.server.disk.capacity -1 rss.server.localstorage.initialize.max.fail.number 1 rss.coordinator.access.loadChecker.serverNum.threshold 1 rss.coordinator.access.candidates.updateIntervalSec 120 rss.coordinator.access.candidates.path

rss-env.sh: RSS_HOME=/usr/lib/uniffle RSS_CONF_DIR=/etc/uniffle RSS_PID_DIR=/var/uniffle RSS_LOG_DIR=/mnt/disk1/log/uniffle JAVA_HOME=/usr/lib/jdk1.8 HADOOP_HOME=/usr/lib/hadoop XMX_SIZE=4g RSS_IP=0.0.0.0

uniffle server is normal, but spark thrown an exception, and the task ends, exts. org.apache.spark.SparkException: Job aborted due to stage failure: Task 25 in stage 1.0 failed 4 times, most recent failure: Lost task 25.3 in stage 1.0 (TID 4604, xshbp225.bigdata.hikvision.com, executor 5): org.apache.uniffle.common.exception.RssException: Send failed: Task[4604_3] failed because 200 blocks can't be sent to shuffle server. at org.apache.spark.shuffle.writer.RssShuffleWriter.checkBlockSendResult(RssShuffleWriter.java:299) at org.apache.spark.shuffle.writer.RssShuffleWriter.writeImpl(RssShuffleWriter.java:194) at org.apache.spark.shuffle.writer.RssShuffleWriter.write(RssShuffleWriter.java:167) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) at org.apache.spark.scheduler.Task.run(Task.scala:123) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834)

Some optimization operations(useless): set rss.server.buffer.capacity 3g or 2g rss.server.memory.shuffle.highWaterMark.percentage 10.0 or 20.0, 30.0 rss.server.memory.shuffle.lowWaterMark.percentage 5.0 or 10.0 rss.server.single.buffer.flush.threshold 32m or 64m, 16m, 4m rss.server.read.buffer.capacity 512m or 256m, 1g rss.server.single.buffer.flush.enabled true or false

Big workers,have any other parameter tuning or suggestions for uniflle under limited resource conditions? thanks

zuston commented 1 year ago

Thanks for your share. I will test it in tomorrow.

kunzeng commented 1 year ago

something wrong! Pay attention and attention to the conf of RSS_IP=0.0.0.0. Due to multiple network interface card env, i set rss_ip=0.0.0.0 to adapt(For multiple network interface card, some suggestions?), looks wrong.

Remove it and set rss.server.high.watermark.write 20 , rss.server.low.watermark.write 5, XMX_SIZE=4g. Now task is normal, but some executors died and slow. The result will be reported soon! Something confuses me.

I set XMX_SIZE=4g, but top -p process_id display RES 14.6g, bigger than 4g? U may make me clear! thanks. top -p process_id display as fellow.

top - 09:27:59 up 8 days, 29 min, 4 users, load average: 31.17, 32.44, 32.10 Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie %Cpu(s): 36.7 us, 4.9 sy, 0.0 ni, 53.5 id, 4.6 wa, 0.0 hi, 0.3 si, 0.0 st KiB Mem : 52776729+total, 11288404+free, 16546635+used, 24941689+buff/cache KiB Swap: 0 total, 0 free, 0 used. 36083699+avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
297718 root 20 0 38.9g 14.6g 14.9m S 35.7 2.9 17:37.99 java

zuston commented 1 year ago

https://stackoverflow.com/questions/48798024/java-consumes-memory-more-than-xmx-argument Sometimes. the native and off-heap will occupy some memory, this is the java's problem.

xumanbu commented 1 year ago

I have same proposal in the low-memory shuffle server :

if write speed of client is under control, spark client support features like BackPressure or Sleep Seconds after write same records. This will prevent the server memory from growing too quickly.

these same configurations in the client to control the writing speed.

spark.rss.client.send.threadPool.size = 1
spark.rss.client.data.transfer.pool.size 1
spark.rss.client.retry.max 100
spark.rss.client.retry.interval.max 60000
spark.rss.client.send.check.timeout.ms 2100000
kunzeng commented 1 year ago

Conclusion: different memory sizes have little impact on the performance of uniffle. For reference only!

We used 1t tpc-ds, 3 nodes, and the configuration is as shown above. The test results are as follows, with a total of 10 SQL statements q (4 11 17 50 64 67 14 78 93 94). SQL. image

The custom configuration is as follows:

spark-defaults.conf

spark.driver.memory 32g spark.driver.memoryOverhead 8g spark.executor.memory 64g spark.executor.memoryOverhead 16g spark.executor.instances 6 spark.driver.cores 16 spark.executor.cores 32

coordinator.conf

-Xmx4g

server.conf

-XX:MaxDirectMemorySize=1g

2g conf

XMX_SIZE=2g
rss.server.buffer.capacity=1400m
rss.server.read.buffer.capacity=700m
rss.server.memory.shuffle.highWaterMark.percentage="30.0"
rss.server.memory.shuffle.lowWaterMark.percentage="5.0"
rss.server.single.buffer.flush.threshold="32m"
rss.server.flush.cold.storage.threshold.size="32m" 

4g conf

XMX_SIZE=4g
rss.server.buffer.capacity=2800m
rss.server.read.buffer.capacity=1400m
rss.server.memory.shuffle.highWaterMark.percentage="40.0"
rss.server.memory.shuffle.lowWaterMark.percentage="5.0"
rss.server.single.buffer.flush.threshold="32m"
rss.server.flush.cold.storage.threshold.size="32m"

8g server.conf

XMX_SIZE=8g
rss.server.buffer.capacity=5600m
rss.server.read.buffer.capacity=2800m
rss.server.memory.shuffle.highWaterMark.percentage="50.0"
rss.server.memory.shuffle.lowWaterMark.percentage="10.0"
rss.server.single.buffer.flush.threshold="32m"
rss.server.flush.cold.storage.threshold.size="32m"

16g server.conf

XMX_SIZE=16g
rss.server.buffer.capacity=11200m
rss.server.read.buffer.capacity=5600m
rss.server.memory.shuffle.highWaterMark.percentage="70.0"
rss.server.memory.shuffle.lowWaterMark.percentage="20.0"
rss.server.single.buffer.flush.threshold="64m"
rss.server.flush.cold.storage.threshold.size="64m"

24g server.conf

XMX_SIZE=24g
rss.server.buffer.capacity= 16800m
rss.server.read.buffer.capacity=8400m
rss.server.memory.shuffle.highWaterMark.percentage="75.0"
rss.server.memory.shuffle.lowWaterMark.percentage="25.0"
rss.server.single.buffer.flush.threshold="64m"
rss.server.flush.cold.storage.threshold.size="64m"

32g conf

XMX_SIZE=32g
rss.server.buffer.capacity= 22400m
rss.server.read.buffer.capacity= 11200m
rss.server.memory.shuffle.highWaterMark.percentage="75.0"
rss.server.memory.shuffle.lowWaterMark.percentage="25.0"
rss.server.single.buffer.flush.threshold="64m"

rss.server.flush.cold.storage.threshold.size="64m"

64g conf

XMX_SIZE=64g
rss.server.buffer.capacity= 44800m
rss.server.read.buffer.capacity= 22400m
rss.server.memory.shuffle.highWaterMark.percentage="75.0"
rss.server.memory.shuffle.lowWaterMark.percentage="25.0"
rss.server.single.buffer.flush.threshold="64m"
rss.server.flush.cold.storage.threshold.size="64m"