apache / incubator-uniffle

Uniffle is a high performance, general purpose Remote Shuffle Service.
https://uniffle.apache.org/
Apache License 2.0

[Bug] shuffle server have blocked threads #1026

Open smlHao opened 1 year ago

smlHao commented 1 year ago

Describe the bug

@jerqi @zuston

Hi, when a huge table is joined with another huge table, the shuffle server has blocked threads. Is that expected?

Server conf:

rss.rpc.server.port 20000
rss.jetty.http.port 20001
rss.storage.basePath /app/rss-0.7.1/data
rss.storage.type MEMORY_LOCALFILE_HDFS
rss.coordinator.quorum 172.100.3.70:19999,172.100.3.71:19999,172.100.3.72:19999
rss.server.disk.capacity 50g
rss.server.flush.thread.alive 30
rss.server.flush.threadPool.size 10
rss.server.buffer.capacity 40g
rss.server.read.buffer.capacity 20g
rss.server.heartbeat.interval 10000
rss.rpc.message.max.size 1073741824
rss.server.preAllocation.expired 120000
rss.server.commit.timeout 600000
rss.server.app.expired.withoutHeartbeat 120000
rss.server.flush.cold.storage.threshold.size 512m

RSS client conf:

spark.shuffle.manager=org.apache.spark.shuffle.RssShuffleManager
spark.rss.coordinator.quorum=172.100.3.70:19999,172.100.3.71:19999,172.100.3.72:19999
spark.rss.storage.type=MEMORY_LOCALFILE_HDFS
spark.rss.remote.storage.path=hdfs://ns1/rss/sml

image

image

image

The executors show no stuck daemon threads and have no error logs.

image

image

Affects Version(s)

0.7.1

Uniffle Server Log Output

No response

Uniffle Engine Log Output

No response

Uniffle Server Configurations

No response

Uniffle Engine Configurations

No response

Additional context

No response

zuston commented 1 year ago

What do you want to report? Did the app fail?

smlHao commented 1 year ago

@zuston Hi, thanks!

What do you want to report?
1. When a huge table is joined with another huge table, the shuffle server has blocked threads. Is that expected?
2. The executors show no daemon threads waiting but appear to hang, seemingly while sending data to the Uniffle server. Does my configuration need adjusting?

The app failed?
After running for 2 hours it has not failed, but the executors appear to hang: the executor logs are no longer updated, and the driver log contains only Uniffle heartbeat entries.

Do you have any performance tuning advice for Spark SQL joins between huge tables?

zuston commented 1 year ago

No, but I have done tuning for huge partitions. First, we should find the root cause. Please tell me what happened to your app.

smlHao commented 1 year ago

No, but I have done tuning for huge partitions. First, we should find the root cause. Please tell me what happened to your app.

@zuston Thanks! Yes, you are right. Here is what I found:

The executor log has not been updated for a long time: image

Then I checked the executor stack and found no daemon threads in the WAITING state; it seems to be stuck sending data:

image

Then I analyzed the shuffle server stack and found threads in the BLOCKED state: image

My app's progress does not seem to change, but I can't find the root cause. Can you suggest some steps to help me?

Regarding tuning for huge partitions: do you have any documentation that could help me do this?
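Since the stack screenshots did not survive here, a text summary of the dump may be easier to share and compare. Below is a minimal sketch, assuming the shuffle server stack was captured with jstack into a text file. It counts threads per state and groups BLOCKED threads by the monitor they are waiting to lock. The function name summarize_dump is made up for illustration and is not part of Uniffle.

```python
import re
from collections import Counter

# Patterns follow the usual jstack text layout.
STATE_RE = re.compile(r"java\.lang\.Thread\.State: (\w+)")
LOCK_RE = re.compile(r"- waiting to lock <(0x[0-9a-fA-F]+)> \(a ([\w.$]+)\)")


def summarize_dump(dump_text):
    """Return (thread-state counts, counts of monitors that BLOCKED threads wait on)."""
    states = Counter()
    contended = Counter()
    # jstack separates thread entries with blank lines.
    for entry in re.split(r"\n\s*\n", dump_text):
        state = STATE_RE.search(entry)
        if not state:
            continue  # header lines or entries without a Java thread state
        states[state.group(1)] += 1
        if state.group(1) == "BLOCKED":
            lock = LOCK_RE.search(entry)
            if lock:
                # Key by monitor address plus class so hot locks stand out.
                contended[lock.group(1) + " " + lock.group(2)] += 1
    return states, contended
```

Calling summarize_dump on the contents of the dump file and printing contended.most_common(5) shows whether most BLOCKED threads queue on a single monitor, which is usually the first thing worth knowing before looking at individual stack frames.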

zuston commented 1 year ago

Can you check the shuffle server and executor GC? Why not use the Spark UI? Also, if you want to analyze this, it's better to show the metrics in a dashboard.

smlHao commented 1 year ago

Can you check the shuffle server and executor GC? Why not use the Spark UI? Also, if you want to analyze this, it's better to show the metrics in a dashboard.

  1. Checking the shuffle server and executor GC: the shuffle server shows no full GC, but jvm_memory_bytes_used is close to XMX_SIZE="60g": image
     Executor GC seems normal: image
  2. Using the Spark UI: the port is not exposed, and access requires a VPN.
  3. Showing metrics in a dashboard: some metrics are already displayed. So many blocked threads on the shuffle server does not feel normal, but I can't find the cause.
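A quick arithmetic check that may be worth doing next to the GC numbers: compare the two buffer capacities from the server conf at the top of the thread with the 60g heap. This is only a sketch. It assumes both budgets are served from the JVM heap, which this thread does not confirm, and it does not establish any link to the blocked threads. The helper names parse_size and buffer_share_of_heap are made up for illustration.

```python
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text):
    """Convert a size such as '40g' or '512m' to bytes; plain digits are bytes."""
    text = text.strip().lower()
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def buffer_share_of_heap(conf, heap):
    """Fraction of the heap claimed by the two buffer capacity settings.

    Assumption: both capacities count against the JVM heap.
    """
    budget = (parse_size(conf["rss.server.buffer.capacity"])
              + parse_size(conf["rss.server.read.buffer.capacity"]))
    return budget / parse_size(heap)


# Values copied from the server conf and XMX_SIZE posted in this thread.
conf = {
    "rss.server.buffer.capacity": "40g",
    "rss.server.read.buffer.capacity": "20g",
}
share = buffer_share_of_heap(conf, "60g")
print(f"buffer budgets claim {share:.0%} of the heap")
```

With the posted values, 40g plus 20g equals the full 60g heap. If the assumption holds, that would leave no headroom for anything else, which would be consistent with jvm_memory_bytes_used sitting close to the limit.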

image image image

@zuston How do you tune for huge partitions? Could you help me with this?