Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0

Excessive number of block reader threads in Alluxio worker. #17551

Open humengyu2012 opened 1 year ago

humengyu2012 commented 1 year ago

Alluxio Version:

2.9.3

Describe the bug

Our Alluxio cluster is not in the same data center as HDFS. When the dedicated inter-data-center bandwidth is fully saturated, the number of block reader threads on the workers increases significantly. Moreover, when the inter-data-center bandwidth recovers to normal, the Alluxio worker does not recover automatically, which affects clients and prevents them from reading data. The only way we have found to restore the system is to restart the worker. We have an Alluxio cluster that has not experienced similar issues since its dedicated inter-data-center bandwidth was expanded.

[screenshot: block reader thread count on the worker, peaking around 100 threads]

Expected behavior

When the dedicated inter-data-center bandwidth recovers to normal, the Alluxio worker should also recover automatically.

Urgency

Not very urgent.

uniqueZt commented 1 year ago

I also hit the same problem. I suspect there may be a thread leak.

jiacheliu3 commented 1 year ago

@fuzhengjia do you know who this should be assigned to? Tks

jiacheliu3 commented 1 year ago

When the network is slow, new read requests make the thread pool create more threads to handle them, so I think the first part of the behavior is expected. Your screenshot shows ~100 reader threads at the peak, which is entirely normal. https://github.com/Alluxio/alluxio/blob/06ffdd5ebfc14b087bf367042a9f85e4de9a3033/core/server/worker/src/main/java/alluxio/worker/grpc/GrpcExecutors.java#L63

What do you mean by "Moreover, when the inter-data center bandwidth recovers to normal, the Alluxio worker will not automatically recover, which will affect clients and prevent them from reading data"? If the request flow quiets down, threads are released from the thread pool after being idle for more than 10 seconds (`THREAD_STOP_MS = Constants.SECOND_MS * 10`). But note that new requests keep coming in, so the thread count should go down slowly, because a thread must be idle (not serving a single request) for more than 10 seconds before it is released. I don't understand how that prevents clients from reading data. Do you have a stack trace, or a client/worker jstack?
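To make the grow-then-shrink behavior concrete, here is a minimal, self-contained sketch of a dynamically sized pool with a 10-second keep-alive. This is not the actual GrpcExecutors code, and the pool parameters are only stand-ins for the worker defaults:

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ReaderPoolSketch {
  public static void main(String[] args) throws InterruptedException {
    // Hypothetical pool: when every existing thread is busy, a new thread is
    // created for each incoming request, up to the maximum.
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        0,                    // core threads
        2048,                 // max threads (stand-in for the default cap)
        10, TimeUnit.SECONDS, // idle keep-alive before a thread is released
        new SynchronousQueue<>());

    // Simulate slow cross-DC reads: each task occupies its thread for a
    // while, so a burst of requests drives the pool size up.
    for (int i = 0; i < 100; i++) {
      pool.execute(() -> {
        try {
          Thread.sleep(5_000); // stand-in for one slow read
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
    }
    Thread.sleep(1_000);
    System.out.println("pool size under load: " + pool.getPoolSize());

    // Once traffic stops and threads have been idle past the keep-alive,
    // the pool shrinks back on its own, but only if no new request lands
    // on a thread within that window.
    Thread.sleep(20_000);
    System.out.println("pool size after idle: " + pool.getPoolSize());
    pool.shutdownNow();
  }
}
```

If requests keep trickling in, they reset the idle clock on the threads they land on, which is why the count drains slowly rather than dropping all at once.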

By default, the block reader thread pool can have up to 2048 threads. Did you observe more than 100 threads in the pool?
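For reference, that cap is configurable. Assuming the 2.x property name below is right for your build (please double-check it against the 2.9.x configuration reference), it would be tuned in alluxio-site.properties like this:

```properties
# Assumed property name; verify against the 2.9.x configuration reference.
alluxio.worker.network.block.reader.threads.max=2048
```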

fyi @dbw9580 for visibility

dbw9580 commented 1 year ago

Can you take a jstack dump and see if there are any block reader threads in a BLOCKED state that cannot make any progress, even though there are no outstanding client requests?
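As a complement to jstack, the same check can be scripted with ThreadMXBean. This is only a rough sketch: it has to run inside the worker JVM (or against it over remote JMX), and the thread-name filter is a guess that should be adjusted to whatever prefix the block reader threads actually use in your dump:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class BlockedReaderCheck {
  public static void main(String[] args) {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    // Dump all threads with lock and synchronizer info, similar to jstack.
    for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
      // The name filter is a placeholder; match the actual block reader
      // thread name prefix seen in your jstack output.
      boolean looksLikeReader = info.getThreadName().toLowerCase().contains("reader");
      if (looksLikeReader && info.getThreadState() == Thread.State.BLOCKED) {
        System.out.println(info.getThreadName()
            + " is BLOCKED on " + info.getLockName()
            + " held by " + info.getLockOwnerName());
      }
    }
  }
}
```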

humengyu2012 commented 1 year ago

This is a very strange problem, and I can hardly reproduce it now that the network bandwidth is sufficient. If the same issue occurs again, I will post the worker's jstack output here. However, we may never be able to reproduce this issue again, because our dedicated inter-data-center network has been expanded and should no longer become saturated.

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.