apache / incubator-uniffle

Uniffle is a high performance, general purpose Remote Shuffle Service.
https://uniffle.apache.org/
Apache License 2.0
387 stars 149 forks source link

[MINOR] improvement(server,coordinator): Support config grpc queue size #2175

Closed maobaolong closed 1 month ago

maobaolong commented 1 month ago

What changes were proposed in this pull request?

Support config grpc thread pool waiting queue size

Why are the changes needed?

Without this, the server could be rush full of rpc call and cost all memory.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs.

github-actions[bot] commented 1 month ago

Test Results

 2 926 files  ±0   2 926 suites  ±0   5h 58m 7s :stopwatch: - 5m 17s  1 041 tests ±0   1 039 :white_check_mark: ±0   2 :zzz: ±0  0 :x: ±0  13 003 runs  ±0  12 973 :white_check_mark: ±0  30 :zzz: ±0  0 :x: ±0 

Results for commit 2ab1019a. ± Comparison against base commit 78fe934e.

:recycle: This comment has been updated with latest results.

maobaolong commented 1 month ago

@jerqi Would you like to take a look at this PR? Thanks!

jerqi commented 1 month ago

Could you elaborate more about why we need this?

maobaolong commented 1 month ago
        new GrpcThreadPoolExecutor(
            rpcExecutorSize,
            rpcExecutorSize * 2,
            10,
            TimeUnit.MINUTES,
            Queues.newLinkedBlockingQueue(Integer.MAX_VALUE),
            ThreadUtils.getThreadFactory("Grpc"),
            grpcMetrics);

@jerqi Reference this code, the original author set the different corePoolSize and maxPoolSize, it means that the author expected the working thread could be grow up to 2 time of corePoolSize, but he/she use a unlimited size queue here, so the working thread number could never grow up unless the waiting queue full, it could never be able to full for this unlimited queue.

So I supply a way to customize the queue size for cluster maintainer.

jerqi commented 1 month ago
        new GrpcThreadPoolExecutor(
            rpcExecutorSize,
            rpcExecutorSize * 2,
            10,
            TimeUnit.MINUTES,
            Queues.newLinkedBlockingQueue(Integer.MAX_VALUE),
            ThreadUtils.getThreadFactory("Grpc"),
            grpcMetrics);

@jerqi Reference this code, the original author set the different corePoolSize and maxPoolSize, it means that the author expected the working thread could be grow up to 2 time of corePoolSize, but he/she use a unlimited size queue here, so the working thread number could never grow up unless the waiting queue full, it could never be able to full for this unlimited queue.

So I supply a way to customize the queue size for cluster maintainer.

Should we set a proper queue size here? Is MAX_VAUE a proper default value?

maobaolong commented 1 month ago

@jerqi Reference to others

I have no idea to choose one of them

  1. Default to MAX Integer, so it can have not effect to existing code, but really, it could make server crash for huge current scenario, but we can set it to a proper value.
  2. Default to max pool size which default to 2 * rpcExecutorSize = 2000, base on this default config, server could accept and hold 4000 rpc call, which could make reject rpc call, so we need to enlarge the rpcExecutorSize default from 1000 to 5000, so that we can accept and hold 10000 rpc call, it could be a little better.
  3. Default to a constant value, let's say 10000, it could be enough for these few year.

Think twice, it really hard to choose, but i could choose choice 1 if I have to choose one.

How about your idea?

jerqi commented 1 month ago

@jerqi Reference to others

  • Hadoop "ipc.server.handler.queue.size" default to 100, it is too small for nowaday.
  • Alluxio default to max pool size, which default to 500.

I have no idea to choose one of them

  1. Default to MAX Integer, so it can have not effect to existing code, but really, it could make server crash for huge current scenario, but we can set it to a proper value.
  2. Default to max pool size which default to 2 * rpcExecutorSize = 2000, base on this default config, server could accept and hold 4000 rpc call, which could make reject rpc call, so we need to enlarge the rpcExecutorSize default from 1000 to 5000, so that we can accept and hold 10000 rpc call, it could be a little better.
  3. Default to a constant value, let's say 10000, it could be enough for these few year.

Think twice, it really hard to choose, but i could choose choice 1 if I have to choose one.

How about your idea?

Ok for me if we haven't better solution.

maobaolong commented 1 month ago

@jerqi OK, thanks for your reply.