apache / incubator-uniffle

Uniffle is a high performance, general purpose Remote Shuffle Service.
https://uniffle.apache.org/
Apache License 2.0

[DOCS] Improving the description of several parameters #955

Open fhalde opened 1 year ago

fhalde commented 1 year ago

Which parts of the documentation do you think need improvement?

The following configurations are a bit confusing:

rss.coordinator.shuffle.nodes.max: is the assignment per shuffle partition or per shuffle writer? Why can there be multiple assignments in the first place?

rss.coordinator.remote.storage.select.strategy: what do IO_BALANCE and APP_BALANCE even mean?

rss.server.buffer.capacity: is this a shared global buffer, a per-app buffer, or a per-connection buffer? How should it be tuned? What should we look for?

rss.server.read.buffer.capacity: is this global or per shuffle reader? How should it be tuned? What should we look for?

rss.server.single.buffer.flush.enabled: what is a single buffer?

Affects Version(s)

master

Improving the documentation

No response

Anything else

No response

Are you willing to submit a PR?

jerqi commented 1 year ago
  1. rss.coordinator.shuffle.nodes.max is the maximum number of shuffle nodes we assign to one shuffle. For example, suppose we have two shuffles, one with 5 partitions and another with 200 partitions, and rss.coordinator.shuffle.nodes.max is 10. We will assign 5 shuffle nodes to the first shuffle and 10 shuffle nodes to the second.
  2. rss.server.buffer.capacity is a global buffer shared by all apps. We should set it according to the number of partitions on the shuffle server: guarantee that buffer.capacity / partition count is larger than 2MB, otherwise there will be too much random IO (see the sketch after this list).
  3. Every reduce partition is a single buffer. For example, map1 has data for reduce1, reduce2, and reduce3; map2 also has data for reduce1, reduce2, and reduce3. One shuffle node will collect the reduce1 data of map1 and the reduce1 data of map2; reduce1 is a reduce partition.
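
A minimal sketch of those two knobs in coordinator/server conf form; the values are illustrative assumptions, not tested recommendations:

```properties
# coordinator.conf: cap how many shuffle servers one shuffle is assigned.
# Per the example above, a shuffle with fewer partitions than the cap
# gets one node per partition; larger shuffles get the cap.
rss.coordinator.shuffle.nodes.max=10

# server.conf: global write buffer shared by all apps on this server.
# Rule of thumb: capacity / concurrently held partitions > 2MB, e.g.
# assuming ~10,000 concurrent partitions, 10,000 * 2MB = 20GB.
rss.server.buffer.capacity=20gb
```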
fhalde commented 1 year ago

Thanks so much. Also, in rss-env.sh there's a reference to XMX_SIZE. Should the buffer sizes we define influence the JVM size of the shuffle server, or does allocation happen off-heap?

fhalde commented 1 year ago

We will assign 5 shuffle nodes to the first shuffle and 10 shuffle nodes to the second.

Oh I see, is this to reduce the spread of the data? Does the RSS server merge partitions coming from different mappers?

We should set it according to the number of partitions on the shuffle server

Is this configurable? Is this the same as rss.coordinator.shuffle.nodes.max?

fhalde commented 1 year ago

The assignment process warrants better documentation. Let me first try to understand it myself.

jerqi commented 1 year ago

We will assign 5 shuffle nodes to the first shuffle and 10 shuffle nodes to the second.

Oh I see, is this to reduce the spread of the data? Does the RSS server merge partitions coming from different mappers?

No, we don't merge reduce1 and reduce2. Although they can be sent to the same shuffle node, we still use different memory buffers to store them.

We should set it according to the number of partitions on the shuffle server

Is this configurable? Is this the same as rss.coordinator.shuffle.nodes.max?

There's no config option; it's just an empirical formula of mine for now.

fhalde commented 1 year ago

OK, what do you mean by

We should set it according to the number of partitions on the shuffle server

fhalde commented 1 year ago

Also, any input for https://github.com/apache/incubator-uniffle/issues/955#issuecomment-1593233428?

jerqi commented 1 year ago

OK, what do you mean by

We should set it according to the number of partitions on the shuffle server

Actually, we just use an 80G heap to run the shuffle server in our production environment; our machines have 150G. Nine shuffle servers are enough for a several-TB shuffle with 1000 executors. We estimate the max task concurrency in our production environment; max task concurrency means the maximum number of partitions we need to hold in the memory of a shuffle server.
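
To make that estimate concrete, a back-of-the-envelope sketch with hypothetical numbers:

```properties
# 80g heap per server; if ~0.7 of it backs the write buffer (see the
# next comment), that is ~56g of buffer per server.
# With the 2MB-per-partition rule above: 56g / 2MB ≈ 28,000 partitions
# held in memory at once per server, or ~250,000 across 9 servers.
```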

jerqi commented 1 year ago

Thanks so much. Also, in rss-env.sh there's a reference to XMX_SIZE. Should the buffer sizes we define influence the JVM size of the shuffle server, or does allocation happen off-heap?

  1. Yes, the buffer size should be about 0.7 of XMX_SIZE (see the sketch below).
  2. We don't support off-heap memory management yet; we're still developing and tuning that feature.
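
In config terms, a sketch with a hypothetical heap size:

```properties
# rss-env.sh sets the server JVM heap, e.g. XMX_SIZE="80g";
# then size the buffer to roughly 0.7 of that heap:
rss.server.buffer.capacity=56gb
```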
fhalde commented 1 year ago

Nine shuffle servers are enough for a several-TB shuffle with 1000 executors.

Very impressive. How much storage (disk) do you typically attach per shuffle server? We have some jobs that shuffle almost 150TB of data. One could argue that those jobs need to be rewritten, but as a platform we mostly have no control over when a job gets fixed to reduce its shuffle.

fhalde commented 1 year ago

Yes, the buffer size should be about 0.7 of XMX_SIZE.

OK, so can I set both rss.server.buffer.capacity and rss.server.read.buffer.capacity to 0.7 of the XMX size, or should they be split into 0.35 each?

Is rss.server.buffer.capacity for writes and rss.server.read.buffer.capacity for reads?

jerqi commented 1 year ago

Yes, the buffer size should be about 0.7 of XMX_SIZE.

OK, so can I set both rss.server.buffer.capacity and rss.server.read.buffer.capacity to 0.7 of the XMX size, or should they be split into 0.35 each?

Is rss.server.buffer.capacity for writes and rss.server.read.buffer.capacity for reads?

  1. rss.server.buffer.capacity should be 0.7, rss.server.read.buffer.capacity should be 0.2, and the remaining 0.1 is for the rest of the program (see the sketch below).
  2. Yes.
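
As a concrete sketch of that split, again assuming XMX_SIZE="80g" in rss-env.sh (the heap size is a hypothetical example):

```properties
# 0.7 of the 80g heap for the write buffer:
rss.server.buffer.capacity=56gb
# 0.2 of the heap for the read buffer:
rss.server.read.buffer.capacity=16gb
# The remaining ~0.1 (8g) is left for everything else in the server.
```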
jerqi commented 1 year ago

Nine shuffle servers are enough for a several-TB shuffle with 1000 executors.

Very impressive. How much storage (disk) do you typically attach per shuffle server? We have some jobs that shuffle almost 150TB of data. One could argue that those jobs need to be rewritten, but as a platform we mostly have no control over when a job gets fixed to reduce its shuffle.

For a 150TB shuffle, some config options are recommended:

  1. We would like to use MEMORY_LOCALFILE_HDFS, because HDFS has more IO resources and more disk space.
  2. Use more shuffle servers, maybe more than 20.
  3. Use the single-buffer flush limit and rss.server.max.concurrency.of.per-partition.write. When a reduce partition reaches a certain size, we flush it to HDFS, and rss.server.max.concurrency.of.per-partition.write uses multiple threads to write the HDFS data. This feature will work better after https://github.com/apache/incubator-uniffle/pull/775 (see the sketch after this list).
  4. We haven't supported S3 yet, because the community doesn't have enough people, although we have a similar plan. S3 is different from HDFS and will need more optimization.
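
Pulling those suggestions into one hedged sketch (values are illustrative; rss.storage.type is assumed to be the server-side storage selector, and the thread count is a guess):

```properties
# Spill to HDFS in addition to local disks, for more IO and capacity:
rss.storage.type=MEMORY_LOCALFILE_HDFS
# Flush a reduce partition's buffer on its own once it grows large:
rss.server.single.buffer.flush.enabled=true
# Write a large partition to HDFS with multiple threads:
rss.server.max.concurrency.of.per-partition.write=4
```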

jerqi commented 1 year ago

This may be useful for you. https://github.com/apache/incubator-uniffle/blob/master/docs/benchmark.md

zuston commented 1 year ago

We have some jobs that shuffle almost 150TB of data.

Do you mean that the 150TB is the total shuffle data size for the stage, or is it per partition?

If it's the former, then per-server disk capacity * the number of shuffle servers must be able to hold the 150TB.

And if it's the latter, jerqi's suggestions are useful for holding a huge per-partition size. That issue has been solved in https://github.com/apache/incubator-uniffle/issues/378.

One could argue that those jobs need to be rewritten, but as a platform we mostly have no control over when a job gets fixed to reduce its shuffle.

+1. I feel the same.