fhalde opened this issue 1 year ago · Status: Open
`rss.coordinator.shuffle.nodes.max` is the maximum number of shuffle nodes that we assign to one shuffle. For example, say we have two shuffles, one with 5 partitions and another with 200 partitions, and `rss.coordinator.shuffle.nodes.max` is 10. We will assign 5 shuffle nodes to the first shuffle and 10 shuffle nodes to the second shuffle.

`rss.server.buffer.capacity` is a global buffer. We should set it according to the number of partitions on the shuffle server: we should guarantee that buffer.capacity / the number of partitions is larger than 2MB, otherwise it will cause too much random IO.
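To make the assignment example concrete, here is a minimal coordinator sketch. The key is the one discussed above; the value and the key-value file layout are illustrative only and should be checked against the Uniffle docs for your version.

```
# coordinator.conf (illustrative)
# Cap on how many shuffle servers a single shuffle can be assigned:
# with the value below, a shuffle with 5 partitions gets 5 servers,
# while a shuffle with 200 partitions is capped at 10 servers.
rss.coordinator.shuffle.nodes.max 10
```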
thanks so much! Also, in `rss-env.sh` there's a reference to `XMX_SIZE`. Should the buffer sizes defined influence the JVM size of the shuffle server, or does allocation happen off heap?
> We will assign 5 shuffle nodes to the first shuffle. We will assign another 10 shuffle nodes to the second shuffle.

oh I see, is this to reduce the spread of the data? Does the RSS server merge partitions coming from different mappers?

> We should set it according to the number of partitions on the shuffle server.

Is this configurable? Is it the same as `rss.coordinator.shuffle.nodes.max`?

The assignment process warrants better documentation. Let me first try to understand it by myself.
> We will assign 5 shuffle nodes to the first shuffle. We will assign another 10 shuffle nodes to the second shuffle.
>
> oh I see, is this to reduce the spread of the data? Does the RSS server merge partitions coming from different mappers?

No, we don't merge reduce1 and reduce2. Although they can be sent to the same shuffle node, we still use different memory buffers to store them for now.

> We should set it according to the number of partitions on the shuffle server.
>
> Is this configurable? Is it the same as `rss.coordinator.shuffle.nodes.max`?

There is no config option. It's just a rule of thumb from my experience for now.
ok. What do you mean by

> We should set it according to the number of partitions on the shuffle server.

Also, any input on https://github.com/apache/incubator-uniffle/issues/955#issuecomment-1593233428 ?
> ok. What do you mean by "We should set it according to the number of partitions on the shuffle server"?

Actually, we just use an 80G heap to run the shuffle server in our production environment. Our machines have 150G of memory. That is enough for several TBs of shuffle with 1000 executors using 9 shuffle servers. We estimate the max task concurrency in our production environment; max task concurrency means the maximum number of partitions that we need to hold in the memory of a shuffle server.
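Combining this with the 2MB-per-partition rule earlier in the thread, the sizing reasoning might be sketched as follows. The partition count is an assumed example, not a figure from the thread, and the exact value syntax should be checked against the docs.

```
# shuffle server conf (illustrative)
# Assume, for example, that one shuffle server must hold ~20000 partitions
# in memory at peak ("max task concurrency" in the comment above).
# The earlier rule of thumb says buffer.capacity / partitions should stay
# above 2MB, so: 20000 partitions * 2MB = 40GB of write buffer at minimum,
# and that buffer has to fit inside the JVM heap (80G in the example above).
rss.server.buffer.capacity 40g
```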
> thanks so much! Also, in `rss-env.sh` there's a reference to `XMX_SIZE`. Should the buffer sizes defined influence the JVM size of the shuffle server, or does allocation happen off heap?

Yes, the buffer size should be 0.7 of `XMX_SIZE`.

> It's enough for several TBs of shuffle with 1000 executors using 9 shuffle servers.

very impressive. How much storage (disk) do you typically attach per shuffle server? We have some jobs that shuffle almost 150TB of data. One could argue that the job needs to be re-written, but as a platform we mostly have no control over when the job gets fixed to reduce the shuffle.
> Yes, the buffer size should be 0.7 of `XMX_SIZE`.

ok, so can I set both `rss.server.buffer.capacity` and `rss.server.read.buffer.capacity` to 0.7 of the Xmx size, or should they be split into 0.35 each? Is `rss.server.buffer.capacity` for writes and `rss.server.read.buffer.capacity` for reads?
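For reference, here is how the numbers mentioned so far fit together. The 0.7 figure comes from the answer above; whether it refers to `rss.server.buffer.capacity` alone or to the sum of the write and read buffers is exactly the open question here and is not settled in this thread. The `rss-env.sh` syntax follows the Uniffle deployment docs as far as I know; treat it as illustrative.

```
# rss-env.sh -- the production example above: an 80G heap on a 150G machine
XMX_SIZE="80g"

# "buffer size should be 0.7 of XMX_SIZE"  =>  about 0.7 * 80G = 56G for buffers.
# How that 56G should be divided between rss.server.buffer.capacity and
# rss.server.read.buffer.capacity is the question asked above and is left
# unanswered in this thread.
```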
> It's enough for several TBs of shuffle with 1000 executors using 9 shuffle servers.
>
> very impressive. How much storage (disk) do you typically attach per shuffle server? We have some jobs that shuffle almost 150TB of data. One could argue that the job needs to be re-written, but as a platform we mostly have no control over when the job gets fixed to reduce the shuffle.

For a 150TB shuffle, some config options are recommended (see the sketch after this list):

1. We would like to use MEMORY_LOCALFILE_HDFS, because HDFS has more IO resources and more disk space.
2. Use more shuffle servers, maybe more than 20.
3. Use `single.buffer.limit` and `rss.server.max.concurrency.of.per-partition.write`. When a reduce partition reaches a certain size, we flush it to HDFS, and `rss.server.max.concurrency.of.per-partition.write` will use multiple threads to write the HDFS data. This feature will have a better effect after https://github.com/apache/incubator-uniffle/pull/775.
4. We haven't supported S3 yet, because the community doesn't have enough people, although we have a similar plan. S3 is different from HDFS and will need more optimization.

This may be useful for you: https://github.com/apache/incubator-uniffle/blob/master/docs/benchmark.md
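A hedged sketch of what recommendations 1-3 could look like as configuration. The storage type and the per-partition write concurrency key are taken from the comment above; `single.buffer.limit` is the commenter's shorthand, so the exact flush-threshold key, as well as all values, should be checked against the docs for your version.

```
# coordinator.conf (illustrative) -- allow a single shuffle to spread over 20+ servers
rss.coordinator.shuffle.nodes.max 20

# shuffle server conf (illustrative)
# 1. Spill to HDFS in addition to memory and local disk for more IO and capacity.
rss.storage.type MEMORY_LOCALFILE_HDFS

# 3. Write flushed HDFS data for a single partition with multiple threads.
#    (The matching single-buffer flush threshold is what "single.buffer.limit"
#    refers to above; look up its exact key name in your version's docs.)
rss.server.max.concurrency.of.per-partition.write 4
```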
> we have some jobs that shuffle almost 150TB of data.

Do you mean that 150TB is the total shuffle data size for the stage, or that it is per partition?
If it's the former, then the disk capacity a single shuffle server can tolerate * the number of shuffle servers must be able to hold the 150TB (see the sketch below).
If it's the latter, jerqi's suggestion is useful, since it is about holding a huge amount of data per partition, and that issue has been solved in https://github.com/apache/incubator-uniffle/issues/378.

> One could argue that the job needs to be re-written but as a platform we mostly have no control over when the job gets fixed to reduce the shuffle

+1. Feel the same.
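As a rough illustration of that capacity check: the server count and per-server disk size below are assumptions made for the sake of arithmetic, not figures from the thread, and the key and value syntax should be verified against the docs.

```
# shuffle server conf (illustrative)
# A total stage shuffle of ~150TB must fit on the fleet's local disks:
#   per-server shuffle disk capacity * number of servers >= 150TB
#   e.g. 8TB per server * 20 servers = 160TB
# (plus HDFS headroom if MEMORY_LOCALFILE_HDFS is used, as suggested above)
rss.server.disk.capacity 8192g
```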
Which parts of the documentation do you think need improvement?

The following configurations are a bit confusing.

Affects Version(s): master