Open zuston opened 1 year ago
I encounter this problem in our online jobs, but I don't dig into it. Anyway, from my side, this feature is an improvement.
cc @xianjingfeng @advancedxy @jerqi
You mean map side combine?
Well, last time I checked, uniffle don't do any combine, just create a combined value for each record.
If we are going to support this feature, maybe a lot of client side code should be rewrote...😇
You mean map side combine?
Yes
If we are going to support this feature, maybe a lot of client side code should be rewrote...😇
Yes. 🙆
we also have this demand. I met in hudi table which will use countByKey
to evaluation work load.
any plan to support it?
any plan to support it?
haven't. If you want this, feel free to discuss more
how about use ExternalAppendOnlyMap
to archive it, but it's may spill to disk, we can add a param to controll it. And I think if map side can be combine, it will reduce shuffle data and reduce data skew problems
ExternalAppendOnlyMap is used to handle huge data, I think it is not necessary. We can sort and combine in memory, then serialize the records. I did it in #1239 (developing) by this way. But I bind only the use of the remote merge feature.
I afraid that the proportion of being combined may be not high, since we use pipelined shuffle in rss, only partial record will be combined.
The client and server side merge are all needed.
I think the merging should work on the client side, not on the RSS server. It maximizes the utilization of executor resources, rather than placing the computation burden on the RSS server, whose resources are shared and need to ensure stability and the normal operation of core services.
And furthermore, I think the RSS server is best left to handle data transmission efficiently without any content processing. It should return whatever it receives from user tasks. All data content processing is preferably performed by the executors, allowing the RSS server to focus solely on efficient shuffle transmission. What do you think?
@KnightChess
If you think RSS sever should not handle any content, you can save the shuffle date to 3rd remote filesystem, and read the shuffle data from this filesystem.
What is your purpose of using remote shuffle service?
For me, the main purpose for remote merge in shuffle side is that application will not use any local file as shuffle storage, then we can realize the separation of storage and computing, then support hybrid deployment.
Then Remote merge is optional function. And If we merge on server side, we can reduce the resource usage of executor, then add the resource to shuffle server. That's not a problem.
I think the merging should work on the client side, not on the RSS server.
I think the zheng's design motivation is not same with you @KnightChess , if you want this, client merge design is OK for me, that is not conflict with remote merge. Right @zhengchenyu
Well, yes. I'm just offering a suggestion, sort in memory. There is no conflict!
Code of Conduct
Search before asking
Describe the feature
In spark client, currently uniffle don't support merge for some ops supporting combine to reduce data size. Due to this, in some cases, it will cause unnecessary network and performance regression.
Motivation
No response
Describe the solution
No response
Additional context
No response
Are you willing to submit PR?