apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

[WIP][Proposal] PARQUET-2430: Add parquet joiner v2 #1335

Open MaxNevermind opened 2 months ago

MaxNevermind commented 2 months ago

This PR is a proposal and Work In Progress.

This is a simplified version of the original PR: [WIP][Proposal] PARQUET-2430: Add parquet joiner

The simplified design:

MaxNevermind commented 2 months ago

@wgtmac @ConeyLiu

This PR is the outcome of the simplification I mentioned in a comment a couple of weeks ago: https://github.com/apache/parquet-mr/pull/1273#issuecomment-2053772198 I've limited the set of capabilities; see this PR's description. I've tried different ideas, and they all turned out to be too complex to implement, so I decided to finalize at least something with as simple an implementation as possible. The PR is not yet polished; I just wanted to give a quick overview of the new approach. If it looks good, I will polish it.

wgtmac commented 2 months ago

Thanks for your effort! I just took a quick glimpse and it does look simpler than the previous patch.

My general question is this: the prerequisite for using the joiner is now to run a pre-processing step like the one you've mentioned, https://gist.github.com/MaxNevermind/0feaaf380520ca34c2637027ef349a7d. That pre-processing also takes time and resources. Does it mean that we have to deal with unaligned blocks anyway if users do not want to pay for the pre-processing task?

MaxNevermind commented 2 months ago

Does it mean that we have to deal with unaligned blocks anyway if users do not want to pay for the pre-processing task?

Yes, this new implementation requires blocks to be aligned. By the way, the gist snippet that prepares the data needs to be updated; that one is for the previous version.

I think this version strikes a good balance between features and implementation complexity. Including all the features of the previous version leads to a very complex implementation, which IMO may not be worth pursuing, since it is mainly an optimization for a pretty niche use case. For regular use cases you can just read and write the whole thing. Given that this is a niche use case, it is reasonable to expect users to go the extra mile, run that snippet, and prepare the data in the required way.

wgtmac commented 2 months ago

Yes, I agree that we can start from an implementation that assumes the row groups of the files are aligned. One thing I'm not sure about is that some users may not find it easy to generate files with the same row group alignment, and they will require the rewrite tool to handle this anyway.
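
For illustration, the alignment assumption discussed in this thread can be sketched as a simple pre-check. This is only a minimal sketch, not code from the PR: it assumes "aligned" means every input file has the same number of row groups and that corresponding row groups hold the same number of rows. The class `RowGroupAlignment` and the method `isAligned` are hypothetical names; in parquet-java the per-row-group counts would typically be read from each file's footer metadata (for example via `BlockMetaData.getRowCount()`), which is left out here so the example stays dependency-free.

```java
import java.util.Arrays;
import java.util.List;

public class RowGroupAlignment {

    /**
     * Hypothetical helper. Each inner list holds the row count of every
     * row group in one input file, in file order. The files count as
     * "aligned" (an assumption of this sketch) when all lists are equal,
     * i.e. same number of row groups and same row count at each position.
     */
    public static boolean isAligned(List<List<Long>> rowCountsPerFile) {
        if (rowCountsPerFile.isEmpty()) {
            return true; // nothing to compare
        }
        List<Long> reference = rowCountsPerFile.get(0);
        for (List<Long> counts : rowCountsPerFile) {
            // List.equals compares size and elements pairwise
            if (!reference.equals(counts)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Long> left = Arrays.asList(100L, 100L, 50L);
        List<Long> rightAligned = Arrays.asList(100L, 100L, 50L);
        List<Long> rightUnaligned = Arrays.asList(150L, 100L);

        // Same row-group layout in both files
        System.out.println(isAligned(Arrays.asList(left, rightAligned)));
        // Same total row count (250) but different row-group boundaries
        System.out.println(isAligned(Arrays.asList(left, rightUnaligned)));
    }
}
```

The second case in `main` shows why a matching total row count is not enough under this assumption: both files hold 250 rows, but their row-group boundaries differ, which is the situation the pre-processing step is meant to eliminate.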