
parquet-tools merge extremely slow with block-option #2374

Open · asfimport opened this issue 5 years ago

asfimport commented 5 years ago

parquet-tools merge is extremely time- and memory-consuming when used with the block option.


The merge function builds a bigger file out of several smaller parquet files. Used without the block option, it just concatenates the files into a bigger one without building larger row groups, which does not help with query-performance issues. With the block option, parquet-tools builds bigger row groups, which improves query performance, but the merge process itself is extremely slow and memory-consuming.
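For reference, the merge subcommand is invoked roughly like this; the jar name, version, and file names below are placeholders, not taken from this thread:

```
java -jar parquet-tools-<version>.jar merge part-00000.parquet part-00001.parquet merged.parquet
```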


Consider a case in which you have many small parquet files, e.g. 1000 files of 100 KB each. Merging them into one file fails on my machine because even 20 GB of memory is not enough for the process (although the total amount of data, as well as the resulting file, should be smaller than 100 MB).


A different situation: consider having 100 files of 1 MB each. Merging them is possible with 20 GB of RAM, but it takes almost half an hour, which is too much for many use cases.


Is there any way to speed up the merge and reduce its memory requirements?

Reporter: Alexander Gunkel

Note: This issue was originally created as PARQUET-1670. Please see the migration documentation for further details.

asfimport commented 5 years ago

Gabor Szadovszky / @gszadovszky: This is a tough problem. You are right that concatenating the row groups as they are does not solve the issue. On the other hand, re-building the row groups (at least in the naive way) requires reading back all the values and re-encoding them, which takes time. You may come up with smarter solutions (currently not implemented), like writing the pages without decoding them, but then you have to handle dictionaries, which is really problematic. (I cannot see any smart solution for dictionaries.)

Long story short, we do not have a fast tool that merges parquet files into one correctly. I think the best thing you can do is to use an existing engine (Spark, Hive, etc.) and re-build the whole table or the last partitions of the table, as sketched below.
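For illustration, a minimal Java Spark sketch of that rebuild approach might look like the following. The paths, the application name, and the choice of coalesce(1) are assumptions for this example, not something prescribed by parquet-java:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class RebuildSmallFiles {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("rebuild-small-parquet-files")
        .getOrCreate();

    // Read all small files under the input directory; Spark treats them
    // as one logical dataset regardless of how many files there are.
    Dataset<Row> df = spark.read().parquet("hdfs:///data/table/small-files/");

    // coalesce(1) funnels every row through a single writer task, so the
    // output is one file with large row groups instead of many tiny ones.
    // For bigger tables, repartition(n) with a small n scales better.
    df.coalesce(1)
        .write()
        .mode(SaveMode.Overwrite)
        .parquet("hdfs:///data/table/merged/");

    spark.stop();
  }
}
```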

asfimport commented 5 years ago

Stefan Becker: @gszadovszky: thanks for your quick reply!

I am a colleague of Alex and wanted to share my thoughts as well :)

We have the following setup:

asfimport commented 5 years ago

Gabor Szadovszky / @gszadovszky: Based on the code and the help message of the merge command, there is no -b option. I don't know why it does not complain about it.

The current implementation concatenates the row groups of the small files, so the resulting file in your case will contain many small row groups. It does not solve the issue, because the real problem with many small files is the many small row groups. In my opinion, the merge command in its current shape is useless. That's why it prints the following message:

"The command doesn't merge row groups, just places one after the other. When used to merge many small files, the resulting file will still contain small row groups, which usually leads to bad query performance."

The parquet-mr library processes the data sequentially, so 100% load on one core seems fine. I don't know why the memory consumption reaches 20 GB, but a JVM will typically not run a full garbage collection until it approaches the maximum available heap. So I guess the 20 GB is mostly unused objects that would be collected if required. I also don't know why it is that slow, but it does not really matter, as the result is not very useful anyway.
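A simple way to check whether the 20 GB is genuinely needed, or is just deferred collection, would be to re-run the same merge with a capped heap; the heap size, jar name, and file names below are placeholders:

```
java -Xmx2g -jar parquet-tools-<version>.jar merge part-00000.parquet part-00001.parquet merged.parquet
```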

Unfortunately, we don't have a properly working tool that could solve your problem. My only idea is to read all the data back from the many files row by row and write the rows to one file, as sketched below.
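A minimal sketch of that row-by-row approach, using the parquet-avro bindings, could look like the following. It assumes all input files share the same Avro-compatible schema; the class name and helper method are illustrative, not an existing tool:

```java
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;

public class RowByRowMerge {

  /** Reads every record from the inputs and re-writes them into one file. */
  public static void merge(List<Path> inputs, Path output) throws Exception {
    ParquetWriter<GenericRecord> writer = null;
    try {
      for (Path input : inputs) {
        try (ParquetReader<GenericRecord> reader =
            AvroParquetReader.<GenericRecord>builder(input).build()) {
          GenericRecord record;
          while ((record = reader.read()) != null) {
            if (writer == null) {
              // Create the writer lazily, using the schema of the first record.
              Schema schema = record.getSchema();
              writer = AvroParquetWriter.<GenericRecord>builder(output)
                  .withSchema(schema)
                  .build();
            }
            writer.write(record);
          }
        }
      }
    } finally {
      if (writer != null) {
        writer.close();
      }
    }
  }
}
```

Because the writer buffers a whole row group in memory before flushing, the output gets properly sized row groups, at the cost of fully decoding and re-encoding every value.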

asfimport commented 5 years ago

Alexander Gunkel: @gszadovszky, there was a -b option until commit ab42fe5180366120336fb3f8b9e6540aadb5da1b (originally introduced in commit 863a081850e56bbbb38d7b68b478a3bd40779723) ;)

asfimport commented 5 years ago

Gabor Szadovszky / @gszadovszky: OK, I forgot about it. I was the one who reverted that feature. It tried to do some more advanced merging, but the concept was not correct. I would not suggest using it.