asfimport opened this issue 5 years ago
Gabor Szadovszky / @gszadovszky: This is a tough problem. You are right that concatenating the row-groups as they are does not help solve the issue. On the other hand, re-building the row groups (at least in the naive way) requires reading back all the values and re-encoding them, which takes time. One could come up with smarter solutions (currently not implemented), like writing the pages without decoding them, but then you have to handle dictionaries, which is really problematic. (I cannot see any smart solution for dictionaries.)
Long story short, we do not have any fast tool that merges parquet files into one correctly. I think the best you can do is to use an existing engine (Spark, Hive etc.) and re-build the whole table or the last partitions of the table.
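(For illustration only, a minimal sketch of that "rebuild with an engine" approach using Spark's Java API; the paths, the application name and the coalesce factor below are placeholders, not anything prescribed by parquet-mr:)

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class RebuildWithSpark {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("compact-small-parquet-files")
                .getOrCreate();

            // Read all the small files of the table (or of the affected partitions).
            Dataset<Row> df = spark.read().parquet("hdfs:///data/table/big_part*.parquet");

            // Re-write them as a few large files, which also yields large row groups.
            df.coalesce(1)
              .write()
              .mode(SaveMode.Overwrite)
              .parquet("hdfs:///data/table_compacted");

            spark.stop();
        }
    }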
Stefan Becker: @gszadovszky: thanks for your quick reply!
I am a colleague of Alex and wanted to share my thoughts as well :)
In our current setup we run the following command:
java -Xmx26G -jar /opt/apache/parquet-mr/parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar merge -b big_part*.parquet big_total.parquet
The results are:
- The command above produces one ~80 MB parquet file
So far so good. I understand what you say: the tool needs to read back all the data, decode it, re-encode it and write it back to disk. But still:
- We only write ~80 MB to disk
- What the heck is it really doing for 30 minutes? :)
From my experience those numbers do not add up at all. I do not know the solution, but I think something is broken somewhere that has nothing to do with how this is supposed to work. Either it uses some crazy algorithm where it tries to compress/merge things in a "multi step process" and goes off the rails there, or something else is going on that I cannot see.
Currently we are building our own system that just uses parquet files together with some other custom-built software. We do not use Spark or Hive or anything like that.
I don't get it ...
Gabor Szadovszky / @gszadovszky: Based on the code and the help message of the command merge, there is no option -b. I don't know why it does not complain about it.
The current implementation concatenates the row-groups of the small files, so the result file in your case will contain many small row-groups. It therefore does not solve the issue, since the real problem with many small files is precisely the many small row-groups. The merge command in its current shape is useless, in my opinion. That's why it prints the following message:
"The command doesn't merge row groups, just places one after the other. When used to merge many small files, the resulting file will still contain small row groups, which usually leads to bad query performance."
The parquet-mr library processes the data sequentially, so 100% on one core seems fine. I don't know why the memory consumption reaches 20GB, but a JVM typically does not run a full GC until it approaches the maximum available heap, so I guess the 20GB is mostly unused objects that would be garbage collected if required. I also don't know why it is that slow, but it does not really matter, as the result is not really useful anyway.
Unfortunately, we don't have a properly working tool that could solve your problem. My only idea is to read all the data back from the many files row-by-row and write them to one file.
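(For illustration, a rough sketch of that row-by-row approach with parquet-mr's example Group API. It assumes all input files share the same schema; the class name, the paths and the 128 MB row-group target are placeholders, and since every record is fully decoded and re-encoded it will not be fast either:)

    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.example.ExampleParquetWriter;
    import org.apache.parquet.hadoop.example.GroupReadSupport;
    import org.apache.parquet.hadoop.util.HadoopInputFile;
    import org.apache.parquet.schema.MessageType;

    public class RowByRowMerge {
        public static void merge(List<Path> inputs, Path output, Configuration conf) throws Exception {
            // Take the schema from the first input; all inputs are assumed to match it.
            MessageType schema;
            try (ParquetFileReader footerReader =
                     ParquetFileReader.open(HadoopInputFile.fromPath(inputs.get(0), conf))) {
                schema = footerReader.getFooter().getFileMetaData().getSchema();
            }

            try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(output)
                    .withConf(conf)
                    .withType(schema)
                    .withRowGroupSize(128 * 1024 * 1024) // aim for large row groups (~128 MB)
                    .build()) {
                for (Path input : inputs) {
                    try (ParquetReader<Group> reader =
                             ParquetReader.builder(new GroupReadSupport(), input).withConf(conf).build()) {
                        Group record;
                        while ((record = reader.read()) != null) {
                            writer.write(record); // decode + re-encode happens here
                        }
                    }
                }
            }
        }
    }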
Alexander Gunkel: @gszadovszky, there was an option -b until commit ab42fe5180366120336fb3f8b9e6540aadb5da1b (originally introduced in commit 863a081850e56bbbb38d7b68b478a3bd40779723) ;)
Gabor Szadovszky / @gszadovszky: OK, I forgot about it. I was the one who reverted this feature. It was trying to do some more advanced merging but the concept was not correct. I would not suggest using it.
parquet-tools merge is extremely time- and memory-consuming when used with the block option.
The merge function builds a bigger file out of several smaller parquet files. Used without the block option, it just concatenates the files into a bigger one without building larger row groups, which does not help with query performance issues. With the block option, parquet-tools builds bigger row groups, which improves query performance, but the merge process itself is extremely slow and memory-consuming.
Consider a case in which you have many small parquet files, e.g. 1000 files of 100kB each. Merging them into one file fails on my machine because even 20GB of memory is not enough for the process (although the total amount of data, as well as the resulting file, should be smaller than 100MB).
A different situation: consider having 100 files of 1MB each. Merging them is possible with 20GB of RAM, but it takes almost half an hour, which is too much for many use cases.
Is there any possibility to speed up the merge and reduce its memory requirements?
Reporter: Alexander Gunkel
Note: This issue was originally created as PARQUET-1670. Please see the migration documentation for further details.