apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Vectorized API to support Column Index in Apache Spark #2476

Open asfimport opened 4 years ago

asfimport commented 4 years ago

As per the comment on https://issues.apache.org/jira/browse/SPARK-26345, it seems that Apache Spark doesn't support Column Index unless we disable the vectorized reader in Spark, which will have other performance implications. As per @zivanfi, parquet-mr should implement a Vectorized API. Is it already implemented, or is there any pull request for it?

Reporter: Felix Kizhakkel Jose / @FelixKJose

Note: This issue was originally created as PARQUET-1830. Please see the migration documentation for further details.

asfimport commented 4 years ago

Gabor Szadovszky / @gszadovszky: @FelixKJose, having a vectorized API in parquet-mr has only been a topic in some of our discussions. No effort has been made to design or implement it. It is unfortunate that both Spark and Hive implemented their own vectorization on top of parquet-mr internal APIs (e.g. reading pages directly) instead of relying on something common in parquet-mr. To design and implement such an API properly, we need design input from our users.

However, to support column indexes in Spark we might have some other approaches:

asfimport commented 4 years ago

Felix Kizhakkel Jose / @FelixKJose: Thank you @gszadovszky

IMHO, option 1 is preferable as a short-term workaround. It could benefit a lot of people through the large performance improvement that the Offset and Column Indexes bring.

But for the long term, in addition to option 2, we should make it a Vectorized API and coordinate with the Spark team to integrate the work. We should also avoid spilling logic outside of the core library (parquet) where possible, or at least keep it minimal.

asfimport commented 4 years ago

Gabor Szadovszky / @gszadovszky: @FelixKJose, agreed. So this jira is to track the long-term effort of having a vectorized API in parquet-mr, so that our clients don't have to use our internal API to get fast reading while still having our predicate pushdown (ppd) filtering (including column indexes and bloom filters) executed automatically under the hood.
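
To make the idea concrete, here is a minimal sketch of what "batched reading with column-index filtering under the hood" could look like. It is not parquet-mr code and does not reflect any existing or planned API: `Page`, `read_batches`, and their parameters are hypothetical names chosen for illustration, and the per-page `min_value`/`max_value` fields merely stand in for the statistics a column index stores.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Page:
    """Hypothetical stand-in for one data page of an integer column.

    min_value / max_value play the role of the per-page statistics
    that a column index keeps, so a page can be judged without decoding it.
    """
    min_value: int
    max_value: int
    values: List[int]


def read_batches(pages: List[Page], lower: int, upper: int,
                 batch_size: int) -> Iterator[List[int]]:
    """Yield batches of values v with lower <= v <= upper.

    Pages whose [min_value, max_value] range cannot overlap [lower, upper]
    are skipped using statistics only; the caller never sees pages at all.
    """
    batch: List[int] = []
    for page in pages:
        if page.max_value < lower or page.min_value > upper:
            continue  # pruned via statistics, values are never touched
        for v in page.values:
            if lower <= v <= upper:
                batch.append(v)
                if len(batch) == batch_size:
                    yield batch
                    batch = []
    if batch:
        yield batch  # final, possibly smaller batch
```

The point of the sketch is only the shape of the contract: the client asks for batches and supplies a predicate, and page skipping happens inside the reader rather than in client code built on internal page-level APIs.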

asfimport commented 4 years ago

Felix Kizhakkel Jose / @FelixKJose: @gszadovszky This Jira is for the long-term solution. But do we have any Jira for a short-term solution, since that could benefit many who are using Parquet + Spark?

asfimport commented 4 years ago

Gabor Szadovszky / @gszadovszky: @FelixKJose, you said you would prefer option 1. That one would be a Spark-only change.

asfimport commented 4 years ago

Felix Kizhakkel Jose / @FelixKJose: Yes, option 1 for the short term. But this Jira is for the long-term solution, so I want it to stay open until we have a Vectorized API. I will update the Spark Jira SPARK-26345 for the short-term solution.

asfimport commented 4 years ago

Gabor Szadovszky / @gszadovszky: Agreed. That's what I wanted to say some comments ago. :)

asfimport commented 4 years ago

Felix Kizhakkel Jose / @FelixKJose: (y)

asfimport commented 4 years ago

Xinli Shang / @shangxinli: @FelixKJose Do we have a Spark task created for implementing the short-term solution?

asfimport commented 4 years ago

Felix Kizhakkel Jose / @FelixKJose: https://issues.apache.org/jira/browse/SPARK-26345. But no one has picked up that Jira yet.