Vectorized API to support Column Index in Apache Spark

asfimport commented 4 years ago

As per the comment on https://issues.apache.org/jira/browse/SPARK-26345, it seems that Apache Spark doesn't support column indexes unless we disable the vectorized reader in Spark, which will have other performance implications. As per @zivanfi, parquet-mr should implement a vectorized API. Is it already implemented, or is there a pull request for it?

Reporter: Felix Kizhakkel Jose / @FelixKJose

Note: This issue was originally created as PARQUET-1830. Please see the migration documentation for further details.

asfimport commented 4 years ago

Gabor Szadovszky / @gszadovszky: @FelixKJose, a vectorized API in parquet-mr has only come up as a topic in some of our discussions. No effort has been made to design or implement it. It is unfortunate that both Spark and Hive implemented their own vectorization on top of parquet-mr internal APIs (e.g. reading pages directly) instead of relying on something common in parquet-mr. To design and implement such an API properly, we need design input from our users.

However, to support column indexes in Spark we might have some other approaches:

1. As Spark already uses some internal API of parquet-mr, we can step forward and implement in Spark the page skipping mechanism that is implemented in parquet-mr.
   - pros: might be a quicker solution if the Spark community has resources to implement it
   - cons: duplicating code, increasing Parquet-related code outside of parquet-mr
2. Having a simpler (not vectorized) API in parquet-mr that puts an abstraction layer on top of pages (by reading the triplets of value, definition level and repetition level from a row group).
   - pros: cleaner API in parquet-mr, possibly cleaner code in Spark, hiding the page skipping mechanism introduced by column indexes
   - cons: the lower level API cannot be used anymore (e.g. Spark's own vectorized RLE decoder)
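
To make the page skipping mechanism discussed above concrete, here is a minimal toy sketch in Python. It is not parquet-mr or Spark code: `PageStats`, `pages_to_read` and `row_ranges` are hypothetical names, and the model assumes integer values and a simple range predicate. It only illustrates the idea that per-page min/max statistics (column index) select the candidate pages, and that per-page row positions (offset index) turn those pages into row ranges that can be used to skip pages in other columns.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PageStats:
    """Toy stand-in for one column-index entry plus its offset-index entry."""
    min_value: int   # smallest value stored in the page
    max_value: int   # largest value stored in the page
    first_row: int   # row index of the first row in the page
    row_count: int   # number of rows in the page


def pages_to_read(pages: List[PageStats], lo: int, hi: int) -> List[int]:
    """Return the indexes of pages whose [min, max] range overlaps [lo, hi]."""
    return [i for i, p in enumerate(pages)
            if p.max_value >= lo and p.min_value <= hi]


def row_ranges(pages: List[PageStats], selected: List[int]) -> List[Tuple[int, int]]:
    """Translate selected pages into inclusive (first_row, last_row) ranges."""
    return [(pages[i].first_row, pages[i].first_row + pages[i].row_count - 1)
            for i in selected]


# Three pages of 100 rows each, with values sorted across pages (an assumption
# that makes the statistics selective; unsorted data would skip fewer pages).
pages = [
    PageStats(min_value=0, max_value=99, first_row=0, row_count=100),
    PageStats(min_value=100, max_value=199, first_row=100, row_count=100),
    PageStats(min_value=200, max_value=299, first_row=200, row_count=100),
]

selected = pages_to_read(pages, lo=150, hi=210)
print(selected)                     # [1, 2]
print(row_ranges(pages, selected))  # [(100, 199), (200, 299)]
```

In this sketch the first page is never decoded, because its statistics alone show that no value in it can match the predicate.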

What do you think?

asfimport commented 4 years ago

Felix Kizhakkel Jose / @FelixKJose: Thank you @gszadovszky

IMHO, option 1 is preferable as a short-term workaround. It could benefit a lot of people through the great performance improvement brought by the offset and column indexes.

But for the long term, in addition to option 2, make it a vectorized API and coordinate with the Spark team to integrate the work. And avoid spilling logic outside of the core library (parquet) if possible, or at least keep it minimal.

asfimport commented 4 years ago

Gabor Szadovszky / @gszadovszky: @FelixKJose, agreed. So this jira is to track the long-term effort of having a vectorized API in parquet-mr, so that our clients don't have to use our internal API to get fast reading while still having our ppd (predicate pushdown) filtering (including column indexes and bloom filters) automatically executed under the hood.
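
As a rough illustration of what "filtering executed under the hood" could look like from a client's point of view, the following Python sketch models a batch-oriented reader. It is purely hypothetical: no such API exists in parquet-mr according to this thread, and `read_batches` and the toy `Page` tuple are assumptions made for the example. The caller only asks for batches; page skipping by min/max statistics happens inside the reader. Because page-level skipping is coarse, surviving pages may still contain non-matching values, so the caller is assumed to re-apply the filter on each batch.

```python
from typing import Iterator, List, Tuple

# Toy page: (min_value, max_value, decoded values). Real pages hold encoded bytes.
Page = Tuple[int, int, List[int]]


def read_batches(pages: List[Page], lo: int, hi: int,
                 batch_size: int) -> Iterator[List[int]]:
    """Yield batches of at most batch_size values from pages overlapping [lo, hi]."""
    batch: List[int] = []
    for page_min, page_max, values in pages:
        if page_max < lo or page_min > hi:
            continue  # skipped using statistics only, values are never touched
        for value in values:
            batch.append(value)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # final, possibly shorter batch


pages = [
    (0, 3, [0, 1, 2, 3]),
    (4, 7, [4, 5, 6, 7]),
    (8, 11, [8, 9, 10, 11]),
]

for batch in read_batches(pages, lo=5, hi=9, batch_size=3):
    print(batch)
# [4, 5, 6]
# [7, 8, 9]
# [10, 11]
```

The first page is skipped entirely, while values such as 4, 10 and 11 still appear because their pages overlap the requested range.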

asfimport commented 4 years ago

Felix Kizhakkel Jose / @FelixKJose: @gszadovszky This Jira is for the long-term solution. But do we have any Jira for a short-term solution, since that could benefit many who are using Parquet + Spark?

asfimport commented 4 years ago

Gabor Szadovszky / @gszadovszky: @FelixKJose, you said you would prefer option 1. That one would be a Spark only change.

asfimport commented 4 years ago

Felix Kizhakkel Jose / @FelixKJose: Yes, for the short term, option 1. But this Jira is for the long-term solution, so I want it to stay open until we have a vectorized API. I will update the Spark Jira SPARK-26345 for the short-term solution.

asfimport commented 4 years ago

Gabor Szadovszky / @gszadovszky: Agreed. That's what I wanted to say some comments ago. :)

asfimport commented 4 years ago

Felix Kizhakkel Jose / @FelixKJose: (y)

asfimport commented 4 years ago

Xinli Shang / @shangxinli: @FelixKJose Do we have a Spark task created for implementing the short-term solution?

asfimport commented 4 years ago

Felix Kizhakkel Jose / @FelixKJose: https://issues.apache.org/jira/browse/SPARK-26345. But no one has picked up that Jira yet.

apache / parquet-java

Vectorized API to support Column Index in Apache Spark #2476