Open asfimport opened 4 years ago
Gabor Szadovszky / @gszadovszky: @FelixKJose, a vectorized API in parquet-mr has only come up as a topic in some of our discussions. No effort has been made to design or implement it. It is unfortunate that both Spark and Hive implemented their own vectorization on top of parquet-mr's internal API (e.g. reading pages directly) instead of relying on something common in parquet-mr. To design and implement such an API properly, we need design input from our users.
However, to support column indexes in Spark we might have some other approaches:
Having a simpler (not vectorized) API in parquet-mr that puts an abstraction layer on top of pages (by reading the triplets of value, definition level and repetition level from a row group)
pros: cleaner API in parquet-mr, possibly cleaner code in Spark, hiding the page skipping mechanism introduced by column indexes
cons: lower level API cannot be used anymore (e.g. Spark's own vectorized RLE decoder)
What do you think?
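To make the simpler, non-vectorized option above more concrete, here is a minimal sketch of what a triplet-level reader that hides page skipping could look like. Everything in it is an assumption for illustration only: `Triplet`, `Page`, `TripletReader` and `PageSkippingReader` are hypothetical names, not parquet-mr classes, and the "column index" is reduced to a per-page min/max over a single int column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class TripletReaderSketch {

    /** One (value, definition level, repetition level) triplet of a column. */
    static final class Triplet {
        final int value, definitionLevel, repetitionLevel;
        Triplet(int value, int definitionLevel, int repetitionLevel) {
            this.value = value;
            this.definitionLevel = definitionLevel;
            this.repetitionLevel = repetitionLevel;
        }
    }

    /** Toy page (needs at least one value): triplets plus the min/max a column index would record. */
    static final class Page {
        final int min, max;
        final List<Triplet> triplets = new ArrayList<>();
        Page(int... values) {
            min = Arrays.stream(values).min().getAsInt();
            max = Arrays.stream(values).max().getAsInt();
            // Flat optional column assumed: every value present (def = 1), no repetition.
            for (int v : values) triplets.add(new Triplet(v, 1, 0));
        }
    }

    /** Hypothetical API surface: callers iterate triplets and never see pages. */
    interface TripletReader extends Iterator<Triplet> {}

    /** Skips every page whose [min, max] range cannot overlap the filter range [lo, hi]. */
    static final class PageSkippingReader implements TripletReader {
        private final Iterator<Page> pages;
        private final int lo, hi;
        private Iterator<Triplet> current = Collections.emptyIterator();
        int pagesSkipped = 0;

        PageSkippingReader(List<Page> rowGroup, int lo, int hi) {
            this.pages = rowGroup.iterator();
            this.lo = lo;
            this.hi = hi;
        }

        @Override public boolean hasNext() {
            // Advance to the next page that may contain matching values.
            while (!current.hasNext() && pages.hasNext()) {
                Page p = pages.next();
                if (p.max < lo || p.min > hi) {
                    pagesSkipped++;
                } else {
                    current = p.triplets.iterator();
                }
            }
            return current.hasNext();
        }

        @Override public Triplet next() {
            if (!hasNext()) throw new NoSuchElementException();
            return current.next();
        }
    }

    /** Three toy pages standing in for one column chunk of a row group. */
    static List<Page> sampleRowGroup() {
        return Arrays.asList(new Page(1, 2, 3), new Page(10, 11, 12), new Page(20, 21, 22));
    }

    /** Drains the reader and returns the values of all surviving triplets. */
    static List<Integer> readValues(List<Page> rowGroup, int lo, int hi) {
        List<Integer> out = new ArrayList<>();
        TripletReader reader = new PageSkippingReader(rowGroup, lo, hi);
        while (reader.hasNext()) out.add(reader.next().value);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(readValues(sampleRowGroup(), 10, 12)); // prints [10, 11, 12]
    }
}
```

Note that page skipping in this sketch is coarse: a surviving page can still contain values outside the filter range, so the caller would still apply the actual predicate per value. A real implementation would also have to keep rows aligned across several columns, which this single-column toy ignores.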
Felix Kizhakkel Jose / @FelixKJose: Thank you @gszadovszky
IMHO, I prefer option 1 as a short-term workaround. It could benefit a lot of people through the large performance improvement that offset and column indexes bring.
But for the long term, in addition to option 2, make it a vectorized API and coordinate with the Spark team to integrate the work. Also avoid spilling logic outside of the core library (parquet) if possible, or at least keep it minimal.
Gabor Szadovszky / @gszadovszky: @FelixKJose, agreed. So this jira is to track the long-term effort of having a vectorized API in parquet-mr, so that our clients don't have to use our internal API to get fast reading while still having our ppd filtering (including column indexes and bloom filters) executed automatically under the hood.
Felix Kizhakkel Jose / @FelixKJose: @gszadovszky This Jira is for the long-term solution. But do we have any Jira for a short-term solution, since that could benefit many who are using Parquet + Spark?
Gabor Szadovszky / @gszadovszky: @FelixKJose, you said you would prefer option 1. That one would be a Spark-only change.
Felix Kizhakkel Jose / @FelixKJose: Yes, for the short term, option 1. But this Jira is for the long-term solution, so I want it to stay open until we have a vectorized API. I will update the Spark Jira SPARK-26345 for the short-term solution.
Gabor Szadovszky / @gszadovszky: Agreed. That's what I wanted to say some comments ago. :)
Felix Kizhakkel Jose / @FelixKJose: (y)
Xinli Shang / @shangxinli: @FelixKJose Do we have a Spark task created for implementing the short-term solution?
Felix Kizhakkel Jose / @FelixKJose: https://issues.apache.org/jira/browse/SPARK-26345. But no one has picked up that Jira yet.
As per the comment on https://issues.apache.org/jira/browse/SPARK-26345, it seems that Apache Spark doesn't support column indexes unless we disable the vectorized reader in Spark, which will have other performance implications. As per @zivanfi, parquet-mr should implement a vectorized API. Is it already implemented, or is there any pull request for it?
Reporter: Felix Kizhakkel Jose / @FelixKJose
Note: This issue was originally created as PARQUET-1830. Please see the migration documentation for further details.