Closed asfimport closed 5 years ago
Hatem Helal / @hatemhelal: I think this is a problem in parquet-cpp since I've confirmed that parquet-tools can read this file.
Hatem Helal / @hatemhelal: @wesm, my colleague @rdmello is working on a fix for this. Could you help us out by adding him as a contributor on this project? Thanks!
Wes McKinney / @wesm: Done
Wes McKinney / @wesm: Issue resolved by pull request 3312 https://github.com/apache/arrow/pull/3312
Tera G: Hi Everyone,
I see that this fix has been made in Arrow's record reader (record_reader.cc). I am using Parquet's low-level API to pull data from the Parquet file in my application.
I am facing the exact problem fixed by this Jira while using Parquet's low-level API (column_reader.cc).
As the current fix has not been ported to the low-level Parquet API, I wanted to know whether there are any plans to ship these changes to the low-level API.
Also, @rdmello, can I simply port the fixes you have made to the Parquet low-level API? Will this work?
We are using the low-level API as it offers us more power in terms of predicate pushdown, filtering, and skipping of data.
Finally, is the open-source community's recommendation that developers use Arrow's Parquet API or the low-level Parquet API to access Parquet data?
Thank you in advance for your response.
Rylan Dmello / @rdmello:
Hi [~terag], I haven't looked at implementing these changes with the low-level API yet. I see that "column_reader.cc" has a similar TypedRecordReader method as "record_reader.cc", and that there's a similar conditional statement there that excludes DATA_PAGE_V2 pages.
I'm not super familiar with the low-level API, but I think a similar set of changes might work for fixing this issue with the low-level API too. If you already have code that fixes this, I'd recommend sending in a pull request for this. Otherwise I can take a closer look at porting this fix to the low-level API tomorrow.
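To make the "conditional statement that excludes DATA_PAGE_V2 pages" concrete, here is a minimal, self-contained sketch of how such a check could plausibly lead to HasNext() returning false. It is a toy model, not the parquet-cpp code: ToyColumnReader, Page, CountValues, and the handle_v2 flag are all hypothetical names made up for illustration, and only the page type names mirror the Parquet format.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Page kinds named after the Parquet format's page types.
enum class PageType { DICTIONARY_PAGE, DATA_PAGE, DATA_PAGE_V2 };

// Hypothetical, heavily simplified page: just a type and a value count.
struct Page {
  PageType type;
  int num_values;
};

// Toy reader (an assumption for illustration, NOT the parquet-cpp class).
// With handle_v2 == false it models a reader whose page-type check only
// accepts DATA_PAGE, so DATA_PAGE_V2 pages are skipped silently.
class ToyColumnReader {
 public:
  ToyColumnReader(std::vector<Page> pages, bool handle_v2)
      : pages_(std::move(pages)), handle_v2_(handle_v2) {}

  // True if at least one more value can be read.
  bool HasNext() {
    while (buffered_ == 0) {
      if (!ReadNewPage()) return false;
    }
    return true;
  }

  // Consumes one value if any is available.
  void Next() {
    if (HasNext()) --buffered_;
  }

 private:
  // Advances to the next page this reader accepts.
  // Returns false once all pages have been consumed.
  bool ReadNewPage() {
    while (next_page_ < pages_.size()) {
      const Page& page = pages_[next_page_++];
      const bool accepted =
          page.type == PageType::DATA_PAGE ||
          (handle_v2_ && page.type == PageType::DATA_PAGE_V2);
      if (accepted) {
        buffered_ = page.num_values;
        return true;
      }
      // Any other page type falls through and is ignored.
    }
    return false;
  }

  std::vector<Page> pages_;
  bool handle_v2_;
  std::size_t next_page_ = 0;
  int buffered_ = 0;
};

// Counts how many values the reader yields before HasNext() turns false.
int CountValues(ToyColumnReader reader) {
  int count = 0;
  while (reader.HasNext()) {
    reader.Next();
    ++count;
  }
  return count;
}
```

In this model, a file containing only DATA_PAGE_V2 pages looks empty to a reader constructed with handle_v2 set to false, while the same pages yield all of their values once the check also accepts DATA_PAGE_V2. Whether the real low-level reader fails in exactly this way is an assumption.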
Tera G: Hi @rdmello,
Thank you so much for your quick response.
No, we have not yet started making those changes. I would really appreciate it if you could make them.
Thanks again.
Rylan Dmello / @rdmello:
Hi [~terag], sorry, I did take a look at this, but didn't really have the time to resolve this over the last few weeks.
I just opened a new Jira issue to add basic DataPageV2 support to the low-level API: https://issues.apache.org/jira/browse/PARQUET-1560 . I can add updates to that issue instead of this one, since this is already resolved.
I couldn't easily reproduce the issue when using the low-level API to read the 'feeds1kMicros.parquet' file generated by parquetjs. Either this has already been fixed in arrow/master, or I might need to dig deeper to understand the problem. Do you possibly have an example parquet file which isn't readable with the low-level API? If so, feel free to attach it to the new Jira issue I linked.
See the attached file. When I debug:
% ./parquet-reader feed1kMicros.parquet
I see that scanner->HasNext() always returns false.
Reporter: Hatem Helal / @hatemhelal
Assignee: Rylan Dmello / @rdmello
Note: This issue was originally created as PARQUET-1482. Please see the migration documentation for further details.