apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++][Parquet] Unable to read data from parquet file generated with parquetjs #42868

Closed by asfimport 5 years ago

asfimport commented 5 years ago

See the attached file. When I debug:

% ./parquet-reader feed1kMicros.parquet

I see that the scanner->HasNext() always returns false.

Reporter: Hatem Helal / @hatemhelal
Assignee: Rylan Dmello / @rdmello

Original Issue Attachments:

Note: This issue was originally created as PARQUET-1482. Please see the migration documentation for further details.

asfimport commented 5 years ago

Hatem Helal / @hatemhelal: I think this is a problem in parquet-cpp since I've confirmed that parquet-tools can read this file.

asfimport commented 5 years ago

Hatem Helal / @hatemhelal: @wesm, my colleague @rdmello is working on a fix for this. Could you help us out by adding him as a contributor on this project? Thanks!

asfimport commented 5 years ago

Wes McKinney / @wesm: Done

asfimport commented 5 years ago

Wes McKinney / @wesm: Issue resolved by pull request 3312 https://github.com/apache/arrow/pull/3312

asfimport commented 5 years ago

Tera G: Hi Everyone,

I see that this fix was made in Arrow's record reader (record_reader.cc). I am using Parquet's low-level API to pull data from Parquet files in my application.

I am facing the exact problem fixed by this Jira issue while using Parquet's low-level API (column_reader.cc).

As the current fix has not been ported to the low-level Parquet API, I wanted to know whether there are any plans to ship these changes to the low-level API.

Also, @rdmello, can I simply port the fixes you made to the Parquet low-level API? Will this work?

We are using the low-level API because it gives us more control over predicate pushdown, filtering, and skipping of data.

Finally, does the open source community recommend that developers use Arrow's Parquet API or the low-level Parquet API to access Parquet data?

Thank you in advance for your response. 

asfimport commented 5 years ago

Rylan Dmello / @rdmello: Hi [~terag], I haven't looked at implementing these changes with the low-level API yet. I see that "column_reader.cc" has a similar TypedRecordReader method as "record_reader.cc", and that there's a similar conditional statement there that excludes DATA_PAGE_V2 pages.

I'm not super familiar with the low-level API, but I think a similar set of changes might work for fixing this issue with the low-level API too. If you already have code that fixes this, I'd recommend sending in a pull request for this. Otherwise I can take a closer look at porting this fix to the low-level API tomorrow.
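To illustrate the symptom discussed in this thread, here is a minimal, self-contained sketch of how a page-type conditional that excludes DATA_PAGE_V2 could make a `HasNext()`-style scan loop see no values at all. This is a toy model, not parquet-cpp code: `ToyPage`, `ToyColumnReader`, `CountScannedValues`, and the `accept_v2` flag are hypothetical names introduced only for illustration, and the assumption that every non-accepted page is silently skipped is a simplification. Only the `PageType` values are taken from the Parquet format definition.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Page type values as defined by the Parquet format (parquet.thrift).
enum class PageType { DATA_PAGE = 0, INDEX_PAGE = 1, DICTIONARY_PAGE = 2, DATA_PAGE_V2 = 3 };

// Hypothetical stand-in for one page of a column chunk (not a parquet-cpp type).
struct ToyPage {
  PageType type;
  int num_values;
};

// Hypothetical toy reader that mimics a HasNext()/Next() scanning pattern.
class ToyColumnReader {
 public:
  ToyColumnReader(std::vector<ToyPage> pages, bool accept_v2)
      : pages_(std::move(pages)), accept_v2_(accept_v2) {}

  // Returns true if a value is available, advancing to the next accepted
  // data page when the current one is exhausted.
  bool HasNext() {
    while (values_left_ == 0) {
      if (next_page_ >= pages_.size()) return false;  // no pages left
      const ToyPage& page = pages_[next_page_++];
      const bool is_data_page =
          page.type == PageType::DATA_PAGE ||
          (accept_v2_ && page.type == PageType::DATA_PAGE_V2);
      if (!is_data_page) continue;  // assumption: other pages are skipped
      assert(page.num_values >= 0);
      values_left_ = page.num_values;
    }
    return true;
  }

  // Consumes one value; only valid after HasNext() returned true.
  void Next() {
    assert(values_left_ > 0);
    --values_left_;
  }

 private:
  std::vector<ToyPage> pages_;
  bool accept_v2_;
  std::size_t next_page_ = 0;
  int values_left_ = 0;
};

// Counts how many values a scan loop would visit.
int CountScannedValues(ToyColumnReader reader) {
  int count = 0;
  while (reader.HasNext()) {
    reader.Next();
    ++count;
  }
  return count;
}
```

With a column chunk made of one dictionary page followed by two DATA_PAGE_V2 pages holding 5 and 4 values, `CountScannedValues(ToyColumnReader(pages, false))` returns 0 because `HasNext()` is false on the first call, while passing `true` for `accept_v2` returns 9. In this toy model, accepting the V2 page type in the conditional is what restores the scan; the real fix also has to decode the V2 page layout, which this sketch does not attempt.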

asfimport commented 5 years ago

Tera G: Hi @rdmello, 

Thank you so much for your quick response. 

No, we have not started making those changes yet. I would really appreciate it if you could make them.

Thanks again.

asfimport commented 5 years ago

Tera G: Hi @rdmello, 

Did you get time to look into the problem?

Thanks.

asfimport commented 5 years ago

Rylan Dmello / @rdmello: Hi [~terag], sorry, I did take a look at this, but didn't really have the time to resolve this over the last few weeks.

I just opened a new Jira issue to add basic DataPageV2 support to the low-level API: https://issues.apache.org/jira/browse/PARQUET-1560 . I can add updates to that issue instead of this one, since this is already resolved.

I couldn't easily reproduce the issue when using the low-level API to read the 'feeds1kMicros.parquet' file generated by parquetjs. Either this has already been fixed in arrow/master, or I might need to dig deeper to understand the problem. Do you possibly have an example parquet file which isn't readable with the low-level API? If so, feel free to attach it to the new Jira issue I linked.

asfimport commented 5 years ago

Tera G: Hi @rdmello, 

Sorry for the delayed response. I was on vacation for the last 2 weeks.

I have attached the v2 file to the PARQUET-1560 Jira issue.