Closed sfescape closed 5 years ago
Can you please post the entire code/commands required to replicate? It's hard to tell whether this is a parquetjs or parquet-cpp issue without any further information.
Can you try writing the same file with parquet-cpp and reading it with both parquet-tools and parquetjs?
The command to write was example/writers.js. Using the simple example on the front page produces similar output.
I used https://github.com/skale-me/node-parquet to read the file.
```javascript
var parquet = require('./index');
var reader = new parquet.ParquetReader('fruits.parquet');
console.log(reader.info());
console.log(reader.rows());
reader.close();
```
A simple file written using node-parquet is readable by parquetjs (although parquetjs complains that the date fields are invalid, which I haven't looked at yet; they looked fine at first glance using parquet-tools).
As I mentioned in the other issue I filed, files written by parquetjs are also not readable by dremio (which also uses parquet-cpp). It's quite a pain to build stuff based on parquet-cpp.
Oh, all of the output of parquetjs is readable by parquet-tools. Same for the node-parquet output.
name: BINARY UNCOMPRESSED DO:0 FPO:4 SZ:91/91/1.00 VC:3 ENC:PLAIN,RLE
quantity: INT64 UNCOMPRESSED DO:0 FPO:161 SZ:88/88/1.00 VC:3 ENC:PLAIN,RLE
price: DOUBLE UNCOMPRESSED DO:0 FPO:326 SZ:92/92/1.00 VC:3 ENC:PLAIN,RLE
date: INT64 UNCOMPRESSED DO:0 FPO:492 SZ:92/92/1.00 VC:3 ENC:PLAIN,RLE
in_stock: BOOLEAN UNCOMPRESSED DO:0 FPO:657 SZ:41/41/1.00 VC:3 ENC:PLAIN,RLE
colour: BINARY UNCOMPRESSED DO:0 FPO:745 SZ:100/100/1.00 VC:5 ENC:PLAIN,RLE
meta_json: BINARY UNCOMPRESSED DO:0 FPO:904 SZ:209/209/1.00 VC:3 ENC:PLAIN,RLE

name TV=3 RL=0 DL=0
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN ST:[no stats for this column] SZ:29 VC:3

quantity TV=3 RL=0 DL=1
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN ST:[min: 10, max: 20, num_nulls: 1] SZ:20 [more]...

price TV=3 RL=0 DL=0
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN ST:[min: 2.60000, max: 4.20000, [more]... VC:3

date TV=3 RL=0 DL=0
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN ST:[min: 1534704624839000, max: [more]... VC:3

in_stock TV=3 RL=0 DL=0
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN ST:[min: false, max: true, num_nulls: 0] [more]...

colour TV=5 RL=1 DL=1
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN ST:[no stats for this column] SZ:48 VC:5

meta_json TV=3 RL=0 DL=1
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN ST:[no stats for this column] SZ:41 VC:3

row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:apples
value 2: R:0 D:0 V:oranges
value 3: R:0 D:0 V:kiwi

row group 1 of 1, values 1 to 3
value 1: R:0 D:1 V:10
value 2: R:0 D:1 V:20
value 3: R:0 D:0 V:

row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:2.6
value 2: R:0 D:0 V:2.7
value 3: R:0 D:0 V:4.2

row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:1534704624839000
value 2: R:0 D:0 V:1534704624840000
value 3: R:0 D:0 V:1534704624840000

row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:true
value 2: R:0 D:0 V:true
value 3: R:0 D:0 V:false

row group 1 of 1, values 1 to 5
value 1: R:0 D:1 V:green
value 2: R:1 D:1 V:red
value 3: R:0 D:1 V:orange
value 4: R:0 D:1 V:green
value 5: R:1 D:1 V:brown

row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:
parquet-tools output for the node-parquet file:
name: BOOLEAN SNAPPY DO:0 FPO:4 SZ:30/28/0.93 VC:3 ENC:RLE,PLAIN
quantity: INT64 SNAPPY DO:78 FPO:102 SZ:70/66/0.94 VC:3 ENC:PLAIN_DIC [more]...
price: DOUBLE SNAPPY DO:232 FPO:256 SZ:76/72/0.95 VC:3 ENC:PLAIN_D [more]...
date: INT64 SNAPPY DO:389 FPO:413 SZ:70/66/0.94 VC:3 ENC:PLAIN_DI [more]...
in_stock: BOOLEAN SNAPPY DO:0 FPO:539 SZ:30/28/0.93 VC:3 ENC:RLE,PLAIN

name TV=3 RL=0 DL=0
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN ST:[min: tr [more]... VC:3

quantity TV=3 RL=0 DL=0 DS: 1 DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN_DICTIONARY [more]... VC:3

price TV=3 RL=0 DL=1 DS: 1 DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN_DICTIONARY [more]... VC:3

date TV=3 RL=0 DL=0 DS: 1 DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN_DICTIONARY [more]... VC:3

in_stock TV=3 RL=0 DL=0
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN ST:[min: tr [more]... VC:3

row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:true
value 2: R:0 D:0 V:true
value 3: R:0 D:0 V:true

row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:10
value 2: R:0 D:0 V:10
value 3: R:0 D:0 V:10

row group 1 of 1, values 1 to 3
value 1: R:0 D:1 V:2.5
value 2: R:0 D:1 V:2.5
value 3: R:0 D:0 V:

row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:1534703669570
value 2: R:0 D:0 V:1534703669570
value 3: R:0 D:0 V:1534703669570

row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:true
value 2: R:0 D:0 V:true
value 3: R:0 D:0 V:true
I'm not entirely sure here, but using pyarrow I get the following error message:
In [1]: import pyarrow.parquet as pq
In [2]: table = pq.read_table('./fruits.parquet')
---------------------------------------------------------------------------
ArrowNotImplementedError: No support for reading columns of type list<colour: string not null>
There is a problem with repeated UTF8. I'm not sure whether parquet-cpp is in the background here or not, but I noticed in your parquet-cpp output that the "repeat" property was missing from the `colour` definition.
Can you try removing `colour` (the repeated UTF8 field) from the parquetjs file and try again?
I had trouble getting the more complex file to write via node-parquet (they have a more complex schema definition and data structure) so I just used the simple one from the front page as it also demonstrates the issue.
```javascript
{
  name: { type: 'string' },
  quantity: { type: 'int64' },
  price: { optional: true, type: 'double' },
  date: { type: 'timestamp' },
  in_stock: { type: 'bool' }
}
```
parquetjs:
```javascript
var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  quantity: { type: 'INT64' },
  price: { type: 'DOUBLE', optional: true },
  date: { type: 'TIMESTAMP_MILLIS' },
  in_stock: { type: 'BOOLEAN' }
});
```
Trying to read the simple case using parquet-cpp from parquetjs:
{ version: 0, createdBy: 'parquet.js', rowGroups: 1, columns: 5, rows: 3,
  root: {
    name: { type: 'string' },
    quantity: { type: 'int64' },
    price: { optional: true, type: 'double' },
    date: { type: 'timestamp' },
    in_stock: { type: 'bool' } } }
[ [ undefined, undefined, undefined, undefined, undefined ],
  [ undefined, undefined, undefined, undefined, undefined ],
  [ undefined, undefined, undefined, undefined, undefined ] ]
parquet-tools:
name: BINARY UNCOMPRESSED DO:0 FPO:4 SZ:88/88/1.00 VC:3 ENC:PLAIN,RLE
quantity: INT64 UNCOMPRESSED DO:0 FPO:154 SZ:92/92/1.00 VC:3 ENC:PLAIN,RLE
price: DOUBLE UNCOMPRESSED DO:0 FPO:323 SZ:88/88/1.00 VC:3 ENC:PLAIN,RLE
date: INT64 UNCOMPRESSED DO:0 FPO:485 SZ:92/92/1.00 VC:3 ENC:PLAIN,RLE
in_stock: BOOLEAN UNCOMPRESSED DO:0 FPO:650 SZ:41/41/1.00 VC:3 ENC:PLAIN,RLE

name TV=3 RL=0 DL=0
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN ST:[no stats for this column] SZ:30 VC:3

quantity TV=3 RL=0 DL=0
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN ST:[min: 10, max: 10, num_nulls: 0] SZ:24 [more]...

price TV=3 RL=0 DL=1
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN ST:[min: 2.50000, max: 2.50000, [more]... VC:3

date TV=3 RL=0 DL=0
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN ST:[min: 1534718747779, max: 15 [more]... VC:3

in_stock TV=3 RL=0 DL=0
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN ST:[min: true, max: true, num_nulls: 0] [more]...

row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:apples
value 2: R:0 D:0 V:oranges
value 3: R:0 D:0 V:pears

row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:10
value 2: R:0 D:0 V:10
value 3: R:0 D:0 V:10

row group 1 of 1, values 1 to 3
value 1: R:0 D:1 V:2.5
value 2: R:0 D:1 V:2.5
value 3: R:0 D:0 V:

row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:1534718747779
value 2: R:0 D:0 V:1534718747779
value 3: R:0 D:0 V:1534718747780

row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:true
value 2: R:0 D:0 V:true
value 3: R:0 D:0 V:true
writing from node-parquet and reading with parquetjs:
{ name: true, quantity: '10', price: '2.5', date: Invalid Date, in_stock: true }
{ name: true, quantity: '10', price: '2.5', date: Invalid Date, in_stock: true }
{ name: true, quantity: '10', date: Invalid Date, in_stock: true }
parquet-tools:
name: BOOLEAN UNCOMPRESSED DO:0 FPO:4 SZ:28/28/1.00 VC:3 ENC:RLE,PLAIN
quantity: INT64 UNCOMPRESSED DO:76 FPO:98 SZ:66/66/1.00 VC:3 ENC:PLAI [more]...
price: DOUBLE UNCOMPRESSED DO:226 FPO:248 SZ:72/72/1.00 VC:3 ENC:P [more]...
date: INT64 UNCOMPRESSED DO:379 FPO:401 SZ:66/66/1.00 VC:3 ENC:PL [more]...
in_stock: BOOLEAN UNCOMPRESSED DO:0 FPO:525 SZ:28/28/1.00 VC:3 ENC:RLE,PLAIN

name TV=3 RL=0 DL=0
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN ST:[min: tr [more]... VC:3

quantity TV=3 RL=0 DL=0 DS: 1 DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN_DICTIONARY [more]... VC:3

price TV=3 RL=0 DL=1 DS: 1 DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN_DICTIONARY [more]... VC:3

date TV=3 RL=0 DL=0 DS: 1 DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN_DICTIONARY [more]... VC:3

in_stock TV=3 RL=0 DL=0
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:PLAIN ST:[min: tr [more]... VC:3

row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:true
value 2: R:0 D:0 V:true
value 3: R:0 D:0 V:true

row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:10
value 2: R:0 D:0 V:10
value 3: R:0 D:0 V:10

row group 1 of 1, values 1 to 3
value 1: R:0 D:1 V:2.5
value 2: R:0 D:1 V:2.5
value 3: R:0 D:0 V:

row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:1534718580347
value 2: R:0 D:0 V:1534718580347
value 3: R:0 D:0 V:1534718580347

row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:true
value 2: R:0 D:0 V:true
value 3: R:0 D:0 V:true
parquet-tools reads both, but parquetjs doesn't correctly read from parquet-cpp (node-parquet or dremio) and vice versa.
Can you upload a sample file created by parquet-cpp that fails to read in parquetjs? That might help in debugging the discrepancy.
parquet.tar.gz: one file created by parquetjs, the other by node-parquet.
parquetjs gives invalid dates for the node-parquet file, and node-parquet gives undefined values for the parquetjs file (dremio just reports an error reading the names)
Thanks @sfescape - at least I fixed the invalid dates here: https://github.com/ZJONSSON/parquetjs/commit/c9579d283d9e09bb01927144378246070127a177
It's good that this validates against the output of the node-parquet file (which is readable by dremio as well). Hopefully there are good clues for getting compatibility going the other way as well.
I wrote a larger file using parquetjs with optional fields and SNAPPY compression. I can read the file using parquetjs but not using parquet-tools or parquet-cpp. I'll see if I can reproduce with a smaller amount of data and data that I can post. This is probably not enough info to determine the issue, but I'm not sure what helps in debugging at this level.
The docker container I used is publicly available so you can run parquet-tools against any parquet file you generate via parquetjs.
Here is the error:
`docker run --rm -v $(pwd):/data nathanhowell/parquet-tools meta --debug /data/0_0_0.parquet`
java.io.IOException: Could not read footer: java.lang.ArrayIndexOutOfBoundsException: 7
at org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:269)
at org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:210)
at org.apache.parquet.hadoop.ParquetFileReader.readFooters(ParquetFileReader.java:327)
at org.apache.parquet.tools.command.ShowMetaCommand.execute(ShowMetaCommand.java:62)
at org.apache.parquet.tools.Main.main(Main.java:223)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 7
at org.apache.parquet.bytes.BytesUtils.bytesToLong(BytesUtils.java:310)
at org.apache.parquet.column.statistics.LongStatistics.setMinMaxFromBytes(LongStatistics.java:49)
at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatisticsInternal(ParquetMetadataConverter.java:349)
at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:360)
at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:816)
at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:793)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:502)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:461)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:437)
at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:259)
at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:255)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Could not read footer: java.lang.ArrayIndexOutOfBoundsException: 7
Sorry, misclicked the button... :(
So, I think this is related to optional columns which do not appear in the dataset at all. Parquet-tools is throwing errors in that case.
Interesting - if you use a compression other than SNAPPY do you still get the same errors? (if not, this might be SNAPPY related)
It happens uncompressed as well.
If you just add an optional timestamp field to the fruits example and run it (so the new field has no data), and then run the docker command I gave above you should see the issue.
This may be true of all datatypes; I haven't run them all through the same test.
Update: you don't get the error with the UTF8 datatype.
Affected datatypes: BOOLEAN, DOUBLE, TIMESTAMP_MILLIS, INT32.
It seems to be any datatype that needs a conversion, so everything except binary?
Attached please find two files, both written with the following schema
`{ name: { type: 'UTF8' }, my_boolean: { type: 'BOOLEAN', optional: true } }`
One row written: {name:'hello'}
The one written by node-parquet (parquet-cpp) can be read by all three readers (node-parquet, parquetjs, parquet-tools); the second can only be read by parquetjs. Hope this helps!
Ok, I've figured out that some of this was caused by parquetjs defaulting to V2 data pages. I don't really know what V2 pages are, but because there was an option useDataPageV2 I assumed the default was not to use them, so I didn't test it. After reading the code I realized it was the default, and now (along with the date fix) node-parquet and parquetjs work much better together.
Dremio still has a problem reading parquetjs written files when optional fields are present. I'll continue to research that.
I'm not clear on why data page V2 is the default; I suggest perhaps it should not be, for compatibility with other systems.
The only remaining compatibility issue I'm seeing is that node-parquet and parquetjs can read files with optional columns that contain no data, but parquet-tools and dremio cannot.
This is where things go wrong in parquet-mr:
```java
if (statistics.isSetMax() && statistics.isSetMin()) {
  stats.setMinMaxFromBytes(statistics.min.array(), statistics.max.array());
}
```
It sees min and max as set and tries to decode their byte arrays, which I expect are not properly populated in the parquetjs output.
I figured out the issue. When there are only nulls in the column, parquetjs writes an empty buffer rather than null for the min/max column statistics.
Setting statistics.min_value and statistics.max_value to null when the encoded buffer length is zero allows all the libraries to read the output correctly.
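A sketch of that fix in isolation (the actual change belongs in parquetjs's encodeStatisticsValue path; `normalizeStatistics` is a name invented here):

```javascript
// If the encoded min/max buffers are empty (column contains only nulls),
// emit null statistics instead of zero-length buffers, so readers like
// parquet-mr and parquet-cpp skip the stats rather than decoding them.
function normalizeStatistics(stats) {
  if (stats.min_value && stats.min_value.length === 0) stats.min_value = null;
  if (stats.max_value && stats.max_value.length === 0) stats.max_value = null;
  return stats;
}

// All-null column: empty buffers become null.
const fixed = normalizeStatistics({ min_value: Buffer.alloc(0), max_value: Buffer.alloc(0) });

// Populated statistics are left untouched.
const kept = normalizeStatistics({ min_value: Buffer.from([10]), max_value: Buffer.from([20]) });
```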
Awesome!!!! Do you want to submit a PR?
I can change it, but I don't know how I would write a test that fails without the change, since the problem only shows up when using these other libraries. This function (encodeStatisticsValue) is also used to encode column statistics, and while that isn't failing with the other libraries, I'm not sure whether there is a similar case there under some circumstance.
Right now I'm looking at the Data Page V2 code to see if I can figure out why it doesn't work with parquet-cpp when turned on.
Update: parquet-cpp didn't read the statistics because, I think, it just won't when using UTF8. Switching to INT64, as in my example, did get the statistics read. Still, the values are not read properly, so I still don't see why data page V2 doesn't work with parquet-cpp.
I found the root cause: parquet-cpp just doesn't support DATA_PAGE_V2. So I added compression support for DATA_PAGE and just won't use DATA_PAGE_V2.
I suggest that, since not all of the major libraries support DATA_PAGE_V2 yet, the default should be DATA_PAGE.
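For anyone hitting the same wall, the writer option discussed above can be set explicitly. The option name `useDataPageV2` is my reading of the parquetjs writer options, so treat this as a sketch rather than verified documentation:

```javascript
// Sketch: force classic DATA_PAGE output so parquet-cpp based readers
// (node-parquet, dremio) can consume the file. The option name
// useDataPageV2 is the parquetjs writer option discussed above.
const writerOptions = { useDataPageV2: false };

// Then pass it when opening the writer, e.g. (assuming the parquetjs API):
// const parquet = require('parquetjs');
// const writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet', writerOptions);
```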
This is very helpful @sfescape. It is a little bit annoying that parquet-cpp does not throw an exception upon DATA_PAGE_V2 (maybe we should raise an issue or do a PR). Are there any other major libraries with the same limitation? To my understanding, the majority of users are using parquet-mr (with Spark, for example).
At the very least we should clarify the limitations of parquet-cpp in the README of parquetjs. We could change the default, but I'm not sure that would be ideal.
I only know parquet-mr and parquet-cpp, so I don't know about other libraries. parquet-mr was fine; it was parquet-cpp that seems to be lagging. I know they've had some trouble getting changes released and are voting right now on combining the arrow and parquet repositories to improve that.
I couldn't even find the DATA_PAGE_V2 vs DATA_PAGE specification; the documentation is pretty sparse or very hard to find. I haven't looked at C++ code for many years, so figuring out that they could read the thrift properly but dropped it at a higher level took a lot of time.
In the end, I was able to figure out that the file written by parquet-cpp was DATA_PAGE, and from there figured out how to add DATA_PAGE compression to parquetjs. That's the most important part for me: I can now write files that can be read by parquet-cpp based tools, and they're compressed.
Thank you @sfescape - would you mind updating the README to help anyone who has the same problems with parquet-cpp - just a quick note?
Thanks for all the work here @sfescape - I think it's safe to close this issue now, feel free to reopen as needed and please report any issues or improvements that would make sense!
I used the example to write a parquet file, and then tried to read it using parquet-cpp and got the following output (note: the file is readable using parquet-tools):
{ version: 0, createdBy: 'parquet.js', rowGroups: 1, columns: 7, rows: 3,
  root: {
    name: { type: 'string' },
    quantity: { optional: true, type: 'int64' },
    price: { type: 'double' },
    date: { type: 'int64' },
    in_stock: { type: 'bool' },
    colour: { type: 'string' },
    meta_json: { optional: true, type: 'byte_array' } } }
[ [ undefined, undefined, undefined, undefined, undefined, undefined, undefined ],
  [ undefined, undefined, undefined, undefined, undefined, undefined, undefined ],
  [ undefined, undefined, undefined, undefined, undefined, undefined, undefined ] ]