ZJONSSON / parquetjs

fully asynchronous, pure JavaScript implementation of the Parquet file format
MIT License

written files not readable by parquet-cpp #24

Closed sfescape closed 5 years ago

sfescape commented 5 years ago

I used the example to write a parquet file, and then tried to read it using parquet-cpp and got the following output (note: the file is readable using parquet-tools):

```
{ version: 0, createdBy: 'parquet.js', rowGroups: 1, columns: 7, rows: 3,
  root:
   { name: { type: 'string' },
     quantity: { optional: true, type: 'int64' },
     price: { type: 'double' },
     date: { type: 'int64' },
     in_stock: { type: 'bool' },
     colour: { type: 'string' },
     meta_json: { optional: true, type: 'byte_array' } } }
[ [ undefined, undefined, undefined, undefined, undefined, undefined, undefined ],
  [ undefined, undefined, undefined, undefined, undefined, undefined, undefined ],
  [ undefined, undefined, undefined, undefined, undefined, undefined, undefined ] ]
```

ZJONSSON commented 5 years ago

Can you please post the entire code/commands required to replicate? It's hard to tell whether this is a parquetjs or parquet-cpp issue without further information.

Can you try writing the same file with parquet-cpp and read it with parquet-tools and parquetjs?

sfescape commented 5 years ago

The command to write was example/writers.js. Using the simple example on the front page produces similar output.

I used https://github.com/skale-me/node-parquet to read the file.

```js
var parquet = require('./index');

var reader = new parquet.ParquetReader('fruits.parquet');
console.log(reader.info());
console.log(reader.rows());
reader.close();
```

Writing a simple file using node-parquet produces a file that is readable by parquetjs (although parquetjs complains that the date fields are invalid, which I haven't looked at yet; they looked fine at first glance using parquet-tools).

As I mentioned in the other issue I filed, files written by parquetjs are also not readable by dremio (which also uses parquet-cpp). It's quite a pain to build stuff based on parquet-cpp.

sfescape commented 5 years ago

Oh, all of the output of parquetjs is readable by parquet-tools. Same for the node-parquet output.

```
row group 0
name:      BINARY UNCOMPRESSED DO:0 FPO:4 SZ:91/91/1.00 VC:3 ENC:PLAIN,RLE
quantity:  INT64 UNCOMPRESSED DO:0 FPO:161 SZ:88/88/1.00 VC:3 ENC:PLAIN,RLE
price:     DOUBLE UNCOMPRESSED DO:0 FPO:326 SZ:92/92/1.00 VC:3 ENC:PLAIN,RLE
date:      INT64 UNCOMPRESSED DO:0 FPO:492 SZ:92/92/1.00 VC:3 ENC:PLAIN,RLE
in_stock:  BOOLEAN UNCOMPRESSED DO:0 FPO:657 SZ:41/41/1.00 VC:3 ENC:PLAIN,RLE
colour:    BINARY UNCOMPRESSED DO:0 FPO:745 SZ:100/100/1.00 VC:5 ENC:PLAIN,RLE
meta_json: BINARY UNCOMPRESSED DO:0 FPO:904 SZ:209/209/1.00 VC:3 ENC:PLAIN,RLE

name TV=3 RL=0 DL=0
----------------------------------------------------------------------------
page 0:  DLE:RLE RLE:RLE VLE:PLAIN ST:[no stats for this column] SZ:29 VC:3

quantity TV=3 RL=0 DL=1
----------------------------------------------------------------------------
page 0:  DLE:RLE RLE:RLE VLE:PLAIN ST:[min: 10, max: 20, num_nulls: 1] SZ:20 [more]...

price TV=3 RL=0 DL=0
----------------------------------------------------------------------------
page 0:  DLE:RLE RLE:RLE VLE:PLAIN ST:[min: 2.60000, max: 4.20000, [more]... VC:3

date TV=3 RL=0 DL=0
----------------------------------------------------------------------------
page 0:  DLE:RLE RLE:RLE VLE:PLAIN ST:[min: 1534704624839000, max: [more]... VC:3

in_stock TV=3 RL=0 DL=0
----------------------------------------------------------------------------
page 0:  DLE:RLE RLE:RLE VLE:PLAIN ST:[min: false, max: true, num_nulls: 0] [more]...

colour TV=5 RL=1 DL=1
----------------------------------------------------------------------------
page 0:  DLE:RLE RLE:RLE VLE:PLAIN ST:[no stats for this column] SZ:48 VC:5

meta_json TV=3 RL=0 DL=1
----------------------------------------------------------------------------
page 0:  DLE:RLE RLE:RLE VLE:PLAIN ST:[no stats for this column] SZ:41 VC:3

BINARY name
row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:apples
value 2: R:0 D:0 V:oranges
value 3: R:0 D:0 V:kiwi

INT64 quantity
row group 1 of 1, values 1 to 3
value 1: R:0 D:1 V:10
value 2: R:0 D:1 V:20
value 3: R:0 D:0 V:

DOUBLE price
row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:2.6
value 2: R:0 D:0 V:2.7
value 3: R:0 D:0 V:4.2

INT64 date
row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:1534704624839000
value 2: R:0 D:0 V:1534704624840000
value 3: R:0 D:0 V:1534704624840000

BOOLEAN in_stock
row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:true
value 2: R:0 D:0 V:true
value 3: R:0 D:0 V:false

BINARY colour
row group 1 of 1, values 1 to 5
value 1: R:0 D:1 V:green
value 2: R:1 D:1 V:red
value 3: R:0 D:1 V:orange
value 4: R:0 D:1 V:green
value 5: R:1 D:1 V:brown

BINARY meta_json
row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:
value 2: R:0 D:0 V:
value 3: R:0 D:1 V:
```

sfescape commented 5 years ago

parquet-tools output for the node-parquet file:

```
row group 0
name:      BOOLEAN SNAPPY DO:0 FPO:4 SZ:30/28/0.93 VC:3 ENC:RLE,PLAIN
quantity:  INT64 SNAPPY DO:78 FPO:102 SZ:70/66/0.94 VC:3 ENC:PLAIN_DIC [more]...
price:     DOUBLE SNAPPY DO:232 FPO:256 SZ:76/72/0.95 VC:3 ENC:PLAIN_D [more]...
date:      INT64 SNAPPY DO:389 FPO:413 SZ:70/66/0.94 VC:3 ENC:PLAIN_DI [more]...
in_stock:  BOOLEAN SNAPPY DO:0 FPO:539 SZ:30/28/0.93 VC:3 ENC:RLE,PLAIN

name TV=3 RL=0 DL=0
----------------------------------------------------------------------------
page 0:  DLE:RLE RLE:RLE VLE:PLAIN ST:[min: tr [more]... VC:3

quantity TV=3 RL=0 DL=0 DS: 1 DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------
page 0:  DLE:RLE RLE:RLE VLE:PLAIN_DICTIONARY [more]... VC:3

price TV=3 RL=0 DL=1 DS: 1 DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------
page 0:  DLE:RLE RLE:RLE VLE:PLAIN_DICTIONARY [more]... VC:3

date TV=3 RL=0 DL=0 DS: 1 DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------
page 0:  DLE:RLE RLE:RLE VLE:PLAIN_DICTIONARY [more]... VC:3

in_stock TV=3 RL=0 DL=0
----------------------------------------------------------------------------
page 0:  DLE:RLE RLE:RLE VLE:PLAIN ST:[min: tr [more]... VC:3

BOOLEAN name
row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:true
value 2: R:0 D:0 V:true
value 3: R:0 D:0 V:true

INT64 quantity
row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:10
value 2: R:0 D:0 V:10
value 3: R:0 D:0 V:10

DOUBLE price
row group 1 of 1, values 1 to 3
value 1: R:0 D:1 V:2.5
value 2: R:0 D:1 V:2.5
value 3: R:0 D:0 V:

INT64 date
row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:1534703669570
value 2: R:0 D:0 V:1534703669570
value 3: R:0 D:0 V:1534703669570

BOOLEAN in_stock
row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:true
value 2: R:0 D:0 V:true
value 3: R:0 D:0 V:true
```

ZJONSSON commented 5 years ago

I'm not entirely sure here, but using pyarrow I get the following error message:

```
In [1]: import pyarrow.parquet as pq

In [2]: table = pq.read_table('./fruits.parquet')
---------------------------------------------------------------------------
ArrowNotImplementedError: No support for reading columns of type list<colour: string not null>
```

There is a problem with repeated UTF8. Not sure whether it's parquet-cpp in the background here or not, but I noticed in your parquet-cpp output that the "repeated" property was missing from the colour definition.

Can you try removing colour (the repeated UTF8 field) from the parquetjs file and trying again?

sfescape commented 5 years ago

I had trouble getting the more complex file to write via node-parquet (it has a more complex schema definition and data structure), so I just used the simple one from the front page as it also demonstrates the issue.

node-parquet:

```js
{ name: { type: 'string' },
  quantity: { type: 'int64' },
  price: { optional: true, type: 'double' },
  date: { type: 'timestamp' },
  in_stock: { type: 'bool' } }
```

parquetjs:

```js
var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  quantity: { type: 'INT64' },
  price: { type: 'DOUBLE', optional: true },
  date: { type: 'TIMESTAMP_MILLIS' },
  in_stock: { type: 'BOOLEAN' }
});
```

Trying to read the simple case using parquet-cpp from parquetjs:

```
{ version: 0, createdBy: 'parquet.js', rowGroups: 1, columns: 5, rows: 3,
  root:
   { name: { type: 'string' },
     quantity: { type: 'int64' },
     price: { optional: true, type: 'double' },
     date: { type: 'timestamp' },
     in_stock: { type: 'bool' } } }
[ [ undefined, undefined, undefined, undefined, undefined ],
  [ undefined, undefined, undefined, undefined, undefined ],
  [ undefined, undefined, undefined, undefined, undefined ] ]
```

tools:

```
row group 0
name:      BINARY UNCOMPRESSED DO:0 FPO:4 SZ:88/88/1.00 VC:3 ENC:PLAIN,RLE
quantity:  INT64 UNCOMPRESSED DO:0 FPO:154 SZ:92/92/1.00 VC:3 ENC:PLAIN,RLE
price:     DOUBLE UNCOMPRESSED DO:0 FPO:323 SZ:88/88/1.00 VC:3 ENC:PLAIN,RLE
date:      INT64 UNCOMPRESSED DO:0 FPO:485 SZ:92/92/1.00 VC:3 ENC:PLAIN,RLE
in_stock:  BOOLEAN UNCOMPRESSED DO:0 FPO:650 SZ:41/41/1.00 VC:3 ENC:PLAIN,RLE

name TV=3 RL=0 DL=0
----------------------------------------------------------------------------
page 0:  DLE:RLE RLE:RLE VLE:PLAIN ST:[no stats for this column] SZ:30 VC:3

quantity TV=3 RL=0 DL=0
----------------------------------------------------------------------------
page 0:  DLE:RLE RLE:RLE VLE:PLAIN ST:[min: 10, max: 10, num_nulls: 0] SZ:24 [more]...

price TV=3 RL=0 DL=1
----------------------------------------------------------------------------
page 0:  DLE:RLE RLE:RLE VLE:PLAIN ST:[min: 2.50000, max: 2.50000, [more]... VC:3

date TV=3 RL=0 DL=0
----------------------------------------------------------------------------
page 0:  DLE:RLE RLE:RLE VLE:PLAIN ST:[min: 1534718747779, max: 15 [more]... VC:3

in_stock TV=3 RL=0 DL=0
----------------------------------------------------------------------------
page 0:  DLE:RLE RLE:RLE VLE:PLAIN ST:[min: true, max: true, num_nulls: 0] [more]...

BINARY name
row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:apples
value 2: R:0 D:0 V:oranges
value 3: R:0 D:0 V:pears

INT64 quantity
row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:10
value 2: R:0 D:0 V:10
value 3: R:0 D:0 V:10

DOUBLE price
row group 1 of 1, values 1 to 3
value 1: R:0 D:1 V:2.5
value 2: R:0 D:1 V:2.5
value 3: R:0 D:0 V:

INT64 date
row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:1534718747779
value 2: R:0 D:0 V:1534718747779
value 3: R:0 D:0 V:1534718747780

BOOLEAN in_stock
row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:true
value 2: R:0 D:0 V:true
value 3: R:0 D:0 V:true
```

writing from node-parquet and reading with parquetjs:

```
{ name: true, quantity: '10', price: '2.5', date: Invalid Date, in_stock: true }
{ name: true, quantity: '10', price: '2.5', date: Invalid Date, in_stock: true }
{ name: true, quantity: '10', date: Invalid Date, in_stock: true }
```

tools:

```
row group 0
name:      BOOLEAN UNCOMPRESSED DO:0 FPO:4 SZ:28/28/1.00 VC:3 ENC:RLE,PLAIN
quantity:  INT64 UNCOMPRESSED DO:76 FPO:98 SZ:66/66/1.00 VC:3 ENC:PLAI [more]...
price:     DOUBLE UNCOMPRESSED DO:226 FPO:248 SZ:72/72/1.00 VC:3 ENC:P [more]...
date:      INT64 UNCOMPRESSED DO:379 FPO:401 SZ:66/66/1.00 VC:3 ENC:PL [more]...
in_stock:  BOOLEAN UNCOMPRESSED DO:0 FPO:525 SZ:28/28/1.00 VC:3 ENC:RLE,PLAIN

name TV=3 RL=0 DL=0
----------------------------------------------------------------------------
page 0:  DLE:RLE RLE:RLE VLE:PLAIN ST:[min: tr [more]... VC:3

quantity TV=3 RL=0 DL=0 DS: 1 DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------
page 0:  DLE:RLE RLE:RLE VLE:PLAIN_DICTIONARY [more]... VC:3

price TV=3 RL=0 DL=1 DS: 1 DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------
page 0:  DLE:RLE RLE:RLE VLE:PLAIN_DICTIONARY [more]... VC:3

date TV=3 RL=0 DL=0 DS: 1 DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------
page 0:  DLE:RLE RLE:RLE VLE:PLAIN_DICTIONARY [more]... VC:3

in_stock TV=3 RL=0 DL=0
----------------------------------------------------------------------------
page 0:  DLE:RLE RLE:RLE VLE:PLAIN ST:[min: tr [more]... VC:3

BOOLEAN name
row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:true
value 2: R:0 D:0 V:true
value 3: R:0 D:0 V:true

INT64 quantity
row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:10
value 2: R:0 D:0 V:10
value 3: R:0 D:0 V:10

DOUBLE price
row group 1 of 1, values 1 to 3
value 1: R:0 D:1 V:2.5
value 2: R:0 D:1 V:2.5
value 3: R:0 D:0 V:

INT64 date
row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:1534718580347
value 2: R:0 D:0 V:1534718580347
value 3: R:0 D:0 V:1534718580347

BOOLEAN in_stock
row group 1 of 1, values 1 to 3
value 1: R:0 D:0 V:true
value 2: R:0 D:0 V:true
value 3: R:0 D:0 V:true
```

sfescape commented 5 years ago

parquet-tools reads both, but parquetjs doesn't correctly read from parquet-cpp (node-parquet or dremio) and vice versa.

ZJONSSON commented 5 years ago

Can you upload a sample file created by parquet-cpp that parquetjs fails to read? That might help in debugging the discrepancy.

sfescape commented 5 years ago

parquet.tar.gz — one created by parquetjs and the other by node-parquet.

parquetjs gives invalid dates for the node-parquet file, and node-parquet gives undefined values for the parquetjs file (dremio just reports an error reading the names).

ZJONSSON commented 5 years ago

Thanks @sfescape - at least I fixed the invalid dates here: https://github.com/ZJONSSON/parquetjs/commit/c9579d283d9e09bb01927144378246070127a177

sfescape commented 5 years ago

Good that it validates the output of the node-parquet file (which is readable by dremio as well). Hopefully there are good clues to get compatibility going the other way as well.

sfescape commented 5 years ago

I wrote a larger file using parquetjs with optional fields and SNAPPY compression. I can read the file using parquetjs but not using parquet-tools or parquet-cpp. I'll see if I can reproduce with a smaller amount of data and data that I can post. This is probably not enough info to determine the issue, but I'm not sure what helps in debugging at this level.

The docker container I used is publicly available so you can run parquet-tools against any parquet file you generate via parquetjs.

Here is the error:

```
$ docker run --rm -v `pwd`:/data nathanhowell/parquet-tools meta --debug /data/0_0_0.parquet
java.io.IOException: Could not read footer: java.lang.ArrayIndexOutOfBoundsException: 7
    at org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:269)
    at org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:210)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooters(ParquetFileReader.java:327)
    at org.apache.parquet.tools.command.ShowMetaCommand.execute(ShowMetaCommand.java:62)
    at org.apache.parquet.tools.Main.main(Main.java:223)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 7
    at org.apache.parquet.bytes.BytesUtils.bytesToLong(BytesUtils.java:310)
    at org.apache.parquet.column.statistics.LongStatistics.setMinMaxFromBytes(LongStatistics.java:49)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatisticsInternal(ParquetMetadataConverter.java:349)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:360)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:816)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:793)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:502)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:461)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:437)
    at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:259)
    at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:255)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Could not read footer: java.lang.ArrayIndexOutOfBoundsException: 7
```

sfescape commented 5 years ago

Sorry, misclicked the button... :(

sfescape commented 5 years ago

So, I think this is related to optional columns which do not appear in the dataset at all. Parquet-tools is throwing errors in that case.

ZJONSSON commented 5 years ago

Interesting - if you use a compression other than SNAPPY do you still get the same errors? (if not, this might be SNAPPY related)

sfescape commented 5 years ago

It happens uncompressed as well.

sfescape commented 5 years ago

If you just add an optional timestamp field to the fruits example and run it (so the new field has no data), then run the docker command I gave above, you should see the issue.

This may be true of all datatypes; I haven't run them all through the same test.

Update: you don't get the error with the UTF8 datatype.

Affected datatypes:

BOOLEAN, DOUBLE, TIMESTAMP_MILLIS, INT32

Seems to be any datatype that needs a conversion? So, all except binary?
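For reference, the repro described above can be sketched with the writer API from the parquetjs README. This is an untested sketch: the field name `maybe_ts` and the output filename are made up for illustration, and it assumes parquetjs is installed locally.

```javascript
var parquet = require('parquetjs');

// Fruits-style schema plus one optional field that never receives data.
// Writing this file and running parquet-tools meta on it should
// reproduce the ArrayIndexOutOfBoundsException shown above.
var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  quantity: { type: 'INT64' },
  maybe_ts: { type: 'TIMESTAMP_MILLIS', optional: true }
});

async function writeRepro() {
  var writer = await parquet.ParquetWriter.openFile(schema, 'repro.parquet');
  await writer.appendRow({ name: 'apples', quantity: 10 }); // maybe_ts omitted
  await writer.close();
}

writeRepro();
```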

sfescape commented 5 years ago

Attached please find two files, both written with the following schema:

```js
{ name: { type: 'UTF8' }, my_boolean: { type: 'BOOLEAN', optional: true } }
```

One row written: `{ name: 'hello' }`

One written by node-parquet (parquet-cpp) can be read by all three readers (node-parquet, parquetjs, parquet-tools). The second can only be read by parquetjs. Hope this helps!

files.zip

sfescape commented 5 years ago

Ok, I've figured out that some of this is caused by parquetjs defaulting to V2 data pages. I don't really know what V2 pages are, but because there was an option useDataPagesV2 I assumed the default was not to use them, so I didn't test it. After reading the code I realized it was the default, and now (along with the date fix) node-parquet and parquetjs work much better together.

Dremio still has a problem reading parquetjs written files when optional fields are present. I'll continue to research that.

I'm not clear on why data pages V2 is the default, and suggest perhaps it should not be, for compatibility with other systems.

The only remaining compatibility issue I'm seeing is that node-parquet and parquetjs can read files with optional columns that contain no data, but parquet-tools and dremio cannot.
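For anyone working around this in the meantime, V2 data pages can be disabled when opening the writer. This is an untested sketch; the option name follows the useDataPagesV2 flag mentioned above, so check the writer source for the exact spelling.

```javascript
var parquet = require('parquetjs');

var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' }
});

async function writeV1File() {
  // Passing useDataPagesV2: false (name assumed from the discussion above)
  // writes classic DATA_PAGE pages, which parquet-cpp based readers
  // such as node-parquet and dremio understand.
  var writer = await parquet.ParquetWriter.openFile(schema, 'out.parquet', {
    useDataPagesV2: false
  });
  await writer.appendRow({ name: 'hello' });
  await writer.close();
}

writeV1File();
```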

sfescape commented 5 years ago

This is where things go wrong in parquet-mr:

```java
if (statistics.isSetMax() && statistics.isSetMin()) {
  stats.setMinMaxFromBytes(statistics.min.array(), statistics.max.array());
}
```

It thinks min and max are set, and tries to read their bytes (which I expect are not defined in the parquetjs output).

sfescape commented 5 years ago

I figured out the issue. When there are only nulls in the column, parquetjs writes an empty buffer rather than null for the min/max column statistics.

Setting statistics.min_value and statistics.max_value to null when the encoded buffer length is zero lets all the libraries read the output correctly.
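A minimal sketch of that fix (the helper name is hypothetical; in parquetjs the real change would live where column statistics are encoded): when a column contains only nulls, the encoded min/max come back as zero-length buffers, parquet-mr then calls setMinMaxFromBytes on them and throws. Nulling them out instead leaves the bounds unset in the thrift metadata.

```javascript
// Hypothetical helper illustrating the fix: drop zero-length min/max
// buffers so readers never try to decode empty statistics bytes.
function fixEmptyStatistics(statistics) {
  if (statistics.min_value && statistics.min_value.length === 0) {
    statistics.min_value = null;
  }
  if (statistics.max_value && statistics.max_value.length === 0) {
    statistics.max_value = null;
  }
  return statistics;
}

// An all-null column: both encoded bounds are empty buffers.
const stats = fixEmptyStatistics({
  min_value: Buffer.alloc(0),
  max_value: Buffer.alloc(0),
  null_count: 3
});
console.log(stats.min_value, stats.max_value); // null null
```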

ZJONSSON commented 5 years ago

Awesome!!!! Do you want to submit a PR?

sfescape commented 5 years ago

I can change it, but I don't know how I would write a test that fails without the change, since it only shows up when using these other libraries. This function (`encodeStatisticsValue`) is also used to encode column statistics, and while that's not failing with the other libraries, I'm not sure whether there's a similar case there under some circumstance.

Right now I'm looking at the Data Page V2 code to see if I can figure out why it doesn't work with parquet-cpp when turned on.

sfescape commented 5 years ago

Update: parquet-cpp didn't read the statistics because, I think, it just won't when using UTF8. Switching my example to int64 did get the statistics read. Still, the values are not read properly, so I still don't see why data page V2 doesn't work with parquet-cpp.

sfescape commented 5 years ago

I found the root cause: parquet-cpp just doesn't support DATA_PAGE_V2. So I added compression support for DATA_PAGE and just won't use DATA_PAGE_V2.

I suggest that, since not all of the major libraries support DATA_PAGE_V2 yet, the default should be DATA_PAGE.

ZJONSSON commented 5 years ago

This is very helpful @sfescape. It is a little bit annoying that parquet-cpp does not throw an exception upon encountering DATA_PAGE_V2 (maybe we should raise an issue or submit a PR). Are there any other major libraries with the same limitation? To my understanding, the majority of users are on parquet-mr (with Spark, for example).

At the very least we should clarify the limitations of parquet-cpp in the README of parquetjs. We could change the default, but I'm not sure that would be ideal.

sfescape commented 5 years ago

I only know parquet-mr and parquet-cpp, so I don't know about other libraries. parquet-mr was fine; it was parquet-cpp that seems to be lagging. I know they've had some trouble getting changes released, and they are voting right now on combining the arrow and parquet repositories to make that better.

I couldn't even find the DATA_PAGE_V2 vs DATA_PAGE specifications; the documentation is pretty sparse or very hard to find. I haven't looked at C++ code for many years, so figuring out that it could read the thrift properly but dropped the data at a higher level took a lot of time.

In the end, I was able to figure out that the file written by parquet-cpp used DATA_PAGE, and from there figured out how to add DATA_PAGE compression to parquetjs. That's the most important part for me: I can now write compressed files that can be read by parquet-cpp based tools.

ZJONSSON commented 5 years ago

Thank you @sfescape - would you mind updating the README with a quick note to help anyone who runs into the same problems with parquet-cpp?

ZJONSSON commented 5 years ago

Thanks for all the work here @sfescape - I think it's safe to close this issue now, feel free to reopen as needed and please report any issues or improvements that would make sense!