LibertyDSNP / parquetjs

Fully asynchronous, pure JavaScript implementation of the Parquet file format with additional features
MIT License
43 stars 24 forks source link

invalid encoding: RLE_DICTIONARY #96

Closed ampledata closed 5 months ago

ampledata commented 11 months ago

I'm not sure if there's a problem with the parquet data I'm using, or if this is a bug in the library, but filing anyway.

Steps to reproduce

  1. Create a parquet file with RLE_DICTIONARY encoding.
  2. Parse the file with the reader example: https://github.com/LibertyDSNP/parquetjs/blob/main/examples/reader.js

Expected behaviour

Parquet file should be written to the console (in JSON?).

Actual behaviour

Node raises an exception.

Any logs, error output, etc?

node:internal/process/promises:288
            triggerUncaughtException(err, true /* fromPromise */);
            ^

[UnhandledPromiseRejection: This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason "invalid encoding: RLE_DICTIONARY".] {
  code: 'ERR_UNHANDLED_REJECTION'
}

Node.js v18.17.0

Any other comments?

parquet-tools, which uses the same parquet.thrift as parquetjs, parses the file OK.

From what I can tell, https://github.com/LibertyDSNP/parquetjs/blob/main/lib/reader.ts#L704 attempts to load the codec for RLE_DICTIONARY from the parquet_codec hash, as imported via import * as parquet_codec from './codec';.

JoeCap08055 commented 11 months ago

Can you provide your test file?

ampledata commented 11 months ago

Location.parquet.zip

JoeCap08055 commented 11 months ago

I believe parquetjs does not yet support Parquet 2.0 with RLE_DICTIONARY.

ampledata commented 11 months ago

Thanks. Any idea if LoE or possible paths to take to add this, so I could look into adding this myself?

JoeCap08055 commented 10 months ago

No idea, my first time looking at this repo. @wilwade ?

wilwade commented 10 months ago

@ampledata

First, I want to thank you for asking how to help. ❤️

I don't think the effort is too large.

  1. We can likely separate out the read from the write side, so focusing on the read side first is best (even if you want to do both).
  2. Your notes are correct. The lib/codec/index.ts file needs to also have a line like this: export * as RLE_DICTIONARY from './rle_dictionary'
  3. Of course that means we also need: lib/codec/rle_dictionary.ts
  4. I think it will not be that far off from parts of lib/codec/rle.tsand parts of lib/codec/plain_dictionary.ts
  5. In theory the reference docs are here: https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8

How does that sound as a starting point?

saritvakrat commented 5 months ago

I encounter this as well when using the V2, any suggestion how to bypass this? I only need to read the file

wilwade commented 5 months ago

@saritvakrat The read (or write) portion for this isn't there, and I don't think anyone is working on it currently. If you have an example file you can share, that would be helpful when someone does tackle it.

If you are interested in adding support, I'd add a simplist possible file to the test/test-files.js that has the encoding. Then work through the read side with that as your guide.

saritvakrat commented 5 months ago

@wilwade Thank you for the quick response. I can't attach an example file due to the sensitivity of the information. However, this is the metadata for the file: file written by pyarrow 11.0.0 created_by: parquet-cpp-arrow version 11.0.0 num_columns: 6 num_rows: 42 num_row_groups: 1 format_version: 2.6 serialized_size: 3975

In my case I only need to read the file (consumed from S3 bucket). Is it possible to add support to the read part?

Many thanks!

bmmeijers commented 5 months ago

I've tried the following to get support for RLE_DICT.

Add inside (codec/index.ts):

export * as RLE_DICTIONARY from './plain_dictionary'

and inside codec/index.js

exports.RLE_DICTIONARY = require("./plain_dictionary");

It seems to work with Location.parquet.zip (Originally posted by @ampledata in https://github.com/LibertyDSNP/parquetjs/issues/96#issuecomment-1668512325).

See also https://github.com/aloneguid/parquet-dotnet/issues/107, where a similar fix was applied to a c# library for reading parquet.

saritvakrat commented 5 months ago

@bmmeijers Thank you it works like this! but every time someone will run npm install and they wont have this package this will not be stored.. I need a permeant solution. Any idea how can I make those files not to change?

bmmeijers commented 5 months ago

I think it would require adding the suggested changes, and possibly also adding some tests / doing tests with files that have this encoding to see if it works as expected.

Then, a fix needs to be made available, e.g. as pull request (if you have the capability to do this yourself, that would be possible, it's an open source project) and the maintainers ( @wilwade ) need to accept that. Not sure if they have extra ideas on this...

saritvakrat commented 5 months ago

Added PR thanks @bmmeijers https://github.com/LibertyDSNP/parquetjs/pull/112

wilwade commented 5 months ago

Closed by #112

wilwade commented 5 months ago

Release: https://github.com/LibertyDSNP/parquetjs/releases/tag/v1.4.1