Closed: ampledata closed this issue 5 months ago
Can you provide your test file?
I believe parquetjs does not yet support Parquet 2.0 with RLE_DICTIONARY.
Thanks. Any idea of the LoE (level of effort), or of possible paths to take to add this, so I could look into adding it myself?
No idea, my first time looking at this repo. @wilwade ?
@ampledata
First, I want to thank you for asking how to help. ❤️
I don't think the effort is too large.
The lib/codec/index.ts file also needs to have a line like this:
export * as RLE_DICTIONARY from './rle_dictionary'
lib/codec/rle_dictionary.ts
lib/codec/rle.ts
and parts of lib/codec/plain_dictionary.ts
How does that sound as a starting point?
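For anyone picking this up, here is a minimal sketch of what the read side would have to do. It is based on my reading of the Parquet encoding spec, not on parquetjs internals: a dictionary-encoded data page starts with a one-byte bit width, followed by RLE/bit-packed hybrid runs of dictionary indices. The function names (readVarint, decodeDictionaryIndices) are hypothetical, and the code is standalone rather than the library's API.

```typescript
// Reads an unsigned LEB128 varint and advances the cursor.
function readVarint(buf: Uint8Array, pos: { i: number }): number {
  let result = 0;
  let shift = 0;
  while (true) {
    const byte = buf[pos.i++];
    result |= (byte & 0x7f) << shift;
    if ((byte & 0x80) === 0) return result;
    shift += 7;
  }
}

// Decodes `count` dictionary indices from a data page body laid out as:
// [bitWidth: 1 byte][RLE / bit-packed hybrid runs...]
function decodeDictionaryIndices(page: Uint8Array, count: number): number[] {
  const bitWidth = page[0];
  const byteWidth = Math.ceil(bitWidth / 8);
  const pos = { i: 1 };
  const out: number[] = [];

  while (out.length < count && pos.i < page.length) {
    const header = readVarint(page, pos);
    if ((header & 1) === 0) {
      // RLE run: one little-endian value repeated (header >> 1) times.
      const runLength = header >>> 1;
      let value = 0;
      for (let b = 0; b < byteWidth; b++) value |= page[pos.i++] << (8 * b);
      for (let k = 0; k < runLength && out.length < count; k++) out.push(value);
    } else {
      // Bit-packed run: (header >> 1) groups of 8 values, packed LSB first.
      const groups = header >>> 1;
      let bitPos = pos.i * 8;
      for (let k = 0; k < groups * 8 && out.length < count; k++) {
        let value = 0;
        for (let b = 0; b < bitWidth; b++, bitPos++) {
          value |= ((page[bitPos >> 3] >> (bitPos & 7)) & 1) << b;
        }
        out.push(value);
      }
      pos.i += groups * bitWidth; // each group of 8 values occupies bitWidth bytes
    }
  }
  return out;
}

// Usage: bit width 2, an RLE run of three 1s, then one bit-packed group 0,1,2,3,0,1,2,3.
const examplePage = new Uint8Array([2, 6, 1, 3, 0xe4, 0xe4]);
const dictionary = ["a", "b", "c", "d"];
const values = decodeDictionaryIndices(examplePage, 11).map((i) => dictionary[i]);
console.log(values.join("")); // prints "bbbabcdabcd"
```

The dictionary lookup in the last step is the part that overlaps with a plain dictionary codec; the run decoding is the part that overlaps with an RLE codec.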
I encounter this as well when using V2. Any suggestion on how to bypass it? I only need to read the file.
@saritvakrat The read (or write) portion for this isn't there, and I don't think anyone is working on it currently. If you have an example file you can share, that would be helpful when someone does tackle it.
If you are interested in adding support, I'd add the simplest possible file that has the encoding to the test/test-files.js. Then work through the read side with that as your guide.
@wilwade Thank you for the quick response. I can't attach an example file due to the sensitivity of the information. However, this is the metadata for the file, which was written by pyarrow 11.0.0:
created_by: parquet-cpp-arrow version 11.0.0
num_columns: 6
num_rows: 42
num_row_groups: 1
format_version: 2.6
serialized_size: 3975
In my case I only need to read the file (consumed from S3 bucket). Is it possible to add support to the read part?
Many thanks!
I've tried the following to get support for RLE_DICT.
Add inside codec/index.ts:
export * as RLE_DICTIONARY from './plain_dictionary'
and inside codec/index.js:
exports.RLE_DICTIONARY = require("./plain_dictionary");
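Why this one-line alias has an effect can be illustrated with a small standalone sketch. It assumes, as the issue description suggests, that the reader selects a codec by looking up the encoding name on the exported codec object. The Codec type, the codecs map, and the decodeValues signature below are simplified stand-ins, not the real parquetjs modules.

```typescript
// Hypothetical, simplified codec shape: turn dictionary indices into values.
type Codec = {
  decodeValues: (indices: number[], dictionary: string[]) => string[];
};

const plainDictionary: Codec = {
  decodeValues: (indices, dictionary) => indices.map((i) => dictionary[i]),
};

// Stand-in for the object exported by codec/index.ts.
const codecs = new Map<string, Codec>([["PLAIN_DICTIONARY", plainDictionary]]);

// Equivalent in spirit to:
//   export * as RLE_DICTIONARY from './plain_dictionary'
codecs.set("RLE_DICTIONARY", plainDictionary);

function decode(encoding: string, indices: number[], dictionary: string[]): string[] {
  const codec = codecs.get(encoding);
  if (!codec) throw new Error(`no codec registered for ${encoding}`);
  return codec.decodeValues(indices, dictionary);
}

// Both names now resolve to the same decoder.
console.log(decode("RLE_DICTIONARY", [1, 0, 1], ["x", "y"])); // [ 'y', 'x', 'y' ]
```

The alias only makes the lookup succeed; whether the decoded values are correct still depends on the pages being laid out the way the plain dictionary decoder expects, which is why testing against files with this encoding matters.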
It seems to work with Location.parquet.zip (Originally posted by @ampledata in https://github.com/LibertyDSNP/parquetjs/issues/96#issuecomment-1668512325).
See also https://github.com/aloneguid/parquet-dotnet/issues/107, where a similar fix was applied to a C# library for reading Parquet.
@bmmeijers Thank you, it works like this! But every time someone runs npm install without already having this package, the change will not be there. I need a permanent solution. Any idea how I can keep those files from changing?
I think it would require adding the suggested changes, and possibly also adding some tests / doing tests with files that have this encoding to see if it works as expected.
Then a fix needs to be made available, e.g. as a pull request (if you are able to do this yourself, that is possible, since it's an open source project), and the maintainers (@wilwade) need to accept it. Not sure if they have extra ideas on this...
Added a PR, thanks @bmmeijers: https://github.com/LibertyDSNP/parquetjs/pull/112
Closed by #112
I'm not sure if there's a problem with the parquet data I'm using, or if this is a bug in the library, but filing anyway.
Steps to reproduce
Expected behaviour
Parquet file should be written to the console (in JSON?).
Actual behaviour
Node raises an exception.
Any logs, error output, etc?
Any other comments?
parquet-tools, which uses the same parquet.thrift as parquetjs, parses the file OK.
From what I can tell, https://github.com/LibertyDSNP/parquetjs/blob/main/lib/reader.ts#L704 attempts to load the codec for RLE_DICTIONARY from the parquet_codec hash, as imported via: import * as parquet_codec from './codec';
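A minimal sketch of the failure mode described above, under the assumption that the reader indexes a codec table by encoding name. The table contents, the resolveCodec helper, and the error message are all hypothetical and only mirror the shape of the lookup, not the actual code in reader.ts.

```typescript
// Hypothetical codec interface and table; the real modules export more than this.
interface Codec {
  decodeValues(bytes: number[]): number[];
}

const parquetCodec: Record<string, Codec | undefined> = {
  PLAIN: { decodeValues: (bytes) => bytes },
  PLAIN_DICTIONARY: { decodeValues: (bytes) => bytes },
  // No RLE_DICTIONARY entry, which is the situation reported in this issue.
};

function resolveCodec(encoding: string): Codec {
  const codec = parquetCodec[encoding];
  if (codec === undefined) {
    // Without a guard like this, calling codec.decodeValues would raise a TypeError.
    throw new Error(`unsupported encoding: ${encoding}`);
  }
  return codec;
}

try {
  resolveCodec("RLE_DICTIONARY");
} catch (err) {
  console.log((err as Error).message); // unsupported encoding: RLE_DICTIONARY
}
```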