Closed vmarchaud closed 2 months ago
First off, thanks for the test and including the file!
I would expect those to come out as buffers as well right now. They are FIXED_LEN_BYTE_ARRAY
under the hood.
The other output option would be strings as JS only supports up to 53 bit numbers.
Looks like the issue is because this file uses a dictionary and dictionaries get a "toString" (wrongly) applied: https://github.com/LibertyDSNP/parquetjs/blame/91fc71f262c699fdb5be50df2e0b18da8acf8e19/lib/reader.ts#L948
However removing that looks like it causes some other tests to fail, so some version of that is needed for some values.
All the failing tests however are in the test-files.js
test, so perhaps some of them are wrong? I might be able to take a deeper look in a few weeks, but perhaps that is enough that you can find the deeper issue faster than I will be able to.
I've started to have a look into the FIXED_LEN_BYTE_ARRAY
handling as I've been looking into using this library and found the same issue with decimal parsing.
If you remove that toString
call on the dictionary decoding the test failures seem to be just in the test-files.js
:
binary_field
now gets returned as a buffer, and the long_field
as a BigInt. Both were previously strings.I'm not familiar with the parquet format details and what the distinction is between the DATA_PAGE
, DATA_PAGE_V2
and DICTIONARY_PAGE
pagetypes, so I'm not sure whether the other routes were returning non-string data. I'd imagine it's a breaking change to update how the data is returned for this dictionary page case though, and what the wider impacts would be to any of the sample file parsing.
If getting rid of the toString
conversion seems like the way to go, I'm happy to raise a PR that updates some of these tests to use the underlying type to see the change.
It could be worth adding an equivalent test like the test.parquet
that checks the data gets returned as the expected types, but I'd need to figure out how to make a parquet file that uses a dictionary for that.
Did a bit more digging, and saw the parquet-testing files. It looks like this library is decoding parquet files with dictionary pages differently to non-dictionary pages.
If you compare alltypes_tiny_pages_plain.parquet
and alltypes_tiny_pages.parquet
, one gets returned with string values and the other gets returned with non-string values.
I just added a console.log(record)
to the read-all.test.ts
after filtering down to these two files:
The first record from alltypes_tiny_pages.parquet
{
id: 122,
bool_col: true,
tinyint_col: '2',
smallint_col: '2',
int_col: '2',
bigint_col: '20',
float_col: '2.200000047683716',
double_col: '20.2',
date_string_col: '01/13/09',
string_col: '2',
timestamp_col: '3725410000000',
year: '2009',
month: '1'
}
The first record from alltypes_tiny_pages_plain.parquet
{
id: 122,
bool_col: true,
tinyint_col: 2,
smallint_col: 2,
int_col: 2,
bigint_col: 20n,
float_col: 2.200000047683716,
double_col: 20.2,
date_string_col: '01/13/09',
string_col: '2',
timestamp_col: 3725410000000,
year: 2009,
month: 1
}
And looks like the former included a dictionary page and the latter did not.
And then re-running it without that toString
mapping makes them give the same result.
Should be fixed in https://github.com/LibertyDSNP/parquetjs/releases/tag/v1.8.0
Thanks for reporting an issue!
Steps to reproduce
Schema:
File: e2e_datasources.bigquery_test_c40ff3c5-03f4-4213-9d52-fc62e71af0ed_1710089629994_file-000000000000.parquet.gz
Expected behaviour
We should decode age as number or at least as a buffer
Actual behaviour
Any other comments?
I'm not familiar with parquet encodings, actually started working with it this afternoon so i might be doing something wrong. I would expect to have the number decoded from decimal however i've seen in other tests that since decimal are encoded as a FIXED_LEN_BYTE_ARRAY in my case it should be decoded as a buffer but that's not the case either.