Closed davidtsai closed 9 months ago
This PR likely still needs tests etc. to be mergeable, but it contains a proof of concept for parsing parquet files with large integers/decimals properly.
@davidtsai, will this PR also bring support for parquet files where TIMESTAMP columns are encoded as INT96?
For an example file, see: part-00000-43831db6-19d5-4964-a8c8-cb8d6d1664b3-c000.snappy.parquet
@keen85 it's halfway there to being able to support it. Right now I'm handling it in our application code:
import { DateTime } from 'luxon';

function parquetInt96DateToLuxon(int96: bigint, timezone?: string) {
  // Extract nanoseconds and Julian day number
  const nanoseconds = int96 & BigInt('0xFFFFFFFFFFFFFFFF'); // low 64 bits: first 8 bytes
  const julianDay = int96 >> BigInt(64); // high 32 bits: last 4 bytes

  // Julian day number for Unix epoch (January 1, 1970)
  const unixEpochJulianDay = BigInt(2440588);

  // Calculate the difference in days between the Julian day and the Unix epoch
  const daysSinceEpoch = julianDay - unixEpochJulianDay;

  // Convert days to milliseconds (86400000 milliseconds in a day)
  const millisecondsSinceEpoch = daysSinceEpoch * BigInt(86400000);

  // Convert nanoseconds to milliseconds and add to the Unix timestamp
  const totalMilliseconds = millisecondsSinceEpoch + (nanoseconds / BigInt(1000000));

  // Create a DateTime object in UTC
  const date = DateTime.fromMillis(Number(totalMilliseconds), { zone: 'utc' });

  if (timezone) {
    // Switch to the specified timezone; keepLocalTime keeps the same wall-clock time
    return date.setZone(timezone, { keepLocalTime: true });
  }
  return date;
}
I'm not sure when we can reliably assume that an INT96 column in a parquet file is a date. If there is a documented convention for that, it would be easy enough to add the above function to this library itself.
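The byte layout that the comments in the function above refer to ("first 8 bytes", "last 4 bytes") can also be sketched directly on the raw 12-byte value. This is only a sketch: `int96BytesToDate` is a hypothetical helper, not part of parquetjs, and it assumes the commonly used little-endian layout (nanoseconds within the day, then the Julian day number). It returns a plain `Date` so that it has no dependencies.

```typescript
// Hypothetical helper for illustration; it is not part of parquetjs.
const JULIAN_DAY_OF_UNIX_EPOCH = 2440588; // Julian day number of 1970-01-01
const MS_PER_DAY = 86400000;

function int96BytesToDate(bytes: Uint8Array): Date {
  if (bytes.length !== 12) {
    throw new RangeError("expected 12 bytes, got " + bytes.length);
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, 12);

  // Assumed layout: bytes 0-7 hold the nanoseconds within the day (little-endian).
  const nanosLow = view.getUint32(0, true);
  const nanosHigh = view.getUint32(4, true);
  // Assumed layout: bytes 8-11 hold the Julian day number (little-endian).
  const julianDay = view.getUint32(8, true);

  // Nanoseconds within a day stay below 8.64e13, so a plain number is exact here.
  const nanosOfDay = nanosHigh * 4294967296 + nanosLow;

  const ms =
    (julianDay - JULIAN_DAY_OF_UNIX_EPOCH) * MS_PER_DAY +
    Math.floor(nanosOfDay / 1e6);
  return new Date(ms);
}

// Usage: a value with zero nanoseconds, one day after the Unix epoch.
const sample = new Uint8Array(12);
new DataView(sample.buffer).setUint32(8, 2440589, true);
console.log(int96BytesToDate(sample).toISOString()); // 1970-01-02T00:00:00.000Z
```

Reading the value as three 32-bit words avoids `bigint` entirely, which works because both parts fit comfortably in a double once they are split.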
@davidtsai awesome, thanks for your work!
"If there is documented convention for that"
INT96 does not appear any more officially in parquet-docs. The code still contains it, also with a hint to its deprecation: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L36
INT96 for TIMESTAMP encoding:
...not sure if this is proof enough 😅
@davidtsai Thanks for the PR!
I enabled the tests and am seeing some fail, but looking closer, I think they may have been bad tests to begin with? Or it might be a mix. Tests should now run for this PR automatically. (At least I think that's what I told GitHub.)
Let me know if you want/need some help, although you likely understand this part of the codebase better than I do now.
Yes, the tests were looking for strings, so they would need to be updated to match numbers and bigints. It is a breaking API change, but in my opinion not always returning strings is the correct behavior. I can work on this PR more in about a week to get it over the finish line if that would be helpful!
On Thu, Jan 18, 2024 at 11:57 AM Wil Wade @.***> wrote:
@.**** commented on this pull request.
In lib/reader.ts https://github.com/LibertyDSNP/parquetjs/pull/109#discussion_r1457919099 :
@@ -874,9 +874,13 @@ async function decodeDictionaryPage(cursor: Cursor, header: parquet_thrift.PageH
   };
 }
-  return decodeValues(opts.column!.primitiveType!, opts.column!.encoding!, dictCursor, (header.dictionary_page_header!).num_values, opts)
-    .map((d:Array) => d.toString());
I think removing this has caused some of the test errors
@davidtsai JavaScript numbers can only represent integers exactly up to 53 bits, so doesn't a 96-bit value need to be something other than a native number?
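To make the 53-bit limit concrete, here is a small sketch of what happens when a large integer is forced into a native number, along with one possible way to return a number only when that is lossless. `toNumberIfSafe` is a hypothetical helper made up for this illustration; it is not what the PR or parquetjs does.

```typescript
// Hypothetical helper for illustration; it is not part of parquetjs.
const MAX_SAFE = BigInt(Number.MAX_SAFE_INTEGER); // 2^53 - 1
const MIN_SAFE = BigInt(Number.MIN_SAFE_INTEGER); // -(2^53 - 1)

// Return a plain number when the value is exactly representable,
// otherwise keep the bigint so no digits are lost.
function toNumberIfSafe(value: bigint): number | bigint {
  return value >= MIN_SAFE && value <= MAX_SAFE ? Number(value) : value;
}

const justPastSafe = BigInt("9007199254740993"); // 2^53 + 1

console.log(Number(justPastSafe)); // 9007199254740992, the last digit is lost
console.log(typeof toNumberIfSafe(BigInt(42))); // "number"
console.log(typeof toNumberIfSafe(justPastSafe)); // "bigint"
```

In the INT96 function earlier in the thread, the value stays a bigint until it has been reduced to milliseconds since the epoch, which for realistic dates is well inside the safe range, so the final `Number(...)` call there does not lose precision.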
Closing as stale. Please reopen if this is not so.
Problem
When reading parquet files:
Solution
Change summary:
Steps to Verify: