davidtsai commented 10 months ago

Problem

When reading parquet files:

INT96 columns are not being parsed correctly and are being truncated.
Files with large DECIMAL columns stored as FIXED_LEN_BYTE_ARRAY are not supported.

Solution

Parses INT96 columns into a BigInt to return untruncated values.
Adds support for parsing FIXED_LEN_BYTE_ARRAY columns into arbitrary-precision DECIMALs.
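
As a rough illustration of the DECIMAL part of this change: the Parquet format stores a DECIMAL in a FIXED_LEN_BYTE_ARRAY as the unscaled value in big-endian two's complement, with the scale taken from the column metadata. The sketch below is not the code in this PR; the helper names `bytesToSignedBigInt` and `formatDecimal` are hypothetical and only show one way the bytes could be turned into an untruncated value.

```typescript
// Hypothetical helpers for illustration; not part of the parquetjs API.

// Interpret a big-endian two's-complement byte array as a signed bigint.
function bytesToSignedBigInt(bytes: Uint8Array): bigint {
  let value = BigInt(0);
  for (let i = 0; i < bytes.length; i++) {
    value = (value << BigInt(8)) | BigInt(bytes[i]);
  }
  // If the sign bit of the first byte is set, the value is negative:
  // subtract 2^(8 * length) to undo the two's-complement encoding.
  if (bytes.length > 0 && (bytes[0] & 0x80) !== 0) {
    value -= BigInt(1) << BigInt(8 * bytes.length);
  }
  return value;
}

// Render an unscaled integer with `scale` digits after the decimal point.
function formatDecimal(unscaled: bigint, scale: number): string {
  const negative = unscaled < BigInt(0);
  const digits = (negative ? -unscaled : unscaled).toString();
  const sign = negative ? '-' : '';
  if (scale === 0) {
    return sign + digits;
  }
  // Left-pad so there is at least one digit before the decimal point.
  const padded = digits.padStart(scale + 1, '0');
  const intPart = padded.slice(0, padded.length - scale);
  const fracPart = padded.slice(padded.length - scale);
  return `${sign}${intPart}.${fracPart}`;
}

// 0x00003039 is 12345; with scale 2 that reads as 123.45.
console.log(formatDecimal(bytesToSignedBigInt(new Uint8Array([0x00, 0x00, 0x30, 0x39])), 2));
```

Returning a string (or a bigint plus scale) keeps every digit, whereas converting to a native number would round anything beyond 53 bits of precision.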

Change summary:

Tidy, well formulated commit message
Another great commit message
Something else I/we did

Steps to Verify:

A setup step / beginning state
What to do next
Any other instructions
Expected behavior
Suggestions for testing

davidtsai commented 10 months ago

This PR likely still needs tests, etc. to be mergeable, but it contains a proof of concept for parsing parquet files with large integers/decimals properly.

keen85 commented 10 months ago

@davidtsai, will this PR also bring support for parquet files where TIMESTAMP columns are encoded as INT96? For an example file, see: part-00000-43831db6-19d5-4964-a8c8-cb8d6d1664b3-c000.snappy.parquet

davidtsai commented 10 months ago

@keen85 it's halfway there to being able to support it. Right now I'm handling it in our application code:

import { DateTime } from 'luxon';

// Converts a Parquet INT96 timestamp (already parsed into a bigint) into a Luxon DateTime.
// Assumes the 12 bytes were read as a little-endian integer.
function parquetInt96DateToLuxon(int96: bigint, timezone?: string): DateTime {
  // Extract nanoseconds within the day and the Julian day number
  const nanoseconds = int96 & BigInt('0xFFFFFFFFFFFFFFFF'); // low 64 bits (first 8 bytes)
  const julianDay = int96 >> BigInt(64); // high 32 bits (last 4 bytes)

  // Julian day number for Unix epoch (January 1, 1970)
  const unixEpochJulianDay = BigInt(2440588);

  // Calculate the difference in days between the Julian day and the Unix epoch
  const daysSinceEpoch = julianDay - unixEpochJulianDay;

  // Convert days to milliseconds
  // 86400000 milliseconds in a day
  const millisecondsSinceEpoch = daysSinceEpoch * BigInt(86400000);

  // Convert nanoseconds to milliseconds (sub-millisecond precision is dropped)
  // and add to the Unix timestamp
  const totalMilliseconds = millisecondsSinceEpoch + (nanoseconds / BigInt(1000000));

  // Create a DateTime object in UTC
  const date = DateTime.fromMillis(Number(totalMilliseconds), { zone: 'utc' });

  if (timezone) {
    // Switch to the specified timezone, keeping the same wall-clock time
    return date.setZone(timezone, { keepLocalTime: true });
  }
  return date;
}

I'm not sure when we can reliably assume that an INT96 column in a parquet file is a date. If there is a documented convention for that, it would be easy enough to add the above function to this library itself.
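
For readers who want to check the arithmetic without Luxon, here is a minimal dependency-free sketch of the same conversion. It assumes the layout used by the function above (nanoseconds of the day in the first 8 bytes, Julian day in the last 4, little-endian); `int96FromBytes` and `int96ToEpochMillis` are hypothetical names, not library functions.

```typescript
// Hypothetical helpers for illustration; not part of the parquetjs API.
const UNIX_EPOCH_JULIAN_DAY = BigInt(2440588); // Julian day of 1970-01-01
const MS_PER_DAY = BigInt(86400000);
const NS_PER_MS = BigInt(1000000);

// Read 12 little-endian bytes into an unsigned bigint.
function int96FromBytes(bytes: Uint8Array): bigint {
  if (bytes.length !== 12) {
    throw new Error('INT96 values are exactly 12 bytes');
  }
  let value = BigInt(0);
  // Walk from the most significant byte (index 11) down to the least.
  for (let i = 11; i >= 0; i--) {
    value = (value << BigInt(8)) | BigInt(bytes[i]);
  }
  return value;
}

// Split the value into Julian day and nanoseconds of day, then convert
// to milliseconds since the Unix epoch (sub-millisecond part is dropped).
function int96ToEpochMillis(int96: bigint): number {
  const nanosOfDay = int96 & BigInt('0xFFFFFFFFFFFFFFFF');
  const julianDay = int96 >> BigInt(64);
  const days = julianDay - UNIX_EPOCH_JULIAN_DAY;
  return Number(days * MS_PER_DAY + nanosOfDay / NS_PER_MS);
}

// Julian day 2440588 is 0x00253D8C, stored little-endian as 8C 3D 25 00.
const epochBytes = new Uint8Array([0, 0, 0, 0, 0, 0, 0, 0, 0x8c, 0x3d, 0x25, 0x00]);
console.log(new Date(int96ToEpochMillis(int96FromBytes(epochBytes))).toISOString());
// → 1970-01-01T00:00:00.000Z
```

The final `Number(...)` call is safe here because millisecond timestamps for any realistic date fit well within 53 bits; only the raw 96-bit value needs a bigint.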

keen85 commented 10 months ago

@davidtsai awesome, thanks for your work!

If there is documented convention for that

https://issues.apache.org/jira/browse/PARQUET-323: "INT96 is only used to represent nanosec timestamp in Impala for some historical reasons, and should be deprecated"
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md: INT96 does not appear any more officially in parquet-docs. The code still contains it, also with hint to deprecation: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L36
popular data processing engines that still use/support INT96 for TIMESTAMP encoding:

...not sure if this is proof enough 😅

wilwade commented 10 months ago

@davidtsai Thanks for the PR!

Enabled tests, and seeing some fail, but looking closer, I think they might have been bad tests before perhaps? Or might be a mix. Tests should now run for this PR automatically. (At least I think that's what I told GitHub)

Let me know if you want/need some help, although you likely understand this part of the codebase better than I do now.

davidtsai commented 10 months ago

Yes, the tests were looking for strings, so they would need to be updated to match numbers and bigints. It is a breaking API change, but in my opinion not always returning strings is the correct behavior. I can work on this PR more in about a week to get it over the finish line if that would be helpful!

David

On Thu, Jan 18, 2024 at 11:57 AM Wil Wade @.***> wrote:

@.**** commented on this pull request.

In lib/reader.ts https://github.com/LibertyDSNP/parquetjs/pull/109#discussion_r1457919099 :

@@ -874,9 +874,13 @@ async function decodeDictionaryPage(cursor: Cursor, header: parquet_thrift.PageH }; }

return decodeValues(opts.column!.primitiveType!, opts.column!.encoding!, dictCursor, (header.dictionary_page_header!).num_values, opts)

.map((d:Array) => d.toString());

I think removing this has caused some of the test errors


wilwade commented 10 months ago

@davidtsai JavaScript numbers can only represent integers exactly up to 53 bits, so doesn't a 96-bit value need to be something other than a native number?
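
A small sketch of the limit in question, using only standard JavaScript built-ins: a value just above 64 bits survives as a bigint but is rounded when converted to a native number. The variable names are illustrative only.

```typescript
// 2^64 + 1 needs 65 bits, well beyond what a double can hold exactly.
const wideValue = (BigInt(1) << BigInt(64)) + BigInt(1);

// Converting to a native number rounds to the nearest representable double.
const asNumber = Number(wideValue);
const roundTripped = BigInt(asNumber);

console.log(Number.MAX_SAFE_INTEGER);    // 9007199254740991, i.e. 2^53 - 1
console.log(wideValue.toString());       // 18446744073709551617
console.log(roundTripped === wideValue); // false: the low bit was lost
```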

wilwade commented 9 months ago

Closing as stale. Please reopen if this is not so.

LibertyDSNP / parquetjs

Add support for INT96 columns and large decimal columns #109
