LibertyDSNP / parquetjs

Fully asynchronous, pure JavaScript implementation of the Parquet file format with additional features
MIT License
55 stars 25 forks source link

Add support for INT96 columns and large decimal columns #109

Closed davidtsai closed 9 months ago

davidtsai commented 10 months ago

Problem

When reading parquet files:

Solution

Change summary:

Steps to Verify:

  1. A setup step / beginning state
  2. What to do next
  3. Any other instructions
  4. Expected behavior
  5. Suggestions for testing
davidtsai commented 10 months ago

PR likely still needs tests etc to be mergeable, but contains proof of concept to parse parquet files with large integers/decimals properly.

keen85 commented 10 months ago

@davidtsai, will this PR also bring support for parquet files where TIMESTAMP columns are encoded as INT96? Example file see here: part-00000-43831db6-19d5-4964-a8c8-cb8d6d1664b3-c000.snappy.parquet

davidtsai commented 10 months ago

@keen85 it's halfway there to being able to support it. Right now I'm handling it in our application code:

function parquetInt96DateToLuxon(int96: bigint, timezone?: string) {
  // Extract nanoseconds and Julian day number
  const nanoseconds = int96 & BigInt('0xFFFFFFFFFFFFFFFF'); // first 8 bytes
  const julianDay = int96 >> BigInt(64); // last 4 bytes

  // Julian day number for Unix epoch (January 1, 1970)
  const unixEpochJulianDay = BigInt(2440588);

  // Calculate the difference in days between the Julian day and the Unix epoch
  const daysSinceEpoch = julianDay - unixEpochJulianDay;

  // Convert days to milliseconds
  // 86400000 milliseconds in a day
  const millisecondsSinceEpoch = daysSinceEpoch * BigInt(86400000);

  // Convert nanoseconds to milliseconds and add to the Unix timestamp
  const totalMilliseconds = millisecondsSinceEpoch + (nanoseconds / BigInt(1000000));

  // Create a DateTime object in UTC
  const date = DateTime.fromMillis(Number(totalMilliseconds), { zone: 'utc' });

  if (timezone) {
    // // Convert to the specified timezone
    return date.setZone(timezone, { keepLocalTime: true });
  }
  return date;
}

I'm not sure when we can reliably assume the INT96 column is a date in a parquet file. If there is documented convention for that, would be easy enough to add the above function to this library itself.

keen85 commented 10 months ago

@davidtsai awesome, thanks for your work!

If there is documented convention for that

...not sure if this is proof enough 😅

wilwade commented 10 months ago

@davidtsai Thanks for the PR!

Enabled tests, and seeing some fail, but looking closer, I think they might have been bad tests before perhaps? Or might be a mix. Tests should now run for this PR automatically. (At least I think that's what I told GitHub)

Let me know if you want/need some help, although you likely understand this part of the codebase better than I now.

davidtsai commented 10 months ago

Yes, the test was looking for strings, so they would need to be updated to match numbers and bigints. It is a breaking API change though, but in my opinion the correct behavior to not always return strings. I can work on this PR more in about a week to get it over the finish line if it'll be helpful!

On Thu, Jan 18, 2024 at 11:57 AM Wil Wade @.***> wrote:

@.**** commented on this pull request.

In lib/reader.ts https://github.com/LibertyDSNP/parquetjs/pull/109#discussion_r1457919099 :

@@ -874,9 +874,13 @@ async function decodeDictionaryPage(cursor: Cursor, header: parquet_thrift.PageH }; }

  • return decodeValues(opts.column!.primitiveType!, opts.column!.encoding!, dictCursor, (header.dictionary_page_header!).num_values, opts)
  • .map((d:Array) => d.toString());

I think removing this has caused some of the test errors

— Reply to this email directly, view it on GitHub https://github.com/LibertyDSNP/parquetjs/pull/109#pullrequestreview-1830460046, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJMWIFSZQ5MGLDXQOPSYUTYPF5B5AVCNFSM6AAAAABBPLE3N2VHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMYTQMZQGQ3DAMBUGY . You are receiving this because you were mentioned.Message ID: @.***>

wilwade commented 10 months ago

@davidtsai Javascript is limited to 53 bit numbers, so doesn't it need to be something besides a native number for 96?

wilwade commented 9 months ago

Closing as stale. Please reopen if this is not so.