LibertyDSNP / parquetjs

Fully asynchronous, pure JavaScript implementation of the Parquet file format with additional features
MIT License

Feat/support aws s3 v3 #115

Closed. shannonwells closed this 10 months ago

shannonwells commented 10 months ago

Problem

Support AWS S3 V3 streams while retaining support for V2. V2 may be removed later.

Closes #32

With @wilwade, @pfrank13

Solution

Branch when the stream looks like an AWS V3 stream and handle it accordingly. I mostly used @pfrank13's workaround code.
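As a rough illustration of that branching, here is a minimal sketch, not the code from this PR. It assumes that a V2 `getObject(...).promise()` response carries its `Body` as a `Buffer`, while a V3 `GetObjectCommand` response in Node carries a `Readable` stream that has to be drained first. The helper names `streamToBuffer` and `bodyToBuffer` are hypothetical.

```javascript
const { Readable } = require('node:stream');

// Hypothetical helper: drain a readable stream into a single Buffer.
async function streamToBuffer(stream) {
  const chunks = [];
  for await (const chunk of stream) {
    // Chunks may arrive as Buffers, Uint8Arrays, or strings.
    chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk));
  }
  return Buffer.concat(chunks);
}

// Hypothetical helper: normalize an S3 response body to a Buffer.
// Assumption: V2-style bodies are already Buffers; V3-style bodies
// (in Node) are Readable streams.
async function bodyToBuffer(body) {
  if (Buffer.isBuffer(body)) {
    return body; // V2-style: nothing to do
  }
  if (body instanceof Readable) {
    return streamToBuffer(body); // V3-style: collect the stream
  }
  throw new TypeError('Unsupported S3 response body type');
}
```

With helpers like these, the rest of the reader could keep working on Buffers regardless of which SDK version produced the response.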

Change summary:

Steps to Verify:

  1. All tests should pass (you can do a red/green check by temporarily taking the changes out if you like).
  2. Verify with a real AWS V3 stream that it works, assuming you have an S3 bucket with credentials. Example code:
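import { S3Client } from '@aws-sdk/client-s3';
import { ParquetReader } from '@dsnp/parquetjs';

const main = async () => {
  // Placeholder region and credentials: substitute your own.
  const s3 = new S3Client({
    region: 'us-west-1',
    credentials: {
      accessKeyId: 'asdfkldfsjlfdsjkl',
      secretAccessKey: 'dsfjkfsdjklfsjkl',
    },
  });
  // Placeholder bucket and object key: substitute your own.
  const Bucket = 'foo';
  const Key = 'bar.parquet';

  // Open the Parquet file through the V3 client.
  const reader = await ParquetReader.openS3(s3, { Key, Bucket });

  // Print the file metadata read from the Parquet footer.
  console.log(reader.envelopeReader?.metadata);

  await reader.close();
};

main().catch(console.error).finally(() => process.exit());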

You should see output like:

{
  version: 1,
  schema: [
    {
      type: null,
      type_length: null,
      repetition_type: null,
      name: 'm',
      num_children: 4,
      converted_type: null,
      scale: null,
      precision: null,
      field_id: null,
      logicalType: null
    },
    {
      type: 1,
      type_length: null,
      repetition_type: 1,
      name: 'nation_key',
      num_children: null,
      converted_type: null,
      scale: null,
      precision: null,
      field_id: null,
      logicalType: null
    },
    {
      type: 6,
      type_length: null,
      repetition_type: 1,
      name: 'name',
      num_children: null,
      converted_type: null,
      scale: null,
      precision: null,
      field_id: null,
      logicalType: null
    },
    {
      type: 1,
      type_length: null,
      repetition_type: 1,
      name: 'region_key',
      num_children: null,
      converted_type: null,
      scale: null,
      precision: null,
      field_id: null,
      logicalType: null
    },
    {
      type: 6,
      type_length: null,
      repetition_type: 1,
      name: 'comment_col',
      num_children: null,
      converted_type: null,
      scale: null,
      precision: null,
      field_id: null,
      logicalType: null
    }
  ],
  num_rows: { buffer: <Buffer 00 00 00 00 00 00 00 19>, offset: 0 },
  row_groups: [
    {
      columns: [Array],
      total_byte_size: [Object],
      num_rows: [Object],
      sorting_columns: null,
      file_offset: null,
      total_compressed_size: null,
      ordinal: null
    }
  ],
  key_value_metadata: null,
  created_by: 'parquet-mr',
  column_orders: null,
  encryption_algorithm: null,
  footer_signing_key_metadata: null
}
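The `num_rows` field above prints as a `{ buffer, offset }` wrapper rather than a plain number. A minimal sketch of decoding it, assuming the wrapper holds a big-endian signed 64-bit integer starting at `offset` (the helper name `int64ToBigInt` is hypothetical):

```javascript
// Hypothetical helper: decode a { buffer, offset } 64-bit integer wrapper.
// Assumption: the 8 bytes starting at `offset` are a big-endian signed int64.
function int64ToBigInt(value) {
  return value.buffer.readBigInt64BE(value.offset);
}

// The bytes shown in the sample output: 00 00 00 00 00 00 00 19
const numRows = {
  buffer: Buffer.from([0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x19]),
  offset: 0,
};

console.log(int64ToBigInt(numRows)); // prints 25n
```

Under that assumption, the sample file contains 25 rows.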