LibertyDSNP / parquetjs

Fully asynchronous, pure JavaScript implementation of the Parquet file format with additional features
MIT License

bug?: reading decimal output string #121

Open vmarchaud opened 3 months ago

vmarchaud commented 3 months ago


Steps to reproduce

'use strict';
const chai = require('chai');
const assert = chai.assert;
const parquet = require('../parquet');
const path = require('path');

describe('decimal encoding', function() {
  it('should work', async function() {
    const reader = await parquet.ParquetReader.openFile(path.resolve(__dirname, 'test-files/decimal.parquet'));
    const cursor = reader.getCursor(['age', 'id', 'full_name']);
    const records = [];
    let record = null;

    while (record = await cursor.next()) {
      records.push(record);
    }
    assert.deepEqual(records,[{
      id: '10',
      age: 42,
      full_name: 'Jonathan Cohen'
    },
    {
      id: '11',
      age: 3,
      full_name: 'Joseph Hazan'
    }]);
  });
});

Schema:

(screenshot of the schema; image not included)

File: e2e_datasources.bigquery_test_c40ff3c5-03f4-4213-9d52-fc62e71af0ed_1710089629994_file-000000000000.parquet.gz

Expected behaviour

The age column should be decoded as a number, or at least as a buffer.

Actual behaviour

AssertionError: expected [ { id: '10', …(2) }, …(1) ] to deeply equal [ { id: '10', age: 42, …(1) }, …(1) ]
      + expected - actual

       [
         {
      -    "age": "\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\t�e$\u0000"
      +    "age": 42
           "full_name": "Jonathan Cohen"
           "id": "10"
         }
         {
      -    "age": "\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000��^\u0000"
      +    "age": 3
           "full_name": "Joseph Hazan"
           "id": "11"
         }
       ]

Any other comments?

I'm not familiar with Parquet encodings (I only started working with the format this afternoon), so I might be doing something wrong. I would expect the number to be decoded from the decimal. However, I've seen in other tests that decimals are encoded as FIXED_LEN_BYTE_ARRAY, so in my case the value should at least be decoded as a buffer, but that's not the case either.
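
For reference, a Parquet DECIMAL stored as FIXED_LEN_BYTE_ARRAY holds the unscaled value as a big-endian two's-complement integer. Below is a minimal sketch of turning such a buffer into a decimal string. `decodeFixedLenDecimal` is a hypothetical helper, not a parquetjs API. The scale of 9 is an assumption (the schema is only available as a screenshot), and the byte shown as a replacement character in the diff above is assumed to have been 0xC7, which is what an unscaled value of 42 * 10^9 would produce.

```javascript
'use strict';

// Hypothetical helper (not part of parquetjs): decode a big-endian
// two's-complement unscaled integer and apply the decimal scale.
function decodeFixedLenDecimal(buf, scale) {
  if (buf.length === 0) {
    throw new RangeError('empty buffer');
  }
  // Accumulate the bytes as an unsigned big-endian integer.
  let unscaled = 0n;
  for (const byte of buf) {
    unscaled = (unscaled << 8n) | BigInt(byte);
  }
  // If the sign bit is set, reinterpret the value as two's complement.
  if (buf[0] & 0x80) {
    unscaled -= 1n << BigInt(buf.length * 8);
  }
  const negative = unscaled < 0n;
  // Left-pad so that at least one digit remains before the decimal point.
  const digits = (negative ? -unscaled : unscaled).toString().padStart(scale + 1, '0');
  const intPart = digits.slice(0, digits.length - scale);
  const fracPart = digits.slice(digits.length - scale);
  return (negative ? '-' : '') + intPart + (scale > 0 ? '.' + fracPart : '');
}

// 16-byte value: 11 zero bytes followed by 0x09C7652400 (assumed, see above).
const age = Buffer.from('0000000000000000000000' + '09c7652400', 'hex');
console.log(decodeFixedLenDecimal(age, 9)); // prints 42.000000000
```

Returning a string avoids losing precision for values that do not fit in a JavaScript number; a caller that knows the value is small can convert it with `Number(...)`.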

wilwade commented 3 months ago

First off, thanks for the test and including the file!

I would expect those to come out as buffers as well right now. They are FIXED_LEN_BYTE_ARRAY under the hood.

The other output option would be strings, since JS numbers can only represent integers exactly up to 53 bits.
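
To illustrate the 53-bit limit, here is a small sketch that returns a number when the integer is exactly representable and falls back to a string otherwise. `toNumberOrString` is a hypothetical name used only for this example; it is not something parquetjs provides.

```javascript
'use strict';

// Hypothetical helper: keep small integers as Numbers, but fall back to a
// decimal string when the value is outside the safe integer range.
function toNumberOrString(unscaled) {
  const max = BigInt(Number.MAX_SAFE_INTEGER); // 2^53 - 1
  if (unscaled >= -max && unscaled <= max) {
    return Number(unscaled);
  }
  return unscaled.toString();
}

console.log(toNumberOrString(42n));            // 42 (a Number)
console.log(toNumberOrString(2n ** 53n + 1n)); // 9007199254740993 (a string)
console.log(Number(2n ** 53n + 1n));           // 9007199254740992 (precision lost)
```

A 16-byte FIXED_LEN_BYTE_ARRAY can hold values far beyond this range, which is why a string (or a BigInt) is the safer general-purpose output.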

It looks like the issue occurs because this file uses a dictionary, and dictionary values get a "toString" (wrongly) applied: https://github.com/LibertyDSNP/parquetjs/blame/91fc71f262c699fdb5be50df2e0b18da8acf8e19/lib/reader.ts#L948

However, removing that call appears to cause some other tests to fail, so some version of it is needed for some values.
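
A short sketch of why stringifying binary dictionary entries is lossy, and one possible shape for a conditional conversion. The input bytes are an assumption (the low five bytes of 42 * 10^9), and `decodeDictionaryEntry` with its `isStringColumn` flag is hypothetical: how parquetjs would actually decide which columns hold strings is not shown here.

```javascript
'use strict';

// Assumed low five bytes of the first "age" value (42 * 10^9 = 0x09C7652400).
const raw = Buffer.from('09c7652400', 'hex');

// Buffer#toString() decodes as UTF-8 by default. 0xC7 starts a two-byte
// sequence but is not followed by a continuation byte, so it is replaced
// with U+FFFD and the original byte is lost.
const asString = raw.toString();
console.log(JSON.stringify(asString)); // prints "\t�e$\u0000"

// The round trip does not restore the original bytes.
console.log(Buffer.from(asString, 'utf8').equals(raw)); // prints false

// Hypothetical sketch: only stringify entries for string-typed columns and
// pass every other dictionary entry through unchanged.
function decodeDictionaryEntry(value, isStringColumn) {
  return isStringColumn && Buffer.isBuffer(value) ? value.toString('utf8') : value;
}

console.log(decodeDictionaryEntry(Buffer.from('Jonathan Cohen'), true)); // prints Jonathan Cohen
console.log(decodeDictionaryEntry(raw, false) === raw);                  // prints true
```

The first printed string has the same form as the tail of the "age" value in the assertion diff, which is consistent with a UTF-8 conversion being applied to the decimal bytes.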

All of the failing tests, however, are in the test-files.js test, so perhaps some of them are wrong? I might be able to take a deeper look in a few weeks, but perhaps this is enough for you to find the underlying issue faster than I will be able to.