LibertyDSNP / parquetjs

Fully asynchronous, pure JavaScript implementation of the Parquet file format with additional features
MIT License
55 stars 25 forks source link

bug?: reading decimal output string #121

Closed vmarchaud closed 2 months ago

vmarchaud commented 8 months ago

Thanks for reporting an issue!

Steps to reproduce

'use strict';
const chai = require('chai');
const assert = chai.assert;
const parquet = require('../parquet');
const path = require('path');

describe('decimal encoding', async function() {
  it('should works', async function() {
    let reader =  await parquet.ParquetReader.openFile(path.resolve(__dirname,'test-files/decimal.parquet'));
    let cursor = reader.getCursor(['age', 'id', 'full_name']);
    let records = [];
    let record = null

    while (record = await cursor.next()) {
      records.push(record);
    }
    assert.deepEqual(records,[{
      id: '10',
      age: 42,
      full_name: 'Jonathan Cohen'
    },
    {
      id: '11',
      age: 3,
      full_name: 'Joseph Hazan'
    }]);
  });
});

Schema:

image

File: e2e_datasources.bigquery_test_c40ff3c5-03f4-4213-9d52-fc62e71af0ed_1710089629994_file-000000000000.parquet.gz

Expected behaviour

We should decode age as number or at least as a buffer

Actual behaviour

AssertionError: expected [ { id: '10', …(2) }, …(1) ] to deeply equal [ { id: '10', age: 42, …(1) }, …(1) ]
      + expected - actual

       [
         {
      -    "age": "\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\t�e$\u0000"
      +    "age": 42
           "full_name": "Jonathan Cohen"
           "id": "10"
         }
         {
      -    "age": "\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000��^\u0000"
      +    "age": 3
           "full_name": "Joseph Hazan"
           "id": "11"
         }
       ]

Any other comments?

I'm not familiar with parquet encodings, actually started working with it this afternoon so i might be doing something wrong. I would expect to have the number decoded from decimal however i've seen in other tests that since decimal are encoded as a FIXED_LEN_BYTE_ARRAY in my case it should be decoded as a buffer but that's not the case either.

wilwade commented 8 months ago

First off, thanks for the test and including the file!

I would expect those to come out as buffers as well right now. They are FIXED_LEN_BYTE_ARRAY under the hood.

The other output option would be strings as JS only supports up to 53 bit numbers.

Looks like the issue is because this file uses a dictionary and dictionaries get a "toString" (wrongly) applied: https://github.com/LibertyDSNP/parquetjs/blame/91fc71f262c699fdb5be50df2e0b18da8acf8e19/lib/reader.ts#L948

However removing that looks like it causes some other tests to fail, so some version of that is needed for some values.

All the failing tests however are in the test-files.js test, so perhaps some of them are wrong? I might be able to take a deeper look in a few weeks, but perhaps that is enough that you can find the deeper issue faster than I will be able to.

harryalaw commented 2 months ago

I've started to have a look into the FIXED_LEN_BYTE_ARRAY handling as I've been looking into using this library and found the same issue with decimal parsing.

If you remove that toString call on the dictionary decoding the test failures seem to be just in the test-files.js:

I'm not familiar with the parquet format details and what the distinction is between the DATA_PAGE, DATA_PAGE_V2 and DICTIONARY_PAGE pagetypes, so I'm not sure whether the other routes were returning non-string data. I'd imagine it's a breaking change to update how the data is returned for this dictionary page case though, and what the wider impacts would be to any of the sample file parsing.

If getting rid of the toString conversion seems like the way to go, I'm happy to raise a PR that updates some of these tests to use the underlying type to see the change.

It could be worth adding an equivalent test like the test.parquet that checks the data gets returned as the expected types, but I'd need to figure out how to make a parquet file that uses a dictionary for that.

harryalaw commented 2 months ago

Did a bit more digging, and saw the parquet-testing files. It looks like this library is decoding parquet files with dictionary pages differently to non-dictionary pages.

If you compare alltypes_tiny_pages_plain.parquet and alltypes_tiny_pages.parquet, one gets returned with string values and the other gets returned with non-string values.

I just added a console.log(record) to the read-all.test.ts after filtering down to these two files: The first record from alltypes_tiny_pages.parquet

{
  id: 122,
  bool_col: true,
  tinyint_col: '2',
  smallint_col: '2',
  int_col: '2',
  bigint_col: '20',
  float_col: '2.200000047683716',
  double_col: '20.2',
  date_string_col: '01/13/09',
  string_col: '2',
  timestamp_col: '3725410000000',
  year: '2009',
  month: '1'
}

The first record from alltypes_tiny_pages_plain.parquet

{
  id: 122,
  bool_col: true,
  tinyint_col: 2,
  smallint_col: 2,
  int_col: 2,
  bigint_col: 20n,
  float_col: 2.200000047683716,
  double_col: 20.2,
  date_string_col: '01/13/09',
  string_col: '2',
  timestamp_col: 3725410000000,
  year: 2009,
  month: 1
}

And looks like the former included a dictionary page and the latter did not.

And then re-running it without that toString mapping makes them give the same result.

wilwade commented 2 months ago

Should be fixed in https://github.com/LibertyDSNP/parquetjs/releases/tag/v1.8.0