ZJONSSON / parquetjs

fully asynchronous, pure JavaScript implementation of the Parquet file format
MIT License

Wrong value from a DOUBLE column with repeating values #48

Open muratcorlu opened 4 years ago

muratcorlu commented 4 years ago

I have a Parquet file with 10k records in it. It has 7 string columns and 1 DOUBLE column. When I read this file and convert the records to a SQL query (batch insert), I noticed that at some point in the file the reader starts returning a different value for the double column. My iteration code is very simple:

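    // Assumed setup, not shown in the original report: cursor comes from reader.getCursor().
    let record = null;
    let count = 0;
    let queryData = '';
    while (record = await cursor.next()) {
      count++;
      // Separate row tuples with commas.
      if (queryData) {
        queryData += ',';
      }
      queryData += `("${record.someId}","${record.someId2}","${record.someId3}","${record.someId4}","${record.readDate}",${record.readValue},"${record.unit}")`;
    }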

record.readValue is the double column. The Parquet file was written with parquet-mr version 1.10.1. I couldn't find a clear pattern in the wrong values. Here is a screenshot of a diff between the output of the same Parquet file read with a different reader and with the parquetjs-lite reader.

[image: screenshot of the diff between the two readers' output]

When the same value starts repeating in the actual data, the parquetjs-lite reader starts returning a different value (1542.3070...) than the correct one. That value is not actually random: it is one of the values from the document, but from a different index (somewhere in the previous rows).
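To narrow down where the two readers diverge, a small comparison helper may be useful. The following is only a sketch: `findMismatches` is a hypothetical helper (not part of parquetjs or parquetjs-lite), and it assumes the double column has already been collected into two plain arrays in row order, one from each reader. The sample numbers are made up for illustration. For each differing row it also reports the first row in the reference data that holds the wrong value, which helps confirm the "value from another index" pattern described above.

```javascript
// Hypothetical helper (not part of parquetjs): list the rows where two readers disagree.
// `actual` and `expected` are arrays of numbers in row order.
function findMismatches(actual, expected, limit = 10) {
  const mismatches = [];
  const length = Math.max(actual.length, expected.length);
  for (let i = 0; i < length && mismatches.length < limit; i++) {
    // Object.is treats NaN as equal to NaN; a missing row shows up as undefined.
    if (!Object.is(actual[i], expected[i])) {
      mismatches.push({
        row: i,
        actual: actual[i],
        expected: expected[i],
        // First row in the reference data that holds the wrong value, or -1 if none.
        seenAtRow: expected.indexOf(actual[i]),
      });
    }
  }
  return mismatches;
}

// Made-up sample data: the reference repeats 30.75, the other reader falls back to 20.25.
const expectedValues = [10.5, 20.25, 30.75, 30.75, 30.75];
const actualValues = [10.5, 20.25, 30.75, 20.25, 20.25];
console.log(findMismatches(actualValues, expectedValues));
```

The `limit` parameter keeps the report short on a 10k-row file; the first few mismatches are usually enough to see whether the wrong values line up with the start of a run of repeated values.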

I hope I have explained the issue clearly. I have spent the last 12 hours trying to debug this problem but haven't found a clear cause yet. My only hunch is that it has something to do with repetition levels, but I cannot confirm that. This is currently an issue in our production system. I have even started rewriting this function in Python because of it. I hope this can be addressed properly so that I can return to JavaScript.

garyirick-rga commented 2 years ago

This might be fixed by this PR: https://github.com/ZJONSSON/parquetjs/pull/81