LibertyDSNP / parquetjs

Fully asynchronous, pure JavaScript implementation of the Parquet file format with additional features
MIT License
55 stars 25 forks source link

Fails to correctly read binary data from dictionary pages #131

Closed jwoyame closed 2 months ago

jwoyame commented 5 months ago

When reading a binary column, like the spatial column of a GeoParquet dataset, the values will be damaged in a way that cannot be reversed.

Steps to reproduce

The data-multipolygon-encoding_wkb.parquet test file for GeoParquet has binary columns that correspond to the values in the text document data-multipolygon-wkt.csv.

The second geometry column value should have the binary equivalent of this geometry:

MULTIPOLYGON (((30 20, 45 40, 10 40, 30 20)), ((15 5, 40 10, 10 20, 5 10, 15 5)))

Instead, the decimal value 45 comes out as 45.9765625 when read through this package. This happens to multiple values, but taking 45 as an example,

Correct hex value for coordinate pair (45, 40):

00000000 00804640 00000000 00004440

Actual hex value for coordinate pair: 00000000 00fd4640 00000000 00004440

Notice the value change from 80 -> fd, which happens because the package is translating binary values to strings and hex value 80 is outside the ASCII range (it happens to be the continuation byte).

Expected behaviour

The output binary should match the on-disk binary content for the column.

Actual behaviour

The values are getting modified because the Parquet file uses a dictionary page, and this package is converting all dictionary page values to strings. The decodeDictionaryPage function has:

    return decodeValues(opts.column.primitiveType, opts.column.encoding, dictCursor, (header.dictionary_page_header).num_values, opts)
        .map((d) => d.toString());

Notice the second line that maps all values to strings. Because converting binary to UTF-8 and back is lossy, there is no way to recover the true data.

If I comment this line out, the GeoParquet file loads correctly.

After diagnosing this issue, I noticed that #121 seems to be related.

eawooten commented 2 months ago

We are also experiencing this issue in our implementation of the library here. Thanks for the work being done.

wilwade commented 2 months ago

Should be fixed in https://github.com/LibertyDSNP/parquetjs/releases/tag/v1.8.0