When reading a binary column, such as the spatial column of a GeoParquet dataset, the values are corrupted in a way that cannot be reversed.
Steps to reproduce
The data-multipolygon-encoding_wkb.parquet test file for GeoParquet has binary columns that correspond to the values in the text document data-multipolygon-wkt.csv.
The second geometry column value should have the binary equivalent of this geometry:
Instead, the decimal value 45 comes out as 45.9765625 when read through this package. This happens to multiple values; taking 45 as an example:
Correct hex value for coordinate pair (45, 40):
00000000 00804640 00000000 00004440
Actual hex value for coordinate pair:
00000000 00fd4640 00000000 00004440
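To make the effect of that single byte concrete, here is a minimal sketch that decodes both hex dumps as pairs of doubles. It assumes the coordinates are stored as little-endian IEEE 754 doubles, which is consistent with the byte patterns shown; the helper name `decode_pair` is made up for this illustration and is not part of the package.

```python
import struct


def decode_pair(hex_text):
    """Decode two little-endian IEEE 754 doubles from a spaced hex dump."""
    raw = bytes.fromhex(hex_text.replace(" ", ""))
    return struct.unpack("<2d", raw)


# Hex dumps copied from the report above.
expected = decode_pair("00000000 00804640 00000000 00004440")
actual = decode_pair("00000000 00fd4640 00000000 00004440")

print(expected)  # (45.0, 40.0)
print(actual)    # (45.9765625, 40.0)
```

Only the x coordinate differs, and the difference comes entirely from the one byte that changed.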
Notice the value change from 80 -> fd, which happens because the package is translating binary values to strings and hex value 80 is outside the ASCII range (it is a UTF-8 continuation byte, which is not valid on its own).
Expected behaviour
The output binary should match the on-disk binary content for the column.
Actual behaviour
The values are getting modified because the Parquet file uses a dictionary page, and this package is converting all dictionary page values to strings. The decodeDictionaryPage function has:
Notice the second line that maps all values to strings. Because converting binary to UTF-8 and back is lossy, there is no way to recover the true data.
If I comment this line out, the GeoParquet file loads correctly.
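The lossy round trip can be reproduced in isolation. The sketch below is a Python stand-in, not the package's code. It assumes one plausible mechanism: the bytes are decoded as UTF-8 with invalid sequences replaced by U+FFFD, and the string is later turned back into bytes by keeping the low 8 bits of each code point. The function name `round_trip` is hypothetical.

```python
def round_trip(data):
    """Decode bytes as UTF-8 (replacing invalid input), then re-encode bytewise.

    Assumption: the re-encoding keeps only the low 8 bits of each code point,
    so the replacement character U+FFFD becomes the single byte 0xFD.
    """
    text = data.decode("utf-8", errors="replace")
    return bytes(ord(ch) & 0xFF for ch in text)


original = bytes.fromhex("0000000000804640")  # 45.0 as a little-endian double
damaged = round_trip(original)

print(original.hex())  # 0000000000804640
print(damaged.hex())   # 0000000000fd4640

# Distinct invalid bytes collapse to the same output, so the original
# value cannot be recovered from the damaged one.
print(round_trip(b"\x80") == round_trip(b"\x81"))  # True
```

Under these assumptions the output matches the 80 -> fd change seen in the file, and the last line shows why the damage is irreversible: different input bytes map to the same result.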
After diagnosing this issue, I noticed that #121 seems to be related.