geoarrow / geoarrow-r

Extension types for geospatial data for use with 'Arrow'
http://geoarrow.org/geoarrow-r/
Apache License 2.0
148 stars, 6 forks

point-default.parquet is not readable with pyarrow / arrow C++ #3

Closed rouault closed 2 years ago

rouault commented 2 years ago
>>> import pyarrow.parquet as pq
>>> pq.read_table('inst/example_parquet/point-default.parquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/even/arrow/cpp/build/myvenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1996, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/home/even/arrow/cpp/build/myvenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1831, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 323, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2311, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Expected all lists to be of size=2 but index 3 had size=0

With an OGR Parquet driver I'm developing, I can reproduce the same issue with a NULL Point. It seems that the Arrow C++ library doesn't correctly handle writing (or reading? I'm not sure which side is broken) a NULL entry for a FixedSizeList in the Parquet format (this works correctly for Feather). The workaround I found is to write a POINT EMPTY instead of a NULL entry.
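A minimal sketch of that workaround in plain Python, before any Arrow array is built. It assumes points are held as coordinate lists with None for a NULL feature, and that POINT EMPTY is encoded as all-NaN coordinates (the encoding the reprex further down this thread prints for POINT EMPTY). The helper name `fill_null_points` is hypothetical and not part of geoarrow, pyarrow, or GDAL.

```python
import math


def fill_null_points(points, dim=2):
    """Return a copy of `points` where each None (NULL feature) is
    replaced by an all-NaN coordinate list standing in for POINT EMPTY.

    Hypothetical helper for illustration only; `dim` is the fixed list
    size (2 for xy).
    """
    filled = []
    for p in points:
        if p is None:
            # Assumed encoding: POINT EMPTY as [nan, nan]
            filled.append([math.nan] * dim)
        else:
            filled.append(list(p))
    return filled


# Example input: one regular point, one NULL feature, one more point
points = [[30.0, 10.0], None, [1.0, 2.0]]
filled = fill_null_points(points)
```

The trade-off is that a NULL feature and POINT EMPTY can no longer be told apart after the round trip, so this only suits data where that distinction does not matter.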

paleolimbot commented 2 years ago

Thanks! I believe this is a known issue (https://issues.apache.org/jira/browse/ARROW-8228), whose fix has been bumped for a few years... @jorisvandenbossche likely knows more! I imagine we can get this fixed for the 8.0.0 release since it's important to this project (I think the fixed-size list type might not have gotten a lot of attention in the past).

jorisvandenbossche commented 2 years ago

That JIRA is about writing, though. @paleolimbot, since you are still using the R package (and thus the Arrow C++ Parquet implementation) to write the actual Parquet file, how did this work?

paleolimbot commented 2 years ago

I think it's an error in the writing (in the sense that files that cannot be read are written without error), but I'm happy to look into it.

library(geoarrow)
library(arrow, warn.conflicts = FALSE)

geoarrow_example_wkt[["point"]]
#> <wk_wkt[3]>
#> [1] POINT (30 10)  POINT EMPTY    <null feature>
(geom_arrow <- geoarrow_example_Array("point"))
#> FixedSizeListArray
#> <fixed_size_list<xy: double>[2]>
#> [
#>   [
#>     30,
#>     10
#>   ],
#>   [
#>     nan,
#>     nan
#>   ],
#>   null
#> ]

temp <- tempfile()
write_parquet(arrow_table(geom = geom_arrow), temp)
read_parquet(temp)
#> Error: Invalid: Expected all lists to be of size=2 but index 3 had size=0
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/parquet/arrow/reader.cc:624  AssembleArray(std::move(data))
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/parquet/arrow/reader.cc:109  BuildArray(batch_size, out)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/parquet/arrow/reader.cc:1180  ReadColumn(static_cast<int>(i), row_groups, reader.get(), &column)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/parquet/arrow/reader.cc:1161  fut.MoveResult()

Created on 2022-03-22 by the reprex package (v2.0.1)

jorisvandenbossche commented 2 years ago

Ah, it seems the writing only "accidentally" works (or doesn't error) in this example because the null is at the end:

>>> import pyarrow as pa
>>> import pyarrow.parquet as pq

>>> arr = pa.array([[1, 2], None, [3, 4]], pa.list_(pa.int64(), 2))
>>> pq.write_table(pa.table({"col": arr}), "test.parquet")
...
ArrowNotImplementedError: Lists with non-zero length null components are not supported

>>> arr = pa.array([[1, 2], [3, 4], None], pa.list_(pa.int64(), 2))
>>> pq.write_table(pa.table({"col": arr}), "test.parquet")
>>> pq.read_table("test.parquet")
...
ArrowInvalid: Expected all lists to be of size=2 but index 3 had size=0

So in general, neither reading nor writing nulls works yet for FixedSizeList, I think.

jorisvandenbossche commented 2 years ago

It seems there is already another JIRA open for the reading side as well: https://issues.apache.org/jira/browse/ARROW-9796