Closed rouault closed 2 years ago
Thanks! I believe this is a known issue (https://issues.apache.org/jira/browse/ARROW-8228) whose fix has been bumped for a few years. @jorisvandenbossche likely knows more! I imagine we can get this fixed for the 8.0.0 release, since it's important to this use case (I think the fixed-size list type might not have gotten a lot of attention in the past).
That JIRA is about writing, though. @paleolimbot, since you are still using the R package (and thus the Arrow C++ Parquet implementation) to write the actual Parquet file, how did this work?
I think it's an error in the writing (in the sense that files that cannot be read back are written without error), but I'm happy to look into it.
library(geoarrow)
library(arrow, warn.conflicts = FALSE)
geoarrow_example_wkt[["point"]]
#> <wk_wkt[3]>
#> [1] POINT (30 10) POINT EMPTY <null feature>
(geom_arrow <- geoarrow_example_Array("point"))
#> FixedSizeListArray
#> <fixed_size_list<xy: double>[2]>
#> [
#> [
#> 30,
#> 10
#> ],
#> [
#> nan,
#> nan
#> ],
#> null
#> ]
temp <- tempfile()
write_parquet(arrow_table(geom = geom_arrow), temp)
read_parquet(temp)
#> Error: Invalid: Expected all lists to be of size=2 but index 3 had size=0
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/parquet/arrow/reader.cc:624 AssembleArray(std::move(data))
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/parquet/arrow/reader.cc:109 BuildArray(batch_size, out)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/parquet/arrow/reader.cc:1180 ReadColumn(static_cast<int>(i), row_groups, reader.get(), &column)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/parquet/arrow/reader.cc:1161 fut.MoveResult()
Created on 2022-03-22 by the reprex package (v2.0.1)
Ah, it seems the writing only "accidentally" works (or doesn't error) in this example because the null is at the end:
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> arr = pa.array([[1, 2], None, [3, 4]], pa.list_(pa.int64(), 2))
>>> pq.write_table(pa.table({"col": arr}), "test.parquet")
...
ArrowNotImplementedError: Lists with non-zero length null components are not supported
>>> arr = pa.array([[1, 2], [3, 4], None], pa.list_(pa.int64(), 2))
>>> pq.write_table(pa.table({"col": arr}), "test.parquet")
>>> pq.read_table("test.parquet")
...
ArrowInvalid: Expected all lists to be of size=2 but index 3 had size=0
So in general, I think neither reading nor writing nulls works yet for FixedSizeList.
It seems there is already another JIRA open for the reading side as well: https://issues.apache.org/jira/browse/ARROW-9796
In an OGR Parquet driver I'm developing, I can reproduce the same issue with a NULL Point. It seems that the Arrow C++ library doesn't correctly handle writing (or reading? I'm not sure which side is broken) a NULL entry for a FixedSizeList in the Parquet format (this works correctly for Feather). The workaround I found is to write a POINT EMPTY instead of a NULL entry.
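For anyone wanting to apply that workaround on the Python side, here is a minimal sketch, assuming points are held as plain Python lists with `None` for a NULL entry. The helper name `nulls_to_empty_points` is hypothetical and not part of pyarrow, geoarrow, or OGR; it just swaps each NULL for a NaN-filled point, which is how POINT EMPTY shows up in the R output above.

```python
import math


def nulls_to_empty_points(points, dims=2):
    """Hypothetical helper: replace None entries with a NaN-filled point.

    A NaN-filled fixed-size list is the POINT EMPTY encoding seen in the
    reprex above, so no entry in the result is null.
    """
    cleaned = []
    for point in points:
        if point is None:
            # NULL feature -> POINT EMPTY (all coordinates NaN)
            cleaned.append([math.nan] * dims)
        else:
            if len(point) != dims:
                raise ValueError(f"expected {dims} coordinates, got {len(point)}")
            cleaned.append([float(v) for v in point])
    return cleaned


points = [[30.0, 10.0], None, [1.0, 2.0]]
cleaned = nulls_to_empty_points(points)
```

The cleaned list can then be passed to `pa.array(cleaned, pa.list_(pa.float64(), 2))` and written as in the snippets above. Note that this loses the distinction between a NULL feature and a genuine POINT EMPTY, so it is only suitable where that difference doesn't matter.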