apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

Misleading message when loading parquet data with invalid null data #33601

Open asfimport opened 1 year ago

asfimport commented 1 year ago

I'm saving an Arrow table to Parquet. One column is a list of structs whose elements are marked as non-nullable. But the data isn't valid, because I've put a null in one of the nested fields.

When I save this data to Parquet and try to load it back, I get a very misleading message:


 Length spanned by list offsets (2) larger than values array (length 1)

I would rather Arrow complained when creating the table or when saving it to Parquet.

Here's how to reproduce the issue:


import io

import pyarrow as pa
import pyarrow.parquet as pq

struct = pa.struct(
    [
        pa.field("nested_string", pa.string(), nullable=False),
    ]
)

schema = pa.schema(
    [pa.field("list_column", pa.list_(pa.field("item", struct, nullable=False)))]
)
table = pa.table(
    {"list_column": [[{"nested_string": ""}, {"nested_string": None}]]}, schema=schema
)
with io.BytesIO() as file:
    pq.write_table(table, file)
    file.seek(0)
    pq.read_table(file) # Raises pa.ArrowInvalid

 

Reporter: &res / @0x26res

Note: This issue was originally created as ARROW-18439. Please see the migration documentation for further details.

asfimport commented 1 year ago

&res / @0x26res: As a general comment, it is quite easy to create data that is invalid in terms of nullability in Arrow. In the example above I was able to create a table where the nullability of the fields is not respected.

And, this would pass:


table.validate(full=True) 

But this would throw ArrowInvalid:


table.cast(table.schema)