When trying to write a pyarrow table to Parquet with a provided schema, where the schema contains a field with nullable=False but the column contains an actual null value, the resulting Parquet file either cannot be read, or its columns get shifted and the whole table becomes inconsistent. The affected column seemingly drops the null value, and the remaining values are pushed together according to the provided row_group_size (starting over from the beginning when the values run out). Different row group sizes therefore lead to different results. This off-by-one problem persists within a single row group; the next row group can be perfectly fine if it contains no null values.
I believe none of these behaviours are intentional, but they are easily overlooked by the user, since one might expect that providing a schema with constraints would lead to at least a warning or (better) an exception when writing the file. The provided validation methods also report no problem in this particular case.
You can find a snippet below demonstrating this behaviour.
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq

field_name = 'a_string'
schema = pa.schema([
    pa.field(name=field_name, type=pa.string(), nullable=False)  # not nullable
])

# The Arrow Columnar Format doesn't care if a non-nullable field holds a null
t_out = pa.table([['0', '1', None, '3', '4']], schema=schema)  # OK
t_out.validate(full=True)  # OK
t_out.cast(schema, safe=True)  # OK

# Parquet writing does not raise, but silently kills the null string
# because of the REQUIRED-ness of the field in the schema.
# Then you either cannot read the parquet back, or the returned data
# is invented, depending on the written row_group_size.
pq.write_table(t_out, where='pq_1', row_group_size=1)
pq.read_table('pq_1')
# -> OSError: Unexpected end of stream

pq.write_table(t_out, where='pq_2', row_group_size=2)
pq.read_table('pq_2')
# -> OSError: Unexpected end of stream
# -> or sometimes: pyarrow.lib.ArrowInvalid: Index not in dictionary bounds

pq.write_table(t_out, where='pq_3', row_group_size=3)
print(pq.read_table('pq_3')[field_name])
# -> [["0","1","0"],["3","4"]]

pq.write_table(t_out, where='pq_4', row_group_size=4)
print(pq.read_table('pq_4')[field_name])
# -> [["0","1","3","0"],["4"]]

pq.write_table(t_out, where='pq_5', row_group_size=5)
print(pq.read_table('pq_5')[field_name])
# -> [["0","1","3","4","0"]]
{code}
Reporter: Rácz Dániel
Related issues:
Note: This issue was originally created as ARROW-15899. Please see the migration documentation for further details.