When trying to write a pyarrow table to Parquet with a provided schema, where the schema contains a field with nullable=False but the column contains an actual null value, the resulting Parquet file either cannot be read, or its columns get shifted and the whole table becomes inconsistent. The affected column seemingly drops the null value, and the remaining values are pushed together according to the provided row_group_size (starting over from the beginning when the values run out). Different row group sizes therefore lead to different results. This off-by-one problem persists within a single row group; the next row group can be perfectly fine if it contains no null values.
I believe none of these behaviours are intentional, but they are easily overlooked by the user, since one might expect that providing a schema with constraints would lead to at least a warning or (better) an exception when writing the file. The provided validation methods also report no problem in this particular case.
You can find a snippet below demonstrating this behaviour.
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq

field_name = 'a_string'
schema = pa.schema([
    pa.field(name=field_name, type=pa.string(), nullable=False)  # not nullable
])

# The Arrow Columnar Format doesn't care if a non-nullable field holds a null
t_out = pa.table([['0', '1', None, '3', '4']], schema=schema)  # OK
t_out.validate(full=True)  # OK
t_out.cast(schema, safe=True)  # OK

# Parquet writing does not raise, but silently kills the null string
# because of the REQUIRED-ness of the field in the schema.
# Then you either cannot read the parquet back, or the returned data
# is invented, depending on the written row_group_size.
pq.write_table(t_out, where='pq_1', row_group_size=1)
pq.read_table('pq_1')
# -> OSError: Unexpected end of stream

pq.write_table(t_out, where='pq_2', row_group_size=2)
pq.read_table('pq_2')
# -> OSError: Unexpected end of stream
# -> or sometimes: pyarrow.lib.ArrowInvalid: Index not in dictionary bounds

pq.write_table(t_out, where='pq_3', row_group_size=3)
print(pq.read_table('pq_3')[field_name])
# -> [["0","1","0"],["3","4"]]

pq.write_table(t_out, where='pq_4', row_group_size=4)
print(pq.read_table('pq_4')[field_name])
# -> [["0","1","3","0"],["4"]]

pq.write_table(t_out, where='pq_5', row_group_size=5)
print(pq.read_table('pq_5')[field_name])
# -> [["0","1","3","4","0"]]
{code}
Reporter: Rácz Dániel
Related issues:
Note: This issue was originally created as ARROW-15899. Please see the migration documentation for further details.