apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++] Parquet writes broken file or incorrect data when nullable=False #31329

Open asfimport opened 2 years ago

asfimport commented 2 years ago

When writing a pyarrow table to Parquet with a provided schema, if the schema contains a field with nullable=False but the data contains an actual null value, the resulting Parquet file either cannot be read back or returns incorrect data:

    # Arrow Columnar Format doesn't care if a non-nullable field holds a null
    t_out = pa.table([['0', '1', None, '3', '4']], schema=schema) # OK
    t_out.validate(full=True) # OK
    t_out.cast(schema, safe=True) # OK

    # Parquet writing does not raise, but silently kills the null string
    # because of the REQUIRED-ness of the field in the schema.
    # Then you either cannot read the parquet back, or the returned data
    # is invented, depending on the written row_group_size.
    pq.write_table(t_out, where='pq_1', row_group_size=1)
    pq.read_table('pq_1')
    # -> OSError: Unexpected end of stream

    pq.write_table(t_out, where='pq_2', row_group_size=2)
    pq.read_table('pq_2')
    # -> OSError: Unexpected end of stream
    # -> or sometimes: pyarrow.lib.ArrowInvalid: Index not in dictionary bounds

    pq.write_table(t_out, where='pq_3', row_group_size=3)
    print(pq.read_table('pq_3')[field_name])
    # -> [["0","1","0"],["3","4"]]

    pq.write_table(t_out, where='pq_4', row_group_size=4)
    print(pq.read_table('pq_4')[field_name])
    # -> [["0","1","3","0"],["4"]]

    pq.write_table(t_out, where='pq_5', row_group_size=5)
    print(pq.read_table('pq_5')[field_name])
    # -> [["0","1","3","4","0"]]

Reporter: Rácz Dániel

Note: This issue was originally created as ARROW-15899. Please see the migration documentation for further details.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Hmm, we should probably check the nullable flag when validating (currently we don't).

asfimport commented 2 years ago

Rácz Dániel: Hi, is there any chance that this bug will get fixed anytime soon?