[Python] csv.ConvertOptions Do Not Pass Through/Retain Nullability from Schema

asfimport commented 4 years ago

Originally mentioned in: https://github.com/apache/arrow/issues/6243

High level description of the issue:

It is possible (though not documented) to assign a Schema object, instead of a Dict[str, DataType], to the column_types field of ConvertOptions.
Expected result: both the type and the nullable attribute of each Field in the supplied Schema are present on the schema of the table read from the CSV data.
Actual result: the Field type information is present, but nullable is lost. All fields are nullable.

Use case notes: this is especially noticeable when using pyarrow as a means to save data with a known schema to Parquet, because ParquetWriter checks that the schema of a table being written matches the schema supplied to the writer. If that same schema is used to read the CSV data and contains a non-nullable field, a mismatch is detected, resulting in the error demonstrated below.

Minimal reproduction case:


$ cat test.csv 
0
1
$ python
>>> import pyarrow
>>> schema = pyarrow.schema([pyarrow.field(name="foo", type=pyarrow.bool_(), nullable=False)])
>>> from pyarrow import csv
>>> read_options = csv.ReadOptions(column_names=["foo"])
>>> convert_options = csv.ConvertOptions(column_types=schema)
>>> table = csv.read_csv("test.csv", convert_options=convert_options, read_options=read_options)
>>> schema
foo: bool not null
>>> table.schema
foo: bool
>>> from pyarrow import parquet as pq
>>> writer = pq.ParquetWriter("test.parquet", schema)
>>> writer.write_table(table)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "(REDACTED)/lib/python3.7/site-packages/pyarrow-0.15.1-py3.7-macosx-10.9-x86_64.egg/pyarrow/parquet.py", line 472, in write_table
    raise ValueError(msg)
ValueError: Table schema does not match schema used to create file: 
table:
foo: bool vs. 
file:
foo: bool not null
>>> pyarrow.__version__
'0.15.1'
>>> exit()
$ python --version
Python 3.7.4

As a side note: if I don't set column_names in read_options when calling read_csv, but I do pass convert_options with column_types set, type inference is still performed, which seems like a bug given what the docs state. That seems like a possibly related but independent bug. I haven't searched yet to see whether it is an open/known issue, but if someone reading this believes it should be filed with a repro case, I am happy to help! I only noticed this when minimizing the repro case, as my original code was setting column_names.

Potential source of issue:

I have not yet looked at how hard it is to fix, but I note that here only the name and type are passed down from a Field.

Environment: Reproduced on Ubuntu 18.04 and OSX Catalina in Python 3.7.4.
Reporter: Tim Lantz

Note: This issue was originally created as ARROW-7655. Please see the migration documentation for further details.

asfimport commented 4 years ago

Tim Lantz: Re: my side note above, I filed https://issues.apache.org/jira/browse/ARROW-7656 as well. I see that ARROW-6536 discusses why both need to be set in the C++ API, and that makes perfect sense, so this is just a documentation issue.

asfimport commented 4 years ago

Joris Van den Bossche / @jorisvandenbossche: Currently, I think the column_types option is only meant to specify the types, while nullability is part of the Field in a Schema, and is not a fundamental property of the type itself.

[Python] csv.ConvertOptions Do Not Pass Through/Retain Nullability from Schema #23903